L2/00-271

A Few Remarks Concerning the Math Alphanumerics

Murray Sargent III
murrays@microsoft.com
August 11, 2000

This paper summarizes some recent ideas that appeared in email on the use of
the math alphanumerics in plain and marked-up text.  The first section gives
some advantages of having these characters in Unicode and answers the
question as to how to deal with similar characters that don't appear in the
Plane 1 math alphanumerics block. Section 2 explains why it would be nice to
have the math alphabetics in computer programming languages such as C.
Section 3 discusses the use of the math alphanumerics in markup languages
such as TeX and MathML.  The final section discusses the gray line between
markup and plain text.


1. Advantages of having the math alphanumerics in Unicode

An advantage of the math alphanumerics being in Unicode is that all search
engines can find them, not just XML search engines or other specialized
high-level engines.  Sometimes people forget (now increasingly often where I
work :-) that there's more to the world than XML, however useful XML is.
Limiting math-oriented searches to math-oriented text runs is a neat idea
that can be applied in XML contexts like MathML or in other higher-level
formatting schemes that identify runs of math text with a math
attribute.  In particular, the math alphanumerics are math by nature and so
in searching for them one might be able to improve performance by limiting
the search to text runs (or trees) with a math attribute.

One person noted that there isn't any standard in mathematics that
associates fonts with meanings. While this is most certainly true, it misses
the point. The point is that in a given paper a script L, for example, has a
different meaning than an italic L.  Different papers often attach
completely independent meanings to either.  But in any given paper, it's
very desirable to be able to distinguish between the two in a plain-text
search so that searching for script L doesn't get an italic L hit by
mistake.
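
As a minimal sketch of this point, the following Python fragment searches a
plain-text string for script L and italic L separately.  The code points are
those of the Unicode standard as eventually published (an assumption
relative to the proposal discussed here), and the sample sentence and the
helper names find_all and fold_style are made up for illustration.

```python
import unicodedata

# Code points as later standardized (assumed here, not taken from the paper).
ITALIC_L = "\U0001D43F"   # MATHEMATICAL ITALIC CAPITAL L
SCRIPT_L = "\u2112"       # SCRIPT CAPITAL L

# Hypothetical sample text containing both letters.
text = (f"Let {SCRIPT_L} be the Lagrangian and {ITALIC_L} the cavity "
        f"length; then {SCRIPT_L} depends on {ITALIC_L}.")

def find_all(haystack, needle):
    """Return the index of every occurrence of needle in haystack."""
    hits = []
    start = haystack.find(needle)
    while start != -1:
        hits.append(start)
        start = haystack.find(needle, start + len(needle))
    return hits

# A plain-text search keeps the two letters apart.
script_hits = find_all(text, SCRIPT_L)
italic_hits = find_all(text, ITALIC_L)

def fold_style(s):
    """Deliberately ignore the style distinction via NFKC folding."""
    return unicodedata.normalize("NFKC", s)
```

A search that wants to ignore the distinction can still do so by folding
both the text and the query with fold_style, so nothing is lost by encoding
the letters separately.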

A question arose as to how to represent bold symbols that aren't in the math
alphanumerics. The answer is to use markup and the appropriate BMP
characters.  The math alphanumerics are only designed to cover 98% (or more)
of math variable names.  Markup isn't as convenient, but it remains
available to handle the other 2%.


2. Use of Math Alphabetics in Computer Programs
 
The math alphanumerics could be used with relatively little implementation
effort in variable names in high-level languages.  Then if you code up a
mathematical formula, the variables that appear in the formula can appear
the same in the computer program. A key point is that the development tools
should display the desired characters in both edit and debug windows. A
preprocessor could translate MathML, for example, into C++, appropriately
ASCIIizing the variable names, but the debug windows won't show the
math-oriented characters unless the tools themselves can handle the
underlying Unicode characters.

The advantages of using the Unicode math alphabetic characters in computer
program variables are at least threefold: 1) Many formulas in document files
can be programmed simply by copying them into a program file and inserting
appropriate multiplication dots.  This dramatically reduces coding time and
errors.  It holds directly for formulas encoded using my plain-text Unicode
encoding of mathematics; if you use TeX or MathML, the copy step needs to
run more code. 2) The use of the same notation in programs and the
associated journal articles and books leads to an unprecedented level of
self-documentation.  Typically programs aren't documented as well as they
should be and this "free" documentation is very helpful. 3) In addition to
providing useful tools for the present, making programs appear more like the
mathematics that they represent should help us figure out how to accomplish
the ultimate goal of teaching computers to understand and use arbitrary
mathematical expressions.

As an example, for the last six years or so that I was pursuing theoretical
laser physics (in the late '80s and early '90s), I used my PS technical word
processor as a front end to C++ code that implemented the mathematical
formulae of my theories.  PS's character set had math script and italic
alphabets as well as Greek and many other math symbols.  It was incredibly
useful to have the same math characters in the C++ programs that I had in my
published papers. To code up the formulae, essentially all I had to do was
to cut & paste mathematical equations from my papers into the C++ programs,
add a few implied multiplication dots, and bingo, I was generating graphs
illustrating my theories. It's true that the cut & paste operation involved
"ASCIIizing" the math alphanumerics and that I didn't try to handle
constructions such as integral and summation (although one could do a good
job of this).  At various Unicode conferences, I've presented slides showing
how the math expressions in my C++ programs looked almost the same as those
in my published papers. For the sake of discussion, I will show some of these
slides at the 17th Unicode Conference in San Jose in September 2000. This
approach to coding my formulae was dramatically faster than the more
traditional approach of using variable names like alpha, scriptL or whatever
name one might use to ASCIIize the original mathematical variable names.  
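
To make the "ASCIIizing" step concrete, here is a rough Python sketch.  It
is not the PS translator; the naming scheme (style_letter, e.g. italic_y),
the helper names ascii_name and asciiize, and the sample formula are all
assumptions made for illustration.  It handles only Latin letters and
digits in the math alphanumerics block and maps the dot operator U+22C5 to
the ASCII multiplication sign.

```python
import unicodedata

# Mathematical Alphanumeric Symbols block as later standardized.
MATH_ALNUM = range(0x1D400, 0x1D800)
DOT_OPERATOR = "\u22C5"

def ascii_name(ch):
    """Hypothetical scheme: 'MATHEMATICAL ITALIC SMALL Y' -> 'italic_y'."""
    name = unicodedata.name(ch, "")
    base = unicodedata.normalize("NFKC", ch)
    if (ord(ch) in MATH_ALNUM and name.startswith("MATHEMATICAL ")
            and base.isascii()):
        style = []
        for word in name.split()[1:]:
            if word in ("CAPITAL", "SMALL", "DIGIT"):
                break
            style.append(word.lower().replace("-", "_"))
        return "_".join(style) + "_" + base
    return ch  # anything else passes through unchanged

def asciiize(expr):
    """ASCIIize every math alphanumeric and the multiplication dot."""
    return "".join("*" if ch == DOT_OPERATOR else ascii_name(ch)
                   for ch in expr)

# Italic y = a . x + b, written with escapes so the source stays ASCII.
formula = "\U0001D466 = \U0001D44E\u22C5\U0001D465 + \U0001D44F"
code_line = asciiize(formula)
```

One design note: Python itself normalizes identifiers with NFKC, so it
would treat an italic x and a plain x as the same variable.  A language
that wants to keep the distinction has to compare the characters as they
are, which is why the sketch builds distinct ASCII names instead.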

In fact, computer programs are an excellent example where plain-text
encodings play important roles.  C++ compilers don't understand program code
written in XML or HTML, although doing so is an admittedly interesting
possible extension and work is going on to translate MathML expressions into
C++ code. Then you'll be able to cut & paste MathML expressions into C++,
but the expressions in the editor and debugger won't look anything like the
original mathematics. One of the neat by-products of using the math
alphanumerics in computer languages (if and when that happens) is that
you'd see math variables like italic i in both edit and debug modes. One of
the biggest
disappointments for me when I learned how to program computers back in 1962
was how different the Fortran II code looked from the formulae I coded.
With Unicode and the math alphanumerics, that no longer needs to be true.

So I highly recommend that the math alphanumerics be allowed in variable
names along with lots of math symbols.  Java has made a great deal of
progress in this area and I really wish the C++ standards committees would
do the same.  Most math expressions are amazingly close to viable computer
code provided the right notation is used.  Whitehead once said that 90% of
mathematics is notation and a perfect notation would be a substitute for
thought.  We're not there yet, nor will we ever be entirely.  But we can be
a lot closer. It's pretty incredible.


3. Use of Math Alphanumerics in Markup Languages

With regard to Kent Karlsson's objections about LaTeX going in other
directions, that's the choice of the LaTeX implementers.  But from
considerable personal experience I can vouch that TeX math texts would be
much more readable with the math alphanumerics than TeX texts currently are.
With my old PS Word Processor, I did all of my derivations literally on
screen, not with pencil and paper.  I.e., complicated calculus derivations
all on screen!  You'd never be able to do this on screen in TeX, since it's
too verbose and you couldn't see the forest for the trees.
My Unicode conference papers have given examples of TeX with and without
Unicode and the math alphanumerics.  The relative clarity of the expressions
with Unicode and the math alphanumerics is simply stunning.

Let me finally emphasize that a major reason for including the math
alphanumerics in Unicode was to satisfy a requirement for the STIX group of
math communities, one of which is MathML, which is XML, not plain text.  So
we shouldn't say that the main reason for including them is for plain text
and we shouldn't recommend that they not be used in MathML text data.

I think that I and others have shown that the math alphanumerics work for
math although they clearly aren't the only way to represent mathematical
variables on computers.  One can certainly use markup of various kinds like
TeX and XML and some markup is still needed to handle alphanumeric
combinations not included in the proposed Plane 1 block.  But there are real
advantages to the math alphanumerics, not the least of which is a
commonality that transcends markup and plain text.


4. If you can do it with markup, then do it with markup

This maxim is used today sometimes with amazing fervor.  I'd like to carry
it to its logical conclusion and then make a remark about using the math
alphanumerics in the XML markup language MathML.  These thoughts are
pertinent to Unicode Technical Report #20 entitled "Unicode in XML and other
Markup Languages".

It's intriguing to note that with enough markup one could use a universal
character set that's substantially smaller than Unicode.  For example, the
only Latin characters would be a-z; upper-case letters, composite
characters, and combining-mark sequences in general would be marked up
versions of these base English letters.

In mathematics, all variations involving the idea of equality would be
marked up versions of the ASCII equal sign =.  So the not-equal sign
(U+2260) would be represented by something like <mathnot>=</mathnot>.
Unicode currently contains many (Ken can say how many) such symbols and they
all could be represented as marked-up versions of the single character code
U+003D.
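
A small Python sketch shows what such an expansion could look like.  The
<mathnot> tag is only the hypothetical markup named above, and the function
name expand_mathnot is invented here.  The sketch relies on the fact that
Unicode defines the not-equal sign as canonically equivalent to the equal
sign followed by U+0338 COMBINING LONG SOLIDUS OVERLAY, so composing that
sequence (NFC) yields U+2260.

```python
import re
import unicodedata

COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"

def expand_mathnot(marked_up):
    """Turn <mathnot>X</mathnot> into the negated symbol, where one exists.

    The tag is hypothetical.  X is followed by U+0338 and the result is
    composed with NFC; symbols without a precomposed negated form simply
    keep the combining overlay.
    """
    expanded = re.sub(
        r"<mathnot>(.)</mathnot>",
        lambda m: m.group(1) + COMBINING_LONG_SOLIDUS_OVERLAY,
        marked_up,
    )
    return unicodedata.normalize("NFC", expanded)

plain = expand_mathnot("a <mathnot>=</mathnot> b")
```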

The Indic character sets could largely be described with a single ISCII
repertoire appropriately marked up, and Greek could be derived from Latin
with appropriate markup (or vice versa to do history some justice).
Similarly Cyrillic could largely be marked up Latin and Greek or if Greek is
marked up Latin, then Cyrillic would be just marked-up Latin (some
characters would have to have more than one Latin base character).
 
All Korean Hangul symbols can be done with combining Jamos and most Han
characters could be represented as marked-up radicals.
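
For Hangul the mapping is purely arithmetic.  The sketch below follows the
composition formula given in the Unicode Standard for conjoining jamos; the
function name compose_hangul and the sample syllable are choices made for
this illustration.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_hangul(lead, vowel, trail=None):
    """Compose a leading consonant, a vowel and an optional trailing
    consonant (all conjoining jamos) into one precomposed syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(trail) - T_BASE if trail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

# HIEUH + A + final NIEUN compose to the syllable HAN.
jamos = "\u1112\u1161\u11AB"
syllable = compose_hangul(*jamos)
```

The result agrees with what normalization form NFC produces for the same
jamo sequence.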

Ultimately we'd only have to use, say, a set of 4000 base symbols to
represent virtually all living natural language and mathematical characters
now and most likely for centuries to come. The idea is really quite
intriguing and from a research point of view, it's probably worth pursuing.
But somehow this pushes a bit too far the gray line of where to put markup
relative to using enhanced characters.

So where should one put such a gray line?  Currently I suspect that everyone
would resist marking up Latin to get Greek and Cyrillic.  Many people would
point to the problems of marking up Chinese radicals to get Han characters.
Most people would resist using markup to represent combining-mark sequences
and composite characters. And, sigh, some people think that MathML should
resist using the new composite math alphanumerics in favor of the
corresponding marked-up Latin and Greek alphabets.

I'd argue that a real purist would use an upper-case attribute instead of
upper-case letters (there are obvious advantages to doing so), but would
probably allow (I'm conjecturing here, since I'm certainly no purist!) Greek
and Cyrillic to be used without treating them as marked-up Latin.  However,
a real purist would presumably also use markup for all accented characters
and probably would use markup for all equality-related concepts in math.

Somehow I don't think many people on this list are so pure (Joe, Rick
maybe?); we're all sinners to some degree!  But the desirability of using
markup is just one virtue that should be balanced against others, such as
efficiency, convenience, portability, and compatibility.  Markup certainly
isn't the only virtuous thing around (although you may have noticed in the
recent Forum 2000 that Microsoft "is betting the company" on XML).

From a mathematical software point of view, the math alphanumerics are just
as desirable as the not-equal sign: they're all just independent math
symbols.  The MathML engines that have been implemented to date use these
alphanumerics as such (in the PUA) and in general aren't prepared to change
over to using corresponding XML tree structures.  XML frankly is very
verbose when used to mark up objects as small as characters and MathML
already suffers accordingly.  Not using the math alphanumerics in MathML
would add yet another level of overhead to no particular advantage. So
from the points of view of efficiency, convenience, portability, and
compatibility, the math alphanumerics are desirable in MathML, which is why
the MathML community lobbied for their inclusion in Unicode in the first
place.
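
The overhead is easy to measure on a single variable.  The sketch below
compares a bold x written as a character with the same variable written
with a style attribute.  The mathvariant attribute comes from later
versions of MathML and is used here only as an assumed stand-in for
"marked-up Latin"; the helper size_in_bytes is invented for the example.

```python
import unicodedata

# One variable, two encodings.  The attribute form is an assumed stand-in
# for marked-up Latin; mathvariant is from later MathML versions.
marked_up = '<mi mathvariant="bold">x</mi>'
direct = "<mi>\U0001D431</mi>"   # MATHEMATICAL BOLD SMALL X

def size_in_bytes(s, encoding):
    """Number of bytes needed to store s in the given encoding."""
    return len(s.encode(encoding))

sizes = {
    encoding: (size_in_bytes(marked_up, encoding),
               size_in_bytes(direct, encoding))
    for encoding in ("utf-8", "utf-16-le")
}
```

In both encodings the attribute form comes out more than twice as large as
the character form, even though the Plane 1 character itself takes four
bytes.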