L2/00-271

A Few Remarks Concerning the Math Alphanumerics

Murray Sargent III
murrays@microsoft.com
August 11, 2000

This paper summarizes some recent ideas that appeared in email on the use of the math alphanumerics in plain and marked-up text. The first section gives some advantages of having these characters in Unicode and answers the question of how to deal with similar characters that don't appear in the Plane 1 math alphanumerics block. Section 2 explains why it would be nice to have the math alphabetics in computational programming languages such as C. Section 3 discusses the use of the math alphanumerics in markup languages such as TeX and MathML. The final section discusses the gray line between markup and plain text.

1. Advantages of having the math alphanumerics in Unicode

An advantage of the math alphanumerics being in Unicode is that all search engines can find them, not just XML search engines or other specialized high-level engines. Sometimes people forget (now increasingly often where I work :-) that there's more to the world than XML, however useful XML is.

The idea of limiting math-oriented searches to math-oriented text runs is a neat one that can be applied in XML contexts like MathML or in other higher-level formatting schemes that identify runs of math text with a math attribute. In particular, the math alphanumerics are math by nature, so in searching for them one might be able to improve performance by limiting the search to text runs (or trees) with a math attribute.

One person noted that there isn't any standard in mathematics that associates fonts with meanings. While this is most certainly true, it misses the point. The point is that in a given paper a script L, for example, has a different meaning than an italic L. Different papers often attach completely independent meanings to either.
But in any given paper, it's very desirable to be able to distinguish between the two in a plain-text search so that searching for script L doesn't get an italic L hit by mistake.

A question arose as to how to represent bold symbols that aren't in the math alphanumerics. The answer is to use markup and the appropriate BMP characters. The math alphanumerics are only designed to catch 98% (or more) of math variable names. Markup isn't as convenient, but it's reserved to handle the other 2%.

2. Use of Math Alphabetics in Computer Programs

The math alphanumerics could be used with relatively little implementation effort in variable names in high-level languages. Then if you code up a mathematical formula, the variables that appear in the formula can appear the same in the computer program. A key point is that the compiler should display the desired characters in both edit and debug windows. A preprocessor could translate MathML, for example, into C++, appropriately ASCIIizing the variable names, but it won't be able to make the debug windows use the math-oriented characters unless it can handle the underlying Unicode characters.

The advantages of using the Unicode alphabetic characters in computer program variables are at least threefold:

1) Many formulas in document files can be programmed simply by copying them into a program file and inserting appropriate multiplication dots. This dramatically reduces coding time and errors. This is true for formulas encoded using my plain-text Unicode encoding of mathematics. If you use TeX or MathML, the copy step needs to run more code.
2) The use of the same notation in programs and in the associated journal articles and books leads to an unprecedented level of self-documentation. Typically programs aren't documented as well as they should be, and this "free" documentation is very helpful.
3) In addition to providing useful tools for the present, making programs appear more like the mathematics that they represent should help us figure out how to accomplish the ultimate goal of teaching computers to understand and use arbitrary mathematical expressions.

As an example, for the last six years or so that I was pursuing theoretical laser physics (in the late '80s and early '90s), I used my PS technical word processor as a front end to C++ code that implemented the mathematical formulae of my theories. PS's character set had math script and italic alphabets as well as Greek and many other math symbols. It was incredibly useful to have the same math characters in the C++ programs that I had in my published papers. To code up the formulae, essentially all I had to do was cut & paste mathematical equations from my papers into the C++ programs, add a few implied multiplication dots, and bingo, I was generating graphs illustrating my theories. It's true that the cut & paste operation involved "ASCIIizing" the math alphanumerics and that I didn't try to handle constructions such as integrals and summations (although one could do a good job of this).

At various Unicode conferences, I've shown slides illustrating how the math expressions in my C++ programs looked almost the same as those in my published papers. For the sake of discussion, I'll include some of these slides at the 17th Unicode Conference in San Jose in September, 2000.

This approach to coding my formulae was dramatically faster than the more traditional approach of using variable names like alpha, scriptL, or whatever name one might use to ASCIIize the original mathematical variable names.

In fact, computer programs are an excellent example of where plain-text encodings play important roles. C++ compilers don't understand program code written in XML or HTML, although doing so is an admittedly interesting possible extension, and work is going on to translate MathML expressions into C++ code.
Then you'll be able to cut & paste MathML expressions into C++, but the expressions in the editor and debugger won't look anything like the original mathematics.

One of the neat by-products of using the math alphanumerics in computer languages (if and when that happens) is that you'd see math variables like italic i in both edit and debug modes. One of the biggest disappointments for me when I learned how to program computers back in 1962 was how different the Fortran II code looked from the formulae I coded. With Unicode and the math alphanumerics, that no longer needs to be true. So I highly recommend that the math alphanumerics be allowed in variable names along with lots of math symbols. Java has made a great deal of progress in this area and I really wish the C++ standards committees would do the same.

Most math expressions are amazingly close to viable computer code provided the right notation is used. Whitehead once said that 90% of mathematics is notation and that a perfect notation would be a substitute for thought. We're not there yet, nor will we ever be entirely. But we can be a lot closer. It's pretty incredible.

3. Use of Math Alphanumerics in Markup Languages

With regard to Kent Karlsson's objections about LaTeX going in other directions, that choice belongs to the LaTeX implementers. But from considerable personal experience I can vouch that TeX math texts would be much more readable with the math alphanumerics than TeX texts currently are. With my old PS Word Processor, I did all of my derivations literally on screen, not with pencil and paper. I.e., complicated calculus derivations all on screen! You'd never be able to do this on screen in TeX, since it's too verbose and you couldn't see the forest for the trees. My Unicode conference papers have given examples of TeX with and without Unicode and the math alphanumerics. The relative clarity of the expressions with Unicode and the math alphanumerics is simply stunning.
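
As a small check on the Section 2 recommendation that math alphanumerics be allowed in variable names, here is a minimal sketch in Python 3, a language not discussed in this paper and chosen only because it already accepts such identifiers. The variable names and numeric values are made up for illustration; the formula is a generic damped oscillation, not one taken from any of the papers mentioned above.

```python
import math
import unicodedata

# Greek letters are ordinary identifier characters in Python 3.
α = 0.5              # hypothetical decay rate
ω = 2.0 * math.pi    # hypothetical angular frequency
t = 0.25             # hypothetical time value

# The formula E = exp(-αt) cos(ωt), with the implied multiplications written out.
E = math.exp(-α * t) * math.cos(ω * t)

# Math alphanumerics are accepted too, but Python applies NFKC normalization
# to identifiers, so MATHEMATICAL ITALIC SMALL I (U+1D456) names the same
# variable as the ASCII letter i.
𝑖 = 7
same_variable = (i == 7)

# The same fold, applied explicitly to the character itself.
folded = unicodedata.normalize("NFKC", "\U0001D456")
```

Because of that fold, Python gives the on-screen appearance argued for in Section 2, but it cannot keep an italic i and a plain i as two distinct variables. A language that wanted the distinction would have to compare identifiers without the compatibility normalization.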
Let me finally emphasize that a major reason for including the math alphanumerics in Unicode was to satisfy a requirement of the STIX group of math communities, one of which is MathML, which is XML, not plain text. So we shouldn't say that the main reason for including them is for plain text, and we shouldn't recommend that they not be used in MathML text data.

I think that I and others have shown that the math alphanumerics work for math, although they clearly aren't the only way to represent mathematical variables on computers. One can certainly use markup of various kinds like TeX and XML, and some markup is still needed to handle alphanumeric combinations not included in the proposed Plane 1 block. But there are real advantages to the math alphanumerics, not the least of which is a commonality that transcends markup and plain text.

4. If you can do it with markup, then do it with markup

This maxim is sometimes invoked today with amazing fervor. I'd like to carry it to its logical conclusion and then make a remark about using the math alphanumerics in the XML markup language MathML. These thoughts are pertinent to Unicode Technical Report #20, entitled "Unicode in XML and other Markup Languages".

It's intriguing to note that with enough markup one could use a universal character set that's substantially smaller than Unicode. For example, the only Latin characters would be a-z; upper-case letters, composite characters, and combining-mark sequences in general would be marked-up versions of these base English letters. In mathematics, all variations involving the idea of equality would be marked-up versions of the ASCII equal sign =. So the not-equal sign (U+2260) would be represented by something like = enclosed in negation markup. Unicode currently contains many (Ken can say how many) such symbols and they all could be represented as marked-up versions of the single character code U+003D.
The Indic character sets could largely be described with a single ISCII repertoire appropriately marked up, and Greek could be derived from Latin with appropriate markup (or vice versa to do history some justice). Similarly, Cyrillic could largely be marked-up Latin and Greek, or if Greek is marked-up Latin, then Cyrillic would be just marked-up Latin (some characters would have to have more than one Latin base character). All Korean Hangul symbols can be done with combining Jamos, and most Han characters could be represented as marked-up radicals. Ultimately we'd only have to use, say, a set of 4000 base symbols to represent virtually all living natural language and mathematical characters now and most likely for centuries to come.

The idea is really quite intriguing and, from a research point of view, it's probably worthwhile pursuing. But somehow this pushes a bit too far the gray line of where to put markup relative to using enhanced characters.

So where should one put such a gray line? Currently I suspect that everyone would resist marking up Latin to get Greek and Cyrillic. Many people would point to the problems of marking up Chinese radicals to get Han characters. Most people would resist using markup to represent combining-mark sequences and composite characters. And, sigh, some people think that MathML should resist using the new composite math alphanumerics in favor of the corresponding marked-up Latin and Greek alphabets.

I'd argue that a real purist would use an upper-case attribute instead of upper-case letters (there are obvious advantages to doing so), but would probably allow (I'm conjecturing here, since I'm certainly no purist!) Greek and Cyrillic to be used without treating them as marked-up Latin. However, a real purist would presumably also use markup for all accented characters and probably would use markup for all equality-related concepts in math.
Somehow I don't think many people on this list are so pure (Joe, Rick maybe?); we're all sinners to some degree! But the desirability of using markup is just one virtue that should be balanced against others, such as efficiency, convenience, portability, and compatibility. Markup certainly isn't the only virtuous thing around (although you may have noticed at the recent Forum 2000 that Microsoft "is betting the company" on XML).

From a mathematical software point of view, the math alphanumerics are just as desirable as the not-equal sign: they're all just independent math symbols. The MathML engines that have been implemented to date use these alphanumerics as such (in the PUA) and in general aren't prepared to change over to using corresponding XML tree structures. XML frankly is very verbose when used to mark up objects as small as characters, and MathML already suffers accordingly. Not using the math alphanumerics in MathML would add yet another level of overhead bloat to no particular advantage. So from the points of view of efficiency, convenience, portability, and compatibility, the math alphanumerics are desirable in MathML, which is why the MathML community lobbied for their inclusion in Unicode in the first place.
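
To make the overhead and search points a little more concrete, here is a rough Python sketch that compares a single bold L written as a math alphanumeric with the same token written as marked-up Latin. The helper math_letter and the sample sentence are hypothetical, and the mathvariant attribute (from MathML 2.0) is used only as an assumed example of the marked-up alternative; none of this is taken from an actual MathML engine.

```python
import unicodedata

def math_letter(style: str, letter: str) -> str:
    """Return the math alphanumeric for an ASCII letter (illustrative helper).

    `style` is a word from the Unicode character name, such as "BOLD" or
    "ITALIC". Raises KeyError if no character with that name exists.
    """
    case = "CAPITAL" if letter.isupper() else "SMALL"
    name = f"MATHEMATICAL {style} {case} {letter.upper()}"
    return unicodedata.lookup(name)

bold_L = math_letter("BOLD", "L")      # U+1D40B
italic_L = math_letter("ITALIC", "L")  # U+1D43F

# Hypothetical sentence that uses both letters with different meanings.
text = f"Let {italic_L} be the length and {bold_L} the operator."

# A plain substring search tells the two apart without any markup parser.
italic_pos = text.find(italic_L)
bold_pos = text.find(bold_L)

# The same token written two ways in MathML.
marked_up = '<mi mathvariant="bold">L</mi>'
direct = f"<mi>{bold_L}</mi>"
```

The direct form is shorter and is found by an ordinary text search, while the marked-up form needs a search that understands the attribute. The sketch does not handle the letters that live outside the Plane 1 block under other names (the italic small h, for instance, is the Planck constant character in the BMP), so math_letter raises KeyError for those.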