Unicode plain-text encoding of mathematics

From: Murray Sargent (murrays@microsoft.com)
Date: Sat Oct 07 1995 - 14:20:35 EDT


At the 29-sep-95 UTC meeting, we decided that the Unicode plain-text
math notation is tricky enough that it should be discussed by a
subcommittee consisting of interested individuals before being brought
to the UTC as a whole. Mark Davis and I got the ball rolling with the
simple proposal that follows (with a couple of enhancements here and
there :). Other people who have expressed interest are Asmus Freytag
and Lisa Moore. Please let me know if you want to be involved in this
subcommittee. The following is written without built-up formulas so
that it can go through email easily. Unfortunately that makes it
harder to read, so in the future I'd prefer to switch to Microsoft Word
so that we have a way to display built-up formulas.

Simple Unicode Plain-Text Math Notation

A guiding principle governing the choice of Unicode semantics and
special characters is that "plain text must contain enough information
to permit the text to be rendered legibly, nothing more" (p. 10, The
Unicode Standard, Vol. 1). To this end, Mark Davis and I have come up
with an initial step in developing a mathematics encoding scheme. We
add to the Unicode plain-text legibility principle the desires to be as
simple as possible and to look like ordinary mathematics if possible.

In this spirit, we don't pretend to be able to round-trip an arbitrary
mathematical expression through our plain text. For example, TeX has
fancy-text enhancements that go beyond legibility, such as specifying
various sizes of brackets. On the other hand, our proposed notation is
very much like a simple Unicode version of TeX's mathematics notation.
Background information is given in my paper "Unicode, Rich Text, and
Mathematics" in the proceedings of the Seventh Unicode Conference. In
the following, I summarize what Mark and I came up with at the
29-sep-95 UTC meeting along with some problems that remain to be resolved.

1. Subscripts and Superscripts

The first notation to render legible is that for subscripts and
superscripts. In the discussion, we define the character unit, which is
used in all expressions that have built-up display forms. Currently
Unicode has about 16 subscripts and 16 superscripts (see, e.g., U+2070
through U+208E), which, while quite useful, are clearly not general
enough to satisfy mathematical needs. While fancy-text systems can
readily render at least one level of subscripts or superscripts,
exporting such fancy text to plain text loses the subscript/superscript
information, i.e., makes the text illegible. TeX uses the principle
that the character unit that follows an underscore is to be subscripted
to the character unit that precedes the underscore, e.g., a_2 displays
"a" followed by a subscripted 2. To allow for subscripting more than
one character, the TeX character unit can be either a single character
or any arbitrary expression enclosed in a pair of curly braces, e.g.,
{b + c} is a character unit consisting of "b + c". So a_{b + c}
subscripts "b + c" (without the quotes, of course). TeX superscripts
work the same way, except that the superscript operator ^ is used in
place of the subscript operator _.

We propose three simple changes to this approach:

a) We replace TeX's superscript operator ^ by a dedicated superscript
operator with a display form that looks like a superscripted up arrow.
Similarly, we replace TeX's subscript operator _ by a dedicated
subscript operator that looks like a subscripted down arrow. It's
interesting to note that pre-ASCII versions of TeX used a large private
character set that had such special subscript and superscript operator
characters. Their introduction frees up ^ and _ to represent themselves.

b) We enclose multiple-character subscript and superscript expressions
with parentheses instead of TeX's {}. Just as TeX doesn't display the
{} when the expressions are displayed in built-up form, the parentheses
enclosing an expression immediately following a subscript or
superscript operator are not displayed in built-up form. If a set of
parentheses should nevertheless be displayed, an additional set can be
included within the "outermost" parentheses. This use of parentheses
frees up {} to represent themselves.

c) If the character unit consists of any Unicode space character
(U+0020, U+2002 through U+200B), the preceding operator is displayed in
its simple display form. This helps one in writing text to document
this approach, and it's handy for displaying ordinary size versions of
various operators discussed below.

In summary, for the subscript and superscript operators (and other
operators described below), a character unit can consist of either a
single character or an arbitrary expression enclosed within
parentheses. We see next that the use of ()'s leads to a natural
plain-text mathematical notation for fractions and square roots,
whereas {}'s do not.
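
The character-unit rule above can be sketched in a few lines of Python.
This is only an illustration, not part of the proposal: the function
names read_unit and to_tex_scripts are made up here, ^ and _ stand in
for the dedicated superscript and subscript operators (which have no
assigned code points), and rule c) for space characters is not handled.

```python
def read_unit(text, pos):
    """Return (unit, next_pos) for the character unit starting at text[pos].

    A unit is a single character, or a parenthesized expression whose
    outermost parentheses are dropped (rule 1b).
    """
    if pos >= len(text):
        raise ValueError("expected a character unit")
    if text[pos] != "(":
        return text[pos], pos + 1
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[pos + 1:i], i + 1
    raise ValueError("unbalanced parentheses")


def to_tex_scripts(text):
    """Rewrite plain-text sub/superscripts as TeX, e.g. a_(b + c) -> a_{b + c}."""
    out, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch in "^_" and i + 1 < len(text):
            unit, i = read_unit(text, i + 1)
            # Units may themselves contain scripts, so convert recursively.
            out.append(ch + "{" + to_tex_scripts(unit) + "}")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(to_tex_scripts("a_(b + c)"))   # a_{b + c}
print(to_tex_scripts("a^((b))"))     # a^{(b)}  (inner parentheses survive)
```

The second call shows the "additional set" of parentheses from 1b):
only the outermost pair is consumed by the operator.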

2. Fractions

Unicode's fraction-slash operator (U+2044) exists "for composing
arbitrary fractions" (p. 280, The Unicode Standard, Vol. 1).
Accordingly, we propose that the character unit preceding the
fraction-slash operator be the numerator of a fraction and the
character unit following the fraction-slash operator be the
denominator. As explained in topic 1b), if the character unit consists
of a parenthesized expression, the outermost level of parentheses is
not displayed in built-up form. So the expression (a + b)/c, where /
is the Unicode fraction slash, would have the numerator "a + b" and
denominator "c". Note that (a + b)/c is a conventional mathematical
expression, whereas TeX's equivalent, {a + b \over c}, is not.

One problem with this simple approach is that an expression like a/bc
has the built-up form corresponding to {a \over b}c, whereas according
to usual mathematical convention, it should display as a/(bc), that is,
as {a \over bc}. An advantage of the simple approach is that it
corresponds precisely to TeX, which might aid a bit in converting
between the two. My papers discuss a slightly more complicated
definition of a character unit that corresponds to conventional
mathematical notation. Specifically, I define a character unit in 1b)
with the extension that any span of nonoperator characters also
qualifies as a character unit. So "abc" consists of a single character
unit (quotes not included), while "a + b" consists of 5 character units.
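
To make the difference between the two character-unit definitions
concrete, here is a rough Python sketch. Several things in it are
assumptions for illustration only: the names split_units, operand, and
fraction_to_tex are hypothetical, str.isalnum() stands in for
"nonoperator character" (the actual operator set is still to be
defined), and only the first fraction slash in the text is handled.

```python
FRACTION_SLASH = "\u2044"  # Unicode FRACTION SLASH


def split_units(text, extended=False):
    """Split text into character units, keeping parentheses on groups."""
    units, i = [], 0
    while i < len(text):
        if text[i] == "(":
            # Scan to the matching closing parenthesis.
            depth, j = 0, i
            while j < len(text):
                if text[j] == "(":
                    depth += 1
                elif text[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if depth != 0:
                raise ValueError("unbalanced parentheses")
            j += 1
        elif extended and text[i].isalnum():
            # Extended definition: a span of nonoperator characters.
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
        else:
            j = i + 1
        units.append(text[i:j])
        i = j
    return units


def operand(unit):
    """Drop the outermost parentheses of a grouped unit (rule 1b)."""
    return unit[1:-1] if len(unit) > 1 and unit[0] == "(" else unit


def fraction_to_tex(text, extended=False):
    """Rewrite the first fraction in text using TeX's \\over."""
    units = split_units(text, extended)
    k = units.index(FRACTION_SLASH)  # raises ValueError if there is no slash
    if k == 0 or k == len(units) - 1:
        raise ValueError("fraction slash needs a unit on each side")
    frac = "{" + operand(units[k - 1]) + r" \over " + operand(units[k + 1]) + "}"
    return "".join(units[:k - 1] + [frac] + units[k + 2:])


print(fraction_to_tex("a\u2044bc"))                 # {a \over b}c
print(fraction_to_tex("a\u2044bc", extended=True))  # {a \over bc}
print(len(split_units("a + b", extended=True)))     # 5
```

With the simple definition the denominator is just "b"; with the
extended definition it is the whole span "bc".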

3. Associativity

Another question arises with both the sub/superscript operators and
fractions, namely what do you do with two or more such operators in a
row, e.g., a/b/c. TeX simply reports an error, but it seems more
natural to follow the rules of arithmetic. For fractions, the grouping
changes the value as well as the layout: (a/b)/c equals a/(bc), whereas
a/(b/c) equals ac/b. Similarly, using TeX's ^ for email purposes,
a^b^c can be grouped as a^(b^c) or as (a^b)^c, which are in general
quite different. For example, (2^3)^4 = 4096, while 2^(3^4) = 2^81.

We can resolve these ambiguities by using associativity, which is the
rule that says which way the parentheses should be inserted. For
subscripts and superscripts, I recommend right-to-left associativity,
so that a^b^c is grouped as a^(b^c), i.e., rightmost first. However,
for fractions, I recommend left-to-right associativity, so that a/b/c
is grouped as (a/b)/c, which agrees with the usual left-to-right
evaluation of repeated division.
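
A quick numeric check of what each grouping evaluates to, written in
Python. The sample values 8, 4, and 2 are arbitrary choices for
illustration. Python's own / and ** operators happen to use the same
left-to-right and right-to-left conventions recommended here.

```python
from fractions import Fraction

a, b, c = Fraction(8), Fraction(4), Fraction(2)

left_grouped = (a / b) / c     # same as a/(b*c)
right_grouped = a / (b / c)    # same as (a*c)/b

power_left = (2 ** 3) ** 4     # 8**4
power_right = 2 ** (3 ** 4)    # 2**81

print(left_grouped, right_grouped)   # 1 4
print(power_left)                    # 4096

# Ungrouped expressions follow the recommended associativity.
assert a / b / c == left_grouped
assert 2 ** 3 ** 4 == power_right
```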

4. Precedence

And yet another question arises, namely what do we mean by mixed
fractions, subscripts, and superscripts, such as a/b^c? The ordinary
rules of algebra give a clear answer: do exponentiations before
fractions, e.g., group a/b^c as a/(b^c). More formally, we say that
subscripts and superscripts have higher precedence than multiplication
and division. These, in turn, have higher precedence than addition,
subtraction, and other binary operators.

You only have to look at the associativity and precedence rules of the
C programming language to know that this subject can get too
complicated for most people! So TeX didn't bother with either and just
reports errors when you aren't totally explicit. I recommend a
compromise: we have a little bit of associativity and precedence,
namely that for subscripts, superscripts, and fractions as given here,
since it gives a legible plain text for simple algebraic expressions,
which comprise the lion's share of our target audience.
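
The precedence and associativity rules of topics 3 and 4 are small
enough to fit in a short precedence-climbing sketch. What follows is an
illustration under stated assumptions, not a specification: ASCII ^, _,
and / stand in for the dedicated superscript, subscript, and
fraction-slash operators, the entries for *, +, and - are one reading
of "multiplication and division" and "addition, subtraction", operands
are single letters or digits, and the names OPS, tokenize, parse, and
group are hypothetical.

```python
import re

# operator -> (precedence, right_associative)
OPS = {
    "^": (3, True),
    "_": (3, True),
    "/": (2, False),
    "*": (2, False),
    "+": (1, False),
    "-": (1, False),
}


def tokenize(text):
    """Single-character tokens; spaces and unknown characters are skipped."""
    return re.findall(r"[A-Za-z0-9]|[()^_/*+\-]", text)


def parse_atom(tokens):
    if not tokens:
        raise ValueError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        inner = parse(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise ValueError("missing closing parenthesis")
        return inner
    if tok in OPS or tok == ")":
        raise ValueError(f"unexpected token {tok!r}")
    return tok


def parse(tokens, min_prec=1):
    """Precedence climbing; returns a fully parenthesized string."""
    left = parse_atom(tokens)
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        prec, right_assoc = OPS[op]
        # Right-associative operators let the right side reuse the same level.
        right = parse(tokens, prec if right_assoc else prec + 1)
        left = f"({left}{op}{right})"
    return left


def group(text):
    tokens = tokenize(text)
    result = parse(tokens)
    if tokens:
        raise ValueError(f"unexpected token {tokens[0]!r}")
    return result


print(group("a^b^c"))   # (a^(b^c))
print(group("a/b/c"))   # ((a/b)/c)
print(group("a/b^c"))   # (a/(b^c))
```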

5. Integrals, Summation, and Products

Limits for integrals (U+222B through U+2233), summation (U+2211),
product (U+220F), and coproduct (U+2210) operators are given using the
subscript and superscript conventions, in a fashion analogous to TeX.
Specifically, if subscript and/or superscript operators follow one of
these operators, the associated subscript and/or superscript
expressions are used as the corresponding operator limits. When these
operators have such limits, they are ideally displayed in built-up form.
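
As an illustration, again letting ^ and _ stand in for the dedicated
operators, a sum written in plain text as ∑_(i=1)^n a_i and an integral
written as ∫_0^∞ e^(-x) dx would correspond to TeX roughly as follows.
The particular expressions are made-up examples, not taken from the
proposal.

```latex
\[
  \sum_{i=1}^{n} a_i
  \qquad
  \int_{0}^{\infty} e^{-x}\,dx
\]
```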

6. Square Roots and Friends

The square, cube, and fourth root operators (U+221A, U+221B, and
U+221C, respectively) are understood to enclose the character unit that
follows them. Other root operations can be displayed in the general
form of a parenthesized expression raised to the reciprocal of the root
power. The root operators have lower precedence than subscript and
superscript operators, so the sequence U+221A a^2 groups as U+221A (a^2).

7. Bracketed expressions

In general, bracketed expressions, e.g., things enclosed in (), [], {},
and <>, should be displayed in built-up form with the brackets sized
large enough to enclose the expressions. There are cases where one
wants to defeat such automatic sizing and display the brackets as
ordinary characters instead. This can be indicated by following the
opening bracket with a Unicode ZWNJ (U+200C), which causes the opening
bracket to be treated as an ordinary character. A closing bracket with no
corresponding opening bracket should be treated as an ordinary
character, so turning the opening bracket into an ordinary character
also turns the corresponding closing bracket into one. Admittedly this
refinement is an enhancement to the Unicode principle of legibility,
but it's sufficiently simple that it seems worthwhile to include.
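
One possible reading of this bracket rule, sketched in Python. The
function name sized_brackets is hypothetical, and two details are
assumptions rather than part of the proposal: a closing bracket of the
wrong kind is treated as unmatched, and an opening bracket that is
never closed is left as an ordinary character.

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
CLOSERS = set(PAIRS.values())


def sized_brackets(text):
    """Return (open_index, close_index) pairs to draw as sized brackets."""
    stack, pairs = [], []
    for i, ch in enumerate(text):
        if ch in PAIRS:
            # An opener followed by ZWNJ is an ordinary character, but it
            # still claims its closer so that one becomes ordinary too.
            ordinary = text[i + 1:i + 2] == ZWNJ
            stack.append((i, ordinary))
        elif ch in CLOSERS:
            if stack and PAIRS[text[stack[-1][0]]] == ch:
                start, ordinary = stack.pop()
                if not ordinary:
                    pairs.append((start, i))
            # Otherwise there is no corresponding opener: ordinary character.
    return sorted(pairs)


print(sized_brackets("(a+b)"))              # [(0, 4)]
print(sized_brackets("(" + ZWNJ + "a+b)"))  # []
print(sized_brackets("a)"))                 # []
```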

8. What's missing?

Most important, vertical alignment such as used in matrices is missing.
Such alignment is very much like that used in fractions, but has no
fraction bar. Accordingly we need to introduce an alignment operator
analogous to the fraction slash (U+2044). My PS Technical Word
Processor uses a large vertical bar for this purpose since it was handy
way back when, but we should design a more intuitive glyph for the purpose.

There are other built-up mathematical constructs, such as an overscore,
which usually means to take the average of the overscored expression.
For minimum legibility, one might just drop such an overscore, but then
the meaning is clearly different.

9. Math style

As my papers in the Unicode conferences discuss, sophisticated software
can do remarkably well at distinguishing mathematical expressions from
natural language, and thereby be able to display such expressions with
appropriate formatting, e.g., italicize English letters used in the
names of variables. However, I haven't been able to come up with 100%
reliable algorithms to distinguish between the two, and the algorithms I
have come up with generally do not qualify as simple, although
they'd be great for math-format wizard software. TeX never attempted
to solve this problem and requires "math mode" to be toggled on and off
by $'s. Woe be it to him or her who forgets a $! Fancy text can do
this elegantly by ascribing a math style to runs of text that contain
mathematical expressions, an approach that is ideal from a content
point of view.

There's no doubt that legible mathematical text requires at least some
such treatment, even if sophisticated heuristics are used. However, the
Unicode principle of legibility may not require it. My preference is
that we introduce embedded "math-on" and "math-off" codes to be able to
preserve a more satisfactory level of legibility. Mark prefers not to.
My experience with the PS Technical Word Processor shows that the
presence of such codes (or formatting) makes it a lot easier to convert
files to and from TeX, which is a desirable feature, considering the
dominance of TeX in the technical text field. Your opinions are solicited.

In summary, this proposal gives a set of rules for a simple Unicode
plain-text encoding of mathematical expressions. It borrows heavily
from TeX and can be converted to and from simple TeX with fairly simple
software. It has the advantage of looking like, or nearly like,
conventional mathematical notation, which makes it substantially easier
to use and read than the corresponding TeX. We need to address such
issues as 1) whether the elegance yielded by my extended character-unit
definition justifies the increased explanation required (we need to
define which characters are operators), and 2) whether the level of
legibility should be raised high enough to require the introduction of
math-on/math-off punctuation characters. In addition, we need to
discuss any other issues that you think are important.

Thanks
Murray


