
At the 29-sep-95 UTC meeting, we decided that the Unicode plain-text

math notation is tricky enough that it should be discussed by a

subcommittee consisting of interested individuals before being brought

to the UTC as a whole. Mark Davis and I got the ball rolling with the

simple proposal that follows (with a couple of enhancements here and

there :). Other people who have expressed interest are Asmus Freytag

and Lisa Moore. Please let me know if you want to be involved in this

subcommittee. The following is written without built-up formulas so

that it can go through email easily. Unfortunately that makes it

harder to read, so in the future I'd prefer to switch to Microsoft Word

so that we have a way to display built-up formulas.

Simple Unicode Plain-Text Math Notation

A guiding principle governing the choice of Unicode semantics and

special characters is that "plain text must contain enough information

to permit the text to be rendered legibly, nothing more" (p. 10, The

Unicode Standard, Vol. 1). To this end, Mark Davis and I have come up

with an initial step in developing a mathematics encoding scheme. We

add to the Unicode plain-text legibility principle the desires to be as

simple as possible and to look like ordinary mathematics if possible.

In this spirit, we don't pretend to be able to round-trip an arbitrary

mathematical expression through our plain text. For example, TeX has

fancy-text enhancements that go beyond legibility, such as specifying

various sizes of brackets. On the other hand, our proposed notation is

very much like a simple Unicode version of TeX's mathematics notation.

Background information is given in my paper "Unicode, Rich Text, and

Mathematics" in the proceedings of the Seventh Unicode Conference. In

the following, I summarize what Mark and I came up with at the

29-sep-95 UTC meeting along with some problems that remain to be resolved.

1. Subscripts and Superscripts

The first notation to render legible is that for subscripts and

superscripts. In the discussion, we define the character unit, which is

used in all expressions that have built-up display forms. Currently

Unicode has about 16 subscripts and 16 superscripts (see, e.g., U+2070

through U+208E), which, while quite useful, are clearly not general

enough to satisfy mathematical needs. While fancy-text systems can

readily render at least one level of subscripts or superscripts,

exporting such fancy text to plain text loses the subscript/superscript

information, i.e., makes the text illegible. TeX uses the principle

that the character unit that follows an underscore is to be subscripted

to the character unit that precedes the underscore, e.g., a_2 displays

"a" followed by a subscripted 2. To allow for subscripting more than

one character, the TeX character unit can be either a single character

or any arbitrary expression enclosed in a pair of curly braces, e.g.,

{b + c} is a character unit consisting of "b + c". So a_{b + c}

subscripts "b + c" (without the quotes, of course). TeX superscripts

work the same way, except that the superscript operator ^ is used in

place of the subscript operator _.

We propose three simple changes to this approach:

a) We replace TeX's superscript operator ^ by a dedicated superscript

operator with a display form that looks like a superscripted up arrow.

Similarly, we replace TeX's subscript operator _ by a dedicated

subscript operator that looks like a subscripted down arrow. It's

interesting to note that pre-ASCII versions of TeX used a large private

character set that had such special subscript and superscript operator

characters. Their introduction frees up ^ and _ to represent themselves.

b) We enclose multiple-character subscript and superscript expressions

with parentheses instead of TeX's {}. Just as TeX doesn't display the

{} when the expressions are displayed in built up form, the parentheses

enclosing an expression immediately following a subscript or

superscript operator are not displayed in built up form. If a set of

parentheses should nevertheless be displayed, an additional set can be

included within the "outermost" parentheses. This use of parentheses

frees up {} to represent themselves.

c) If the character unit consists of any Unicode space character

(U+0020, U+2002 through U+200B), the preceding operator is displayed in

its simple display form. This helps one in writing text to document

this approach, and it's handy for displaying ordinary size versions of

various operators discussed below.

In summary, for the subscript and superscript operators (and other

operators described below), a character unit can consist of either a

single character or an arbitrary expression enclosed within

parentheses. We see next that the use of ()'s leads to a natural

plain-text mathematical notation for fractions and square roots,

whereas {}'s do not.
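As a minimal sketch of the character-unit rule summarized above (an illustration only, not part of the proposal), the following Python reads one unit from a string. The names read_unit, is_space_unit, and SPACE_UNITS are hypothetical, and ASCII _ stands in for the proposed subscript operator, which has no assigned code point here.

```python
# Space characters named in rule c): U+0020 and U+2002 through U+200B.
SPACE_UNITS = {"\u0020"} | {chr(cp) for cp in range(0x2002, 0x200C)}


def read_unit(text, pos):
    """Return (unit, next_pos) for the character unit starting at text[pos].

    A unit is either a single character or a balanced parenthesized
    expression. The outermost parentheses are dropped from the returned
    unit, mirroring the rule that they are not displayed in built-up form.
    """
    if pos >= len(text):
        raise ValueError("no character unit at end of input")
    if text[pos] != "(":
        return text[pos], pos + 1
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[pos + 1:i], i + 1
    raise ValueError("unbalanced parentheses")


def is_space_unit(unit):
    """True if the unit is a lone space, so the operator displays as itself."""
    return unit in SPACE_UNITS


# In "a_(b + c)" the unit after the operator at index 1 is "b + c".
subscript, next_pos = read_unit("a_(b + c)", 2)
```

Doubling the parentheses, as in a_((b + c)), leaves one displayed pair in the returned unit, which matches the rule in b).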

2. Fractions

Unicode's fraction-slash operator (U+2044) exists "for composing

arbitrary fractions" (p. 280, The Unicode Standard, Vol. 1).

Accordingly, we propose that the character unit preceding the

fraction-slash operator be the numerator of a fraction and the

character unit following the fraction-slash operator be the

denominator. As explained in topic 1b), if the character unit consists

of a parenthesized expression, the outermost level of parentheses is

not displayed in built up form. So the expression (a + b)/c, where /

is the Unicode fraction slash, would have the numerator "a + b" and

denominator "c". Note that (a + b)/c is a conventional mathematical

expression, whereas TeX's equivalent, {a + b \over c} is not.

One problem with this simple approach is that an expression like a/bc

has the built-up form corresponding to {a \over b}c, whereas according

to usual mathematical convention, it should display as a/(bc), that is,

as {a \over bc}. An advantage of the simple approach is that it

corresponds precisely to TeX, which might aid a bit in converting

between the two. My papers discuss a slightly more complicated

definition of a character unit that corresponds to conventional

mathematical notation. Specifically, I define a character unit as in 1b)

with the extension that any span of nonoperator characters also

qualifies as a character unit. So "abc" consists of a single character

unit (quotes not included), while "a + b" consists of 5 character units.
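The difference between the simple and the extended character-unit definitions can be sketched as follows. This is an illustration under stated assumptions: "nonoperator character" is approximated by Python's str.isalnum(), which the proposal does not specify, and the function names split_units, strip_outer, and fraction_parts are hypothetical.

```python
FRACTION_SLASH = "\u2044"


def split_units(text, extended=True):
    """Split text into character units.

    A parenthesized expression is one unit. With extended=True, a run of
    nonoperator characters (assumed here: alphanumerics) is one unit;
    with extended=False, every other character is its own unit.
    """
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            depth, j = 0, i
            while j < len(text):
                if text[j] == "(":
                    depth += 1
                elif text[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if depth != 0:
                raise ValueError("unbalanced parentheses")
            units.append(text[i:j + 1])
            i = j + 1
        elif extended and ch.isalnum():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            units.append(text[i:j])
            i = j
        else:
            units.append(ch)
            i += 1
    return units


def strip_outer(unit):
    """Drop the outermost parentheses of a parenthesized unit."""
    if len(unit) >= 2 and unit[0] == "(" and unit[-1] == ")":
        return unit[1:-1]
    return unit


def fraction_parts(text, extended=True):
    """Return (numerator, denominator, trailing_units) around the first fraction slash."""
    units = split_units(text, extended)
    i = units.index(FRACTION_SLASH)  # raises ValueError if there is no slash
    if i == 0 or i + 1 >= len(units):
        raise ValueError("fraction slash needs a unit on each side")
    return strip_outer(units[i - 1]), strip_outer(units[i + 1]), units[i + 2:]
```

With the simple definition, a/bc yields denominator "b" followed by a trailing "c"; with the extended definition the denominator is "bc".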

3. Associativity

Another question arises with both the sub/superscript operators and

fractions, namely what do you do with two or more such operators in a

row, e.g., a/b/c. TeX simply reports an error, but it seems more

natural to follow the rules of arithmetic. For fractions, (a/b)/c

equals a/(bc) while a/(b/c) equals ac/b, so the grouping affects the

value as well as where the baseline is drawn. Likewise, using TeX's ^

for email purposes, a^b^c can be grouped as a^(b^c) or as (a^b)^c,

which are in general quite different. For example, (2^3)^4 = 4096,

while 2^(3^4) = 2^81.

We can resolve these ambiguities by using associativity, which is the

rule that says which way the parentheses should be inserted. For

subscripts and superscripts, I recommend right-to-left associativity,

so that a^b^c is grouped as a^(b^c), i.e., rightmost first. However,

for fractions, I recommend left-to-right associativity, so that a/b/c

is grouped as (a/b)/c, which agrees with the usual left-to-right

evaluation of division in arithmetic.
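The two associativity rules can be illustrated with a short Python sketch. The helper names group_left and group_right are hypothetical, and the numeric checks use 2^3^4 and 8/4/2 purely as examples.

```python
from fractions import Fraction
from functools import reduce


def group_left(operands, op):
    """Left-to-right grouping: a op b op c -> ((a op b) op c)."""
    return reduce(lambda acc, x: f"({acc}{op}{x})", operands)


def group_right(operands, op):
    """Right-to-left grouping: a op b op c -> (a op (b op c))."""
    return reduce(lambda acc, x: f"({x}{op}{acc})", reversed(operands))


fractions_grouped = group_left(["a", "b", "c"], "/")     # "((a/b)/c)"
superscripts_grouped = group_right(["a", "b", "c"], "^")  # "(a^(b^c))"

# Numeric illustration of why the choice of grouping matters.
left_power = (2 ** 3) ** 4    # 4096
right_power = 2 ** (3 ** 4)   # 2^81; Python's ** also groups this way
left_quotient = (Fraction(8) / 4) / 2     # 1
right_quotient = Fraction(8) / (Fraction(4) / 2)  # 4
```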

4. Precedence

And yet another question arises, namely what do we mean by mixed

fractions, subscripts, and superscripts, such as, a/b^c ? The ordinary

rules of algebra give a clear answer: do exponentiations before

fractions, e.g., group a/b^c as a/(b^c). More formally, we say that

subscripts and superscripts have higher precedence than multiplication

and division. These, in turn, have higher precedence than addition,

subtraction, and other binary operators.

You only have to look at the associativity and precedence rules of the

C programming language to know that this subject can get too

complicated for most people! So TeX didn't bother with either and just

reports errors when you aren't totally explicit. I recommend a

compromise: we have a little bit of associativity and precedence,

namely that for subscripts, superscripts, and fractions as given here,

since it gives a legible plain text for simple algebraic expressions,

which comprise the lion's share of our target audience.
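A small precedence-climbing parser is one way to apply this limited set of precedence and associativity rules. The sketch below is an assumption-laden illustration, not the proposal's algorithm: ASCII ^, _, and / stand in for the dedicated superscript, subscript, and fraction-slash operators, each token is a single character unit, and the names OPS and group are hypothetical.

```python
# (precedence, associativity) for each operator; higher binds tighter.
OPS = {
    "^": (3, "right"), "_": (3, "right"),
    "/": (2, "left"), "*": (2, "left"),
    "+": (1, "left"), "-": (1, "left"),
}


def group(tokens):
    """Return a fully parenthesized string for a flat list of tokens."""
    pos = 0

    def parse(min_prec):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] in OPS:
            raise ValueError("expected an operand")
        left = tokens[pos]
        pos += 1
        while pos < len(tokens) and tokens[pos] in OPS:
            op = tokens[pos]
            prec, assoc = OPS[op]
            if prec < min_prec:
                break
            pos += 1
            # A left-associative operator must not absorb another operator
            # of the same precedence on its right; a right-associative one may.
            right = parse(prec + 1 if assoc == "left" else prec)
            left = f"({left}{op}{right})"
        return left

    result = parse(1)
    if pos != len(tokens):
        raise ValueError("unexpected token")
    return result
```

For example, group(list("a/b^c")) returns "(a/(b^c))", and group(list("a/b/c")) returns "((a/b)/c)".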

5. Integrals, Summation, and Products

Limits for integrals (U+222B through U+2233), summation (U+2211),

product (U+220F), and coproduct (U+2210) operators are given using the

subscript and superscript conventions, in a fashion analogous to TeX.

Specifically, if subscript and/or superscript operators follow one of

these operators, the associated subscript and/or superscript

expressions are used as the corresponding operator limits. When these

operators have such limits, they are ideally displayed in built-up form.
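Reading the limits of such an operator might look like the following sketch. It assumes ASCII _ and ^ as stand-ins for the proposed subscript and superscript operators, and the names NARY, read_unit, and read_limits are hypothetical.

```python
# Summation, product, coproduct, and the integrals U+222B through U+2233.
NARY = {0x2211, 0x220F, 0x2210} | set(range(0x222B, 0x2234))
SUB, SUP = "_", "^"  # stand-ins for the proposed dedicated operators


def read_unit(text, pos):
    """Return (unit, next_pos); outermost parentheses are dropped."""
    if pos >= len(text):
        raise ValueError("no character unit at end of input")
    if text[pos] != "(":
        return text[pos], pos + 1
    depth = 0
    for i in range(pos, len(text)):
        depth += (text[i] == "(") - (text[i] == ")")
        if depth == 0:
            return text[pos + 1:i], i + 1
    raise ValueError("unbalanced parentheses")


def read_limits(text, pos=0):
    """Return (lower, upper, next_pos) for the n-ary operator at text[pos].

    A limit that is not present is returned as None.
    """
    if ord(text[pos]) not in NARY:
        raise ValueError("not an integral, summation, or product operator")
    limits = {SUB: None, SUP: None}
    pos += 1
    while pos < len(text) and text[pos] in limits and limits[text[pos]] is None:
        op = text[pos]
        limits[op], pos = read_unit(text, pos + 1)
    return limits[SUB], limits[SUP], pos
```

For the input "\u2211_(i=1)^n a_i" this returns the lower limit "i=1", the upper limit "n", and the position where the summand begins.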

6. Square Roots and Friends

The square, cube, and fourth root operators (U+221A, U+221B, and

U+221C, respectively) are understood to enclose the character unit that

follows them. Other root operations can be displayed in the general

form of a parenthesized expression raised to the reciprocal of the root

power. The root operators have lower precedence than subscript and

superscript operators, so the sequence U+221A a^2 groups as U+221A (a^2).
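The root rules can be sketched in the same style. This is an illustration only: ASCII ^ and _ again stand in for the proposed operators, and the names ROOTS, read_root, and general_root are hypothetical.

```python
ROOTS = {"\u221A": 2, "\u221B": 3, "\u221C": 4}  # square, cube, fourth root
SCRIPT_OPS = "^_"  # stand-ins for the proposed superscript/subscript operators


def read_unit(text, pos):
    """Return (unit, next_pos); outermost parentheses are dropped."""
    if pos >= len(text):
        raise ValueError("no character unit at end of input")
    if text[pos] != "(":
        return text[pos], pos + 1
    depth = 0
    for i in range(pos, len(text)):
        depth += (text[i] == "(") - (text[i] == ")")
        if depth == 0:
            return text[pos + 1:i], i + 1
    raise ValueError("unbalanced parentheses")


def read_root(text, pos=0):
    """Return (root_index, radicand, next_pos) for the root operator at text[pos].

    Because roots bind more loosely than sub/superscripts, any scripts
    attached to the following unit belong to the radicand.
    """
    if text[pos] not in ROOTS:
        raise ValueError("not a root operator")
    start = pos + 1
    unit, end = read_unit(text, start)
    has_scripts = False
    while end < len(text) and text[end] in SCRIPT_OPS:
        _, end = read_unit(text, end + 1)
        has_scripts = True
    radicand = text[start:end] if has_scripts else unit
    return ROOTS[text[pos]], radicand, end


def general_root(expr, n):
    """Write an n-th root as a parenthesized expression raised to 1/n."""
    return f"({expr})^(1/{n})"
```

Here read_root("\u221Aa^2+b") yields the radicand "a^2", leaving "+b" outside the root.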

7. Bracketed expressions

In general bracketed expressions, e.g., things enclosed in (), [], {},

and <>, should be displayed in built-up form with the brackets sized

large enough to enclose the expressions. There are cases where one

wants to defeat such automatic sizing and display the brackets as

ordinary characters instead. This desire can be indicated by following

the opening bracket with a Unicode ZWNJ, which then treats the opening

bracket as an ordinary character. A closing bracket with no

corresponding opening bracket should be treated as an ordinary

character, so turning the opening bracket into an ordinary character

also turns the corresponding closing bracket into one. Admittedly this

refinement is an enhancement to the Unicode principle of legibility,

but it's sufficiently simple that it seems worthwhile to include.
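The bracket rule can be sketched as a single pass with a stack. The name classify_brackets is hypothetical, and treating an opening bracket that is never closed as ordinary is an assumption; the text above does not cover that case.

```python
ZWNJ = "\u200C"  # zero width non-joiner
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}
OPENER_FOR = {close: open_ for open_, close in PAIRS.items()}


def classify_brackets(text):
    """Map each bracket's index to 'sized' or 'ordinary'."""
    kinds = {}
    stack = []  # (index, opener) for openers awaiting a closer
    for i, ch in enumerate(text):
        if ch in PAIRS:
            if text[i + 1:i + 2] == ZWNJ:
                # Opener followed by ZWNJ is an ordinary character and is
                # not pushed, so its closer will find no partner.
                kinds[i] = "ordinary"
            else:
                kinds[i] = "sized"
                stack.append((i, ch))
        elif ch in OPENER_FOR:
            if stack and stack[-1][1] == OPENER_FOR[ch]:
                stack.pop()
                kinds[i] = "sized"
            else:
                kinds[i] = "ordinary"
    # Assumption: openers that were never closed are also ordinary.
    for i, _ in stack:
        kinds[i] = "ordinary"
    return kinds
```

For "(" + ZWNJ + "a+b)" both parentheses come out as ordinary, since the closer no longer has a corresponding opener.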

8. What's missing?

Most important, vertical alignment such as used in matrices is missing.

Such alignment is very much like that used in fractions, but has no

fraction bar. Accordingly we need to introduce an alignment operator

analogous to the fraction slash (U+2044). My PS Technical Word

Processor uses a large vertical bar for this purpose since it was handy

way back when, but we should design a more intuitive glyph for the purpose.

There are other built-up mathematical constructs, such as an overscore,

which usually means to take the average of the overscored expression.

For minimum legibility, one might just drop such an overscore, but then

the meaning is clearly different.

9. Math style

As my papers in the Unicode conferences discuss, sophisticated software

can do remarkably well at distinguishing mathematical expressions from

natural language, and thereby be able to display such expressions with

appropriate formatting, e.g., italicize English letters used in the

names of variables. However, I haven't been able to come up with 100%

reliable algorithms to distinguish between the two and the algorithms I

have come up with in general do not qualify as being simple, although

they'd be great for math-format wizard software. TeX never attempted

to solve this problem and requires "math mode" to be toggled on and off

by $'s. Woe be it to him or her who forgets a $! Fancy text can do

this elegantly by ascribing a math style to runs of text that contain

mathematical expressions, an approach that is ideal from a content

point of view.

There's no doubt that legible mathematical text requires at least some

such treatment, even if sophisticated heuristics are used. However, the

Unicode principle of legibility may not require it. My preference is that we

introduce embedded "math-on" and "math-off" codes to be able to

preserve a more satisfactory level of legibility. Mark prefers not to.

My experience with the PS Technical Word Processor shows that the

presence of such codes (or formatting) makes it a lot easier to convert

files to and from TeX, which is a desirable feature, considering the

dominance of TeX in the technical text field. Your opinions are solicited.
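To make the math-on/math-off idea concrete, here is a minimal sketch of how software might split text into math and non-math runs. No code points have been chosen for such codes; the private-use characters U+E000 and U+E001 below are placeholders only, and the name split_runs is hypothetical.

```python
MATH_ON = "\uE000"   # placeholder only; no such code has been assigned
MATH_OFF = "\uE001"  # placeholder only


def split_runs(text):
    """Return a list of (is_math, run) pairs with the codes removed."""
    runs = []
    is_math = False
    start = 0
    for i, ch in enumerate(text):
        turns_on = ch == MATH_ON and not is_math
        turns_off = ch == MATH_OFF and is_math
        if turns_on or turns_off:
            if i > start:
                runs.append((is_math, text[start:i]))
            is_math = not is_math
            start = i + 1
    if start < len(text):
        runs.append((is_math, text[start:]))
    return runs
```

Unlike TeX's $, which serves as both the opening and the closing delimiter, distinct on and off codes let a reader tell at any point which state is being entered.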

In summary, this proposal gives a set of rules for a simple Unicode

plain-text encoding of mathematical expressions. It borrows heavily

from TeX and can be converted to and from simple TeX with fairly simple

software. It has the advantage of looking like, or nearly like,

conventional mathematical notation, which makes it substantially easier

to use and read than the corresponding TeX. We need to address such

issues as 1) whether the elegance yielded by my extended character-unit

definition justifies the increased explanation required (we need to

define which characters are operators), and 2) whether the level of

legibility should be raised high enough to require the introduction of

math-on/math-off punctuation characters. In addition, we need to

discuss any other issues that you think are important.

Thanks

Murray


*
This archive was generated by hypermail 2.1.2
: Tue Jul 10 2001 - 17:20:30 EDT
*