Re: Public Review Issue Unicode Technical Report #25, "Unicode Support for Mathematics"

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jan 11 2007 - 23:04:12 CST


<quote>
2.8 Superscripts and Subscripts
The Superscripts and Subscripts block U+2070.. U+209F together with U+00B2, U+00B3, and U+00B9 contain a collection superscript and subscript digits and punctuation that can be useful in mathematics. If they are used, it is recommended that they be displayed with the same font size as other subscripts and superscripts at the corresponding nested script level. For example, a² and a<super>2</super> should be displayed the same. However, these subscript/superscript characters are not used in MathML or TEX and their use with XML documents is discouraged, see Unicode Technical Report #20, Unicode in XML and other Markup Languages [UXML]. Editors for these formats may offer facilities to convert these characters to regular characters plus markup.
</quote>

I'm quite surprised by the conclusion drawn here, "discouraging" the use of superscript and subscript characters, even though there are languages in which they are letters distinct from their respective baseline counterparts.

I cited Minnan in another message, and it is a good example here too: the superscript n (U+207F, encoded <E2,81,BF> in UTF-8) is in fact used as a letter modifier, placed after a vowel to nasalize it; it is halfway between a plain consonant and a diacritic (much like the anusvara in Indic scripts). Note also that U+207F is definitely not the same letter as a regular n. It is present in many legacy charsets (including DOS codepage 437 for the US and other "OEM" sets such as DOS Arabic, DOS Greek and DOS Hebrew, at position 0xFC, but also the oldest legacy Chinese charsets).
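The encoding facts above can be checked with a short Python sketch. It uses only the standard library; the variable names and the sample string "a" + U+207F are arbitrary choices made for this illustration, not taken from any Minnan text. It also shows that compatibility normalization (NFKC) folds U+207F into a plain n, which is exactly the kind of loss I am worried about.

```python
import unicodedata

SUPERSCRIPT_N = "\u207f"  # SUPERSCRIPT LATIN SMALL LETTER N

# UTF-8 encoding of U+207F: three bytes, E2 81 BF.
utf8_bytes = SUPERSCRIPT_N.encode("utf-8")

# In DOS codepage 437, the byte 0xFC decodes to the same character.
from_cp437 = b"\xfc".decode("cp437")

# Compatibility normalization replaces the superscript n with a regular n,
# so the distinction between the two letters is lost.
# The sample string is arbitrary (an assumption for illustration only).
folded = unicodedata.normalize("NFKC", "a" + SUPERSCRIPT_N)

print(utf8_bytes.hex(" "))          # bytes of the UTF-8 form
print(from_cp437 == SUPERSCRIPT_N)  # True if CP437 0xFC maps to U+207F
print(folded)                       # the folded string
```

Canonical normalization (NFC or NFD) leaves U+207F untouched; only the compatibility forms fold it.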

Conversely, this paragraph does not mention U+00AA (feminine ordinal indicator) and U+00BA (masculine ordinal indicator), even though their meaning is clearly derived from the regular small letters a and o. This is probably because these letters are often rendered in fonts with an underscore (or combining low macron), and this underscore does not disturb the position of underlining, which remains below the baseline of non-superscripted characters...
But this paragraph should also note the effect of converting superscript/subscript characters into regular characters plus markup: it moves the baseline and reduces the character height and bounding box, so it affects at least:
* the rendering of characters decorated with styles such as background coloring, underlining, overlining, framing and overstrikes (see <del> and <s> in HTML), which may be used to annotate a text with visible metadata rendered as text decorations (the HTML text of the UTR public review is an example of such uses!)
* and possibly also the rendering of the text selection, and the position and size of the blinking input caret

So I think that the quoted sentences above are too broad. They are probably true for the subscript/superscript digits, parentheses, equals sign, plus and minus signs, operators and other punctuation marks, but more care should be taken with superscript/subscript letters, even in the context of mathematics, because mathematical text can include plain-text parts containing normal words with their regular meaning in their language (even in LaTeX documents), and blindly converting them to base letters plus markup may break the text.

Letters that are encoded as superscripts/subscripts may have a distinct meaning that must be preserved, and which may be more than just a convenience. They should be *preferred* to markup in that case, and their rendering should *not* be required to exhibit the same glyph as regular letters modified with markup for superscript/subscript.

Conversely, when these letters have the semantics of normal variables (for example in mathematics, when using superscripts to denote exponentiation, or subscripts to denote indices), the markup should be used instead of encoding superscript/subscript letters. For digits, punctuation (including the decimal point or comma, and parentheses) and maths operators (plus and hyphen-minus signs), there is probably no such issue: in maths documents, they will almost certainly and unambiguously represent the same variables, operators or numbers as those encoded as full-height characters on the baseline and, consequently, conversion to regular characters plus markup could be proposed.
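The distinction I am proposing could be sketched as follows in Python. This is only an illustration under stated assumptions: the function names script_info and to_markup are hypothetical, the <super> tag simply follows the notation used in the quoted paragraph, and <sub> is my assumed counterpart for subscripts. The sketch relies on the <super> and <sub> compatibility decompositions exposed by the standard unicodedata module, and converts a character only when its base character is not a letter.

```python
import unicodedata

def script_info(ch):
    """Return (tag, base) if ch decomposes to <super>/<sub> plus one
    base character, otherwise None."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 2 and parts[0] in ("<super>", "<sub>"):
        return parts[0][1:-1], chr(int(parts[1], 16))
    return None

def to_markup(text):
    """Convert superscript/subscript digits, signs and punctuation to
    base characters plus markup; leave superscript/subscript letters
    (U+207F, U+00AA, U+00BA, ...) untouched."""
    out = []
    open_tag = None  # tag of the markup run currently open, if any
    for ch in text:
        info = script_info(ch)
        if info and not unicodedata.category(info[1]).startswith("L"):
            tag, base = info          # non-letter: convert to markup
        else:
            tag, base = None, ch      # letter or ordinary char: keep as is
        if tag != open_tag:
            if open_tag:
                out.append(f"</{open_tag}>")
            if tag:
                out.append(f"<{tag}>")
            open_tag = tag
        out.append(base)
    if open_tag:
        out.append(f"</{open_tag}>")
    return "".join(out)

print(to_markup("a\u00b2"))         # superscript two is converted
print(to_markup("a\u207f"))         # superscript n is preserved
print(to_markup("x\u2081\u2080"))   # a run of subscript digits shares one tag
```

A real editor would still need context to decide whether a superscript letter is a mathematical exponent or part of a word; this sketch takes the conservative choice of never converting letters.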


