RE: UTC Agenda item: Mathematical Letter Symbols

L2/00-249

From: Karlsson Kent - keka [keka@im.se]
Sent: Thursday, August 03, 2000 6:49 AM
To: Multiple Recipients of Unicore
Subject: RE: UTC Agenda item: Mathematical Letter Symbols

Regarding the "math alphanumeric characters" proposal
-----------------------------------------------------------------------------

I've finally got some time to comment on this issue. I've been too busy
editing a somewhat math-oriented document which does make distinctions
between upright non-bold, bold, and italic versions of the same sequence
of letters, as well as between bold and non-bold versions of the same
symbols (for plus, minus, and infinity, as it happens). It also uses
multi-letter identifiers in math expressions. That the identifiers are
multi-letter is important: the document would be unreadable if single-letter
identifiers had been used throughout.

I'm very strongly opposed to the "math alphanumeric characters" proposal.
As someone who would be a 'user' of the "math alphanumeric characters" if they
were to be accepted and then used in e.g. MathML, I very much fear the problems
that will result: problems setting/changing variety, problems with searches,
problems getting the desired identifiers in the desired variety. E.g., I might
not be able to get a bold "oändlig", or would at least have severe problems
finding and using a work-around. This is not an unrealistic example: the document
I've been busy with has the bold identifier "infinitary" (in math expressions!).
If I were to translate the document to Swedish, that would be a bold "oändlig".
And the "math alphanumeric characters" do not allow me to write that!

Character properties

The 'math alphanumeric characters' are not symbols any more than an
ordinary letter is a symbol. So these characters, if adopted (which they definitely
should NOT be), should unequivocally be given the general categories Lu, Ll, and
Nd as appropriate, with compatibility (<font>) mappings to the ordinary letters
and digits. Notice that even the proponents of these "math alphanumeric characters"
seem to propose to use the ordinary letters and digits in math expressions too
(though it is not entirely clear for exactly what; upright non-bold letters and digits?).
Notice also that (Latin, Greek) letters in math expressions are most commonly
italic; non-italic letters in math expressions are much more of an exception.
That is why (La)TeX by default sets letters in math expressions in italic.
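
As a rough illustration of the category, compatibility-mapping, and search
points above, here is a minimal Python sketch. It assumes an interpreter whose
Unicode database contains plane 1 bold letters starting at U+1D41A (bold
small a); that code point layout is an assumption of the sketch, not something
this message specifies.

```python
import unicodedata

# Assumption: plane 1 "mathematical bold" small letters start at U+1D41A.
bold_abc = "".join(chr(0x1D41A + i) for i in range(3))

for ch in bold_abc:
    # Show the name, general category, and compatibility decomposition.
    print(unicodedata.name(ch),
          unicodedata.category(ch),
          unicodedata.decomposition(ch))

# A plain substring search does not find the ordinary letters ...
print("abc" in bold_abc)
# ... unless the text is first folded through its compatibility mappings.
print("abc" in unicodedata.normalize("NFKC", bold_abc))
```

The point of the sketch is only that a search for the ordinary letters matches
such characters after compatibility folding, and not before.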

Alleged added mark-up verbosity

The only "hard and fast" argument for including these "math
alphanumeric characters" appears to be to "save some bandwidth" in that using
mark-up instead would be more verbose. This is, however, 100% false. If the
mark-up scheme is done in any reasonable way, using mark-up instead is
(marginally) LESS verbose than using these "math alphanumeric characters".
Example:
    "math alphanumeric characters" (in a MathML setting):
        <mi>abc</mi>                      (upright non-bold???)
        <mi>&bolda;&boldb;&boldc;</mi>    (upright bold)
        <mi>&fraka;&frakb;&frakc;</mi>    (fraktur)
        (etc. for the remaining handful of varieties)

    (one possible, reasonably done) "mark-up instead" alternative:
        <mr>abc</mr>    (upright non-bold)
        <mb>abc</mb>    (upright bold)
        <mf>abc</mf>    (fraktur)
        (etc. for the remaining handful of varieties)

Shortening the entity names or using the "math alphanumeric characters"
directly (in UTF-8 or UTF-16), which the proponents apparently suggest,
is still more verbose than the alternative mark-up version given here.
There is only a handful of varieties; I'm NOT suggesting that each and
every font difference counts. (I'm also avoiding the word "style" since
some people seem to misunderstand what that would mean.)
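
The verbosity comparison can be checked with a small Python sketch. It is only
an illustration under stated assumptions: the bold letters are taken to be
plane 1 characters starting at U+1D41A, the entity names are the ones used in
the example above, and <mb> is the hypothetical tag from that example, not a
MathML element.

```python
def math_bold(text):
    """Map ASCII lowercase letters to plane 1 bold letters (assumed at U+1D41A)."""
    return "".join(chr(0x1D41A + ord(c) - ord("a")) for c in text)

def encoded_sizes(s):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM) for a string."""
    return len(s.encode("utf-8")), len(s.encode("utf-16-le"))

entity_form = "<mi>&bolda;&boldb;&boldc;</mi>"     # entity references
direct_form = "<mi>" + math_bold("abc") + "</mi>"  # characters used directly
markup_form = "<mb>abc</mb>"                       # hypothetical bold tag

for label, s in [("entities", entity_form),
                 ("direct", direct_form),
                 ("mark-up", markup_form)]:
    print(label, encoded_sizes(s))
# entities (30, 60)
# direct (21, 30)
# mark-up (12, 24)
```

Under these assumptions the mark-up form is the shortest in both encodings,
and the gap grows with identifier length, since each plane 1 letter takes
4 bytes in either encoding while an ordinary ASCII letter takes 1 byte in
UTF-8 and 2 bytes in UTF-16.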

Bold (non-alphanumeric) symbols

If "math alphanumeric characters" are 'needed' because of semantic distinctions
between the few varieties, then all "math symbols" (category Sm) also need
to be duplicated in bold versions. Is this the plan? If not, why not? Bold
symbols are sometimes used in a semantically distinct way relative to the
corresponding non-bold symbol. The reasoning for "math alphanumeric
characters" and for "bold math symbols" would be the same, so the two should
be treated the same way when it comes to encoding considerations!

In LaTeX, bold symbols are obtained via the \boldsymbol command or via the \pmb
command. (\pmb is 'poor man's bold', which simulates bold by overtyping; it is handy
if the desired bold symbol is not available in true bold in the installed symbol font.)

Semantic significance

The different varieties of letters in math, like italic, bold, and fraktur, do signify a
semantic difference, as do bold vs. non-bold versions of other (Sm) symbols.
This does not mean that the difference needs to be conveyed through
different character allocations. Indeed, MathML makes a semantic
difference between <mi> and <mn>, as well as a host of other such
differences. There is no reason why MathML, and similar mark-up schemes,
could not make the difference between, say, italic and fraktur a mark-up one:
<mi> (italic), <mf> (fraktur).

Math is inherently "non-plain" text

Very little math can be written without mark-up of some sort. Even Murray's
"plain text math" is a kind of mark-up (very much its own).

Multi-letter identifiers and I18n

Some branches of math, computing science in particular, use multi-letter identifiers
also in mathematical expressions. If these are derived from words in any language
other than English, making them, e.g., bold suddenly requires a different mechanism
from the one used for English-derived identifiers. It is very unlikely that any
system will handle this gracefully if it is geared towards using "math
alphanumeric characters". Likewise, making symbols bold will require a separate
mechanism, unless you plan to also allocate "bold math symbols" as separate
characters.
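
The gap can be made concrete with a short Python sketch. It assumes, purely
for illustration, that the bold letters cover only A-Z (from U+1D400) and a-z
(from U+1D41A); the helper name `unmappable` is hypothetical.

```python
import string

# Assumption: plane 1 bold letters laid out as A-Z from U+1D400, a-z from U+1D41A.
BOLD = {c: chr(0x1D400 + i) for i, c in enumerate(string.ascii_uppercase)}
BOLD.update({c: chr(0x1D41A + i) for i, c in enumerate(string.ascii_lowercase)})

def unmappable(identifier):
    """Return the letters of an identifier that have no bold counterpart."""
    return [c for c in identifier if c not in BOLD]

print(unmappable("infinitary"))     # []
print(unmappable("o\u00e4ndlig"))   # ['ä']  (U+00E4, a with diaeresis)
```

The English identifier maps letter for letter, while the Swedish one leaves
"ä" behind, so some other mechanism would be needed for that one letter.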

Old TeX vs. modern LaTeX

Old TeX used commands like \calE to get a calligraphic ('script') E. Each available
letter in each available 'math' variety had its own command. This is very similar
to the "math alphanumeric characters" proposal.

However, modern LaTeX has abandoned that approach, and instead uses parametric
commands, where the parameter is the letters (plural!) to be set in a particular
variety, e.g. \mathcal{E} to get a calligraphic ('script') E. This way multi-letter
identifiers can be handled gracefully, and in principle multi-letter identifiers
(in math expressions!) need not be derived from *English* words, but
can come from some other language.

LaTeX math identifier 'commands' (cf. 'mark-up'):

\mathit{abc}    Italic (in principle, default for single-letter identifiers in LaTeX)
\mathbf{abc}    Bold identifiers
\mathrm{abc}    Upright, non-bold (typically: "sup", "sin", "lim", ...)
\mathcal{abc}   "Calligraphic"/"Script" identifiers
\mathsf{abc}    Sans-serif identifiers
\mathtt{abc}    "Teletype"/"monospace"/"typewriter" identifiers
\frak{abc}      Fraktur identifiers (amstex package)
\Bbb{abc}       Double-struck (black-board bold) identifiers (amstex package)

There is nothing *in principle* preventing "internationalised" identifiers here.

Note that LaTeX (with amstex package) also has:

\boldsymbol{+}  Bold symbols (incl. sequences of symbols; \boldsymbol{+\infty})
\pmb{+}         Fallback for bold symbols ('poor man's bold'; does overtyping; useful
                if the symbol font does not have the desired symbol(s) in "true" bold)

There is no problem in introducing similar mark-up distinctions in MathML-ish
schemes, for example like this (just an example of how it could be done):

<mi>abc</mi>    Italic identifiers
<mb>abc</mb>    Bold identifiers
<mr>abc</mr>    Upright, non-bold (typically: "sup", "sin", "lim", ...)
<mc>abc</mc>    "Calligraphic"/"Script" identifiers
<ms>abc</ms>    Sans-serif identifiers
<mt>abc</mt>    "Teletype"/"monospace"/"typewriter" identifiers
<mf>abc</mf>    Fraktur identifiers
<md>abc</md>    Double-struck (black-board bold) identifiers

<mn>123</mn>    Upright non-bold numerals
<mm>123</mm>    Bold numerals
<ml>123</ml>    Italic numerals

<mo>+</mo>      Non-bold symbols
<mp>+</mp>      Bold symbols

There is nothing in principle preventing "internationalised" identifiers here.
This method does not affect Unicode in any way: no new characters at all.
But it does allow for 1) internationalised multi-letter identifiers, and 2)
bold symbols too, and that without any private use characters, plane 1
characters, or bold clones of symbols. It's more general and flexible too:
if mathematics develops so that, say, italic sans-serif becomes a new recognised
variety, no new characters need be added, just a new tag in the mark-up scheme.
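
To show how little machinery such a scheme needs, here is a minimal Python
sketch of a reader for it. The tag table is copied from the example scheme
above; apart from <mi>, <mn> and <mo>, the tags carry the meanings given in
this message for illustration, not MathML's definitions. The names `VARIETY`
and `tokens` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag table, following the example scheme in this message.
VARIETY = {
    "mi": "italic", "mb": "bold", "mr": "upright", "mc": "script",
    "ms": "sans-serif", "mt": "monospace", "mf": "fraktur",
    "md": "double-struck",
    "mn": "upright numeral", "mm": "bold numeral", "ml": "italic numeral",
    "mo": "symbol", "mp": "bold symbol",
}

def tokens(fragment):
    """Return (variety, text) for each tagged token in a mark-up fragment."""
    root = ET.fromstring("<math>" + fragment + "</math>")
    return [(VARIETY[element.tag], element.text) for element in root]

# A bold Swedish identifier, a bold symbol, and an ordinary numeral.
expr = "<mb>o\u00e4ndlig</mb><mp>+</mp><mn>1</mn>"
print(tokens(expr))
# [('bold', 'oändlig'), ('bold symbol', '+'), ('upright numeral', '1')]
```

In this sketch a new variety amounts to one more entry in the table, and the
identifier text itself stays ordinary letters in any language.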

Existing "math alpha chars" should NOT be used

The existing "math alphanumeric" characters (in the BMP) should NOT be used,
in particular not with mark-up schemes that can (and should) make the distinction
by mark-up (like <mi>i</mi>, <mc>R</mc>, etc.). That the existing "math
alphanumeric" characters (in the BMP) were ever encoded should be regarded
as a mistake.

/Kent Karlsson