L2/00-240

Kenneth Whistler <kenw@sybase.com> on 07/07/2000 12:13:38 PM

Please respond to unicore@unicode.org

Subject:  Re: UTC Agenda item: Mathematical Letter Symbols


Mark suggested:

> There are two topics that we need to cover at the next meeting having
> to do with the Mathematical Letter Symbols
>
> 1. Do we recommend the use of these characters in rich-text
> environments: in environments like MathML that have rich enough
> structure to encode the proper information (and more, of course)?
>
> 2. Do we categorize these characters as Letters or as Symbols?

I have no quarrel with the need to take up this discussion at the
UTC meeting. We do need to make a determination on these.

However, I disagree with Mark's conclusions here.

>
> Here are my thoughts on them.
>
> 1. Markup.
>
> Fundamentally, once the characters are encoded in Unicode 3.1, and are
> used in accordance with their plaintext semantics, their use is
> conformant even in environments where they would be better replaced by
> markup or other out-of-band information. So in some sense, the only
> thing the UTC can do is make a recommendation. However, we should try
> to give guidance on the use of these characters and their interaction
> with markup. Since mathematics (except for fragments) has
> fundamentally a non-linear structure, thus requiring markup or
> equivalent for correct representation, and since mathematics is
> fundamentally generative (with some inventive mathematicians somewhere
> using some interesting glyphs to convey some distinction), I think our
> recommendation should be to replace the clones with markup in
> interchange.

It is my understanding that both the MathML community and the math
layout software companies want these encoded as characters precisely
so they don't need to do markup (or apply styles) to these. They
want to be able to have these as primitive entities in the underlying
representation -- i.e. as characters. Making the recommendation to
replace the alphanumeric symbols with markup in interchange is contra
to the reason for encoding them in the first place.

Note that your recommendation would also be tantamount to a recommendation
to replace the letterlike symbols (in block 21XX) with markup in
interchange -- and that is starting down the very slippery slope
of trying to get people to replace the use of compatibility
characters with marked-up Cleanicode.

I realize that UTR #20 is having to confront these issues, in
making recommendations for use of Unicode in XML and other
markup languages. However, it is one thing to recommend the non-use
of certain characters in a context *when you are using* a markup
language for interchange. It is another thing to generically recommend
the non-use of certain characters and their replacement by markup
in interchange. Further, I think we may need to distinguish different
situations even when markup languages are being used. Recommending
the non-use of the alphanumeric symbols in MathML, when the MathML
designers *want* to use them instead of markup, seems perverse to me.

>
> 2. Symbols
> The only basis for adding these characters is that they are NOT
> treated as letters -- that they are treated as symbols.

This is manifestly not the case. Yes, they are treated as symbols,
but that is not the "only basis" for adding them as characters.
They are encoded as characters for compatibility with existing
practice in Mathematica. They are encoded as characters to avoid
having to encode combining math style marks as characters. They
are encoded as characters to have basic entities to make textual
distinctions used by mathematicians without having to introduce
style markup to maintain those distinctions.

> Categorizing
> them as Sm -- mathematical symbols -- will result in more applications
> correctly handling them, and distinguishing them from the true letters.
>
> For consistency, we should revisit the few scattered characters in the
> BMP that are filling holes in the math characters, as listed in
> http://www.unicode.org/unicode/reports/tr24/charts/ScriptChart0.html.

As is often the case, this is a good idea just begging for trouble.

Effectively, Mark is arguing here to give up on the category
assignments for letterlike symbols that have been in the standard
since Unicode 2.0, to change the Lu's and Ll's in that set to Sm's.

Here are examples from UnicodeData-2.0.14.txt (the release version for
Unicode 2.0):

2102;DOUBLE-STRUCK CAPITAL C;Lu;0;ON;<font> 0043;;;;N;DOUBLE-STRUCK C;;;;
2108;SCRUPLE;So;0;ON;;;;;N;;;;;

Note that way back when, for Unicode update version 2.1.5, the bidi
categories for these anomalous Lu and Ll letterlike symbols were
corrected, to make them consistent with other letters. This was
a decision that UTC made explicitly. Here is UnicodeData-2.1.5.txt:

2102;DOUBLE-STRUCK CAPITAL C;Lu;0;L;<font> 0043;;;;N;DOUBLE-STRUCK C;;;;
2108;SCRUPLE;So;0;ON;;;;;N;;;;;

So Lurking Problem #1 is that changing the general category from Lu or
Ll for these will reintroduce the problem of inconsistency in their
bidirectional handling. Should the L's all be changed back to ON's if
these are changed to Sm's? Any change in bidi properties of the
existing repertoire now would have consequences for existing
implementations.
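
The difference between the two quoted releases can be checked mechanically. The following is a minimal sketch, assuming only the semicolon-separated layout visible in the quoted records; the field labels and the helpers parse_record and changed_fields are names chosen for this illustration, not part of any official tool.

```python
# Labels for the first six fields as they appear in the quoted records.
# The names are chosen for this sketch only.
FIELDS = ("code", "name", "general_category", "combining_class",
          "bidi_category", "decomposition")

def parse_record(line):
    """Split one UnicodeData line and keep the first six fields."""
    return dict(zip(FIELDS, line.split(";")))

V2_0_14 = [
    "2102;DOUBLE-STRUCK CAPITAL C;Lu;0;ON;<font> 0043;;;;N;DOUBLE-STRUCK C;;;;",
    "2108;SCRUPLE;So;0;ON;;;;;N;;;;;",
]
V2_1_5 = [
    "2102;DOUBLE-STRUCK CAPITAL C;Lu;0;L;<font> 0043;;;;N;DOUBLE-STRUCK C;;;;",
    "2108;SCRUPLE;So;0;ON;;;;;N;;;;;",
]

def changed_fields(old_lines, new_lines):
    """Return {code: {field: (old, new)}} for every field that differs."""
    old = {r["code"]: r for r in map(parse_record, old_lines)}
    new = {r["code"]: r for r in map(parse_record, new_lines)}
    diffs = {}
    for code in old.keys() & new.keys():
        delta = {f: (old[code][f], new[code][f])
                 for f in FIELDS if old[code][f] != new[code][f]}
        if delta:
            diffs[code] = delta
    return diffs

print(changed_fields(V2_0_14, V2_1_5))
```

Run on the four quoted lines, the only difference reported is the bidi category of U+2102, which is the correction described above.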

Lurking Problem #2 is that the issue extends beyond the set of
characters that were omitted from the repertoire of new alphanumeric
symbols because they were already encoded as letterlike symbols.
In particular, U+2107 EULER CONSTANT, U+210F PLANCK CONSTANT OVER TWO PI,
and the 4 Hebrew symbols U+2135 ALEF SYMBOL .. U+2138 DALET SYMBOL
also have Lu, Ll, or Lo general categories (and bidi category L). I
am presuming that Mark, for consistency, would want to switch these
also to Sm, even though there is no complementarity issue here with
the new alphanumeric alphabets on Plane 1.

Lurking Problem #3 is that this change has ramifications for identifiers.
Lu, Ll, and Lo are all among the general category values recommended
for inclusion in identifiers. Sm (and So) are *not*. (See page 135 of
the Unicode Standard.) So at this point, any changes from Lx to Sx
for a character in the UCD constitutes a recommendation to change
the acceptable repertoire for identifiers. That will impact some
implementations -- and it will put the Unicode Standard recommendation
out of synch, once again, with Annex A to ISO TR 10176, after we just went
through the exercise of pushing through an Amendment to that TR, so
that they *would* be in synch.
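
To make the identifier consequence concrete, here is a minimal sketch of a category-driven identifier check. The set IDENTIFIER_START and the function may_start_identifier are hypothetical names; the message itself names only Lu, Ll, and Lo as included and Sm and So as excluded, so the other members of the set are an assumption.

```python
# Assumption: general categories allowed to start an identifier.
# Lu, Ll and Lo come from the message; Lt, Lm and Nl are assumed here.
IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def may_start_identifier(general_category):
    """True if a character with this category may begin an identifier."""
    return general_category in IDENTIFIER_START

# U+2102 under its current category and under the proposed one.
print(may_start_identifier("Lu"))  # current assignment
print(may_start_identifier("Sm"))  # proposed assignment
```

Any implementation built this way changes its accepted identifier repertoire as soon as the table entry changes, with no change to the code.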

Lurking Problem #4 is that the proposed change would impact case
tables. Changing Lu or Ll to Sm implies that we are recommending
that the character no longer be considered upper- or lower-case.
True, none of these letterlike symbols have case *mappings* now. But
any API which is currently returning True for isuppercase() or
islowercase() for these letters, should return False after this
change -- meaning changes in tables.
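
A minimal sketch of that effect, assuming an API that derives its case predicates purely from the General Category. The function names and the two small tables are hypothetical, and the Ll value shown for U+210F reflects my understanding of the UCD rather than anything quoted in this message.

```python
def is_uppercase(general_category):
    return general_category == "Lu"

def is_lowercase(general_category):
    return general_category == "Ll"

def flipped_answers(current, proposed):
    """List (code point, predicate name) pairs whose result would change."""
    flips = []
    for cp in sorted(current):
        for pred in (is_uppercase, is_lowercase):
            if pred(current[cp]) != pred(proposed[cp]):
                flips.append((f"U+{cp:04X}", pred.__name__))
    return flips

# Illustrative subset: current categories versus the proposed Sm values.
CURRENT = {0x2102: "Lu", 0x210F: "Ll", 0x2108: "So"}
PROPOSED = {0x2102: "Sm", 0x210F: "Sm", 0x2108: "So"}

print(flipped_answers(CURRENT, PROPOSED))
```

No case mapping is involved; only the predicate answers change, which is why the impact shows up in tables rather than in mapping data.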

Further, I should remind people that the math property for characters
cannot be predicted from the general category value in the UCD anyway.
So if the issue is consistency of detection and behavior for these
characters by *mathematical* applications, then changing their
general category from Lu or Ll to Sm is basically moot. Math
applications also have to detect regular Greek letters (Lu, Ll) and lots
of punctuation with a whole variety of general category assignments.

The only consistency-of-handling argument that I see holding any
water is for non-mathematical, general text applications to consistently
determine that the math alphanumerics and the relevant letterlike symbols
are *not* "true letters", as Mark puts it. And to do that, rather
than jiggering, once again, the sorely overloaded and imprecise
General Category values in the UCD, we should be examining the
math property for consistency instead. If we simply corrected that
listing, so that all of these letterlike symbols that pattern with
the math alphanumerics are consistently given the math property (along
with the 6 other outliers in the letterlike symbols I noted above),
then a general application has a way of telling these things apart by
property, as long as the math property is made available to it.
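
As a rough illustration of that alternative, the sketch below leaves the General Category untouched and consults a separate math property. The MATH set is a hypothetical, partial listing containing only code points named in this message, and is_true_letter is a name invented for the example.

```python
# Hypothetical, partial math-property listing: only code points named in
# this message. A real listing would be much larger.
MATH = {0x2102, 0x2107, 0x210F, 0x2135, 0x2136, 0x2137, 0x2138}

# General Category values left exactly as they are today.
GENERAL_CATEGORY = {0x0043: "Lu", 0x2102: "Lu", 0x2108: "So"}

def is_true_letter(cp):
    """A letter by General Category that does not carry the math property."""
    return GENERAL_CATEGORY[cp].startswith("L") and cp not in MATH

for cp in (0x0043, 0x2102, 0x2108):
    print(f"U+{cp:04X}", is_true_letter(cp))
```

A general text application gets the distinction it wants, while identifier, bidi, and case behavior that depend on Lu or Ll are unaffected.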

>
> I feel sufficiently strongly about this that if we cannot agree to
> change the few scattered characters, that we should go ahead and fill
> the holes, mark them all as Sm, and discourage the use of the
> scattered ones.

I wish this were stated as more than a sufficiently strong feeling.
What are the envisioned implementation problems or textual
interpretation catastrophes that make the current property assignments
untenable?

Why would a review and correction of the math property assignment
not be a sufficient solution, without introducing the lurking problems
associated with changing a normative Lu or Ll assignment?

And what is so wrong with the current situation that it would lead you
to countenance:

   A. Encoding 25 characters that would manifestly be duplicates
      for 25 that we already have -- after both the UTC and WG2 took
      explicit decisions to omit those 25 from the Plane 1 repertoire
      for the math alphanumerics because they were already encoded.

   B. Deprecating 25 characters that people are already using in their
      intended usage, and tell them to use 25 others instead.

One of the conclusions that this is leading me (and I suspect Asmus) to
is that it is time to stop the fiction that some General Category
assignments are not normative. At this point any change to any value
of General Category in the UCD can impact many implementations. It is
time for the UTC to simply declare *all* values of that field to be
normative, but then to clarify what conformance to the General Category
values means. We have crossed a watershed for UnicodeData.txt, in
particular.
Just as we once accepted wholesale name changes for characters in the
Unicode Standard (when no one was depending on their values, and when
there was a good reason to do so), but now refuse to change *any* name,
we are approaching the point where making changes to values in
UnicodeData.txt is no longer a matter of "fixing" things, because any
change breaks as many things as it might fix. The interdependencies have
grown.

Secondly, I want to point out yet again the thing I have been harping
on for several years now. The General Category field in UnicodeData.txt
is overloaded and ill-designed for solving all character property-related
problems. It serves a bunch of useful purposes, but we cannot keep
loading it with more and more requirements and keep expecting to
get all program behavior to be consistent by jiggering the assignments
of characters here and there. The fact is that U+2102 DOUBLE-STRUCK
CAPITAL C *is* a letter, and *is* uppercase. It also functions as
a symbol, and in particular as a math symbol. It also may be appropriate
as an identifier (certainly for an implementation of a mathematical
algebra that treated mathematical variables as formal identifiers
in the syntax). Trying to squash all those together into a single
category assignment in the General Category partition is just the
wrong thing to do.
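
One way to picture the point is to model the roles as independent flags rather than as a single partition. This is only a sketch under that assumption; the Role enumeration and has_role are hypothetical names, not properties defined by the standard.

```python
from enum import Flag, auto

class Role(Flag):
    """Independent roles; a character may carry several at once."""
    LETTER = auto()
    UPPERCASE = auto()
    SYMBOL = auto()
    MATH = auto()
    IDENTIFIER = auto()

# Roles this message attributes to U+2102 DOUBLE-STRUCK CAPITAL C.
DOUBLE_STRUCK_C = (Role.LETTER | Role.UPPERCASE | Role.SYMBOL
                   | Role.MATH | Role.IDENTIFIER)

def has_role(roles, role):
    """True if `role` is among the combined `roles`."""
    return role in roles

print(has_role(DOUBLE_STRUCK_C, Role.UPPERCASE))
print(has_role(DOUBLE_STRUCK_C, Role.MATH))
```

A partition forces exactly one answer per character; a flag set lets each consumer ask only the question it cares about.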

--Ken


Kenneth Whistler <kenw@sybase.com> on 07/10/2000 11:30:49 AM

Please respond to unicore@unicode.org

To:   "Multiple Recipients of Unicore" <unicore@unicode.org>
cc:   kenw@sybase.com
Subject:  Re: UTC Agenda item: Mathematical Letter Symbols



Mark said in response to Michael Everson:

> Unicode/10646 are peppered with duplicate characters, introduced or
> inherited for one reason or another. The math letter clones just add
> more, so this is not a new phenomenon.
>
> That's one of the reasons we had to come up with compatibility mappings!

There are duplicates, and then there are duplicates. It is not useful
to paper over the distinction so glibly here.

The duplicates that the UTC openly claims to be duplicates are those
that we give singleton canonical mappings to, e.g. EM QUAD, EN QUAD,
the two Vietnamese tone marks, koronis, erotimatiko (Greek question mark),
and the duplicated Han characters in F9XX and FAXX, among others.

The characters getting compatibility mappings are those for which some
significant distinction (often a formatting related one) from the
fundamental character or characters they are equated to is being maintained
in a source standard or other source. Hence the half-width and full-width
characters, for example. These are not truly duplicates in either the
sources or in Unicode -- they are *kinds* of, rather than *duplicates* of.

The math alphanumerics clearly fall into the latter category, as is
demonstrated by the longstanding treatment of the small group of these
already encoded among the letterlike symbols.
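
The two kinds of mapping can be told apart from the decomposition field alone. Below is a minimal sketch using Python's standard unicodedata module; mapping_kind is a hypothetical helper, and the results depend on the Unicode data shipped with the Python build in use.

```python
import unicodedata

def mapping_kind(ch):
    """Classify a character by the form of its decomposition field."""
    field = unicodedata.decomposition(ch)
    if not field:
        return "none"
    if field.startswith("<"):
        # A tag such as <font> or <wide> marks a compatibility mapping.
        return "compatibility"
    # No tag means canonical; a single target makes it a singleton.
    return "canonical singleton" if len(field.split()) == 1 else "canonical"

# EN QUAD, GREEK QUESTION MARK, FULLWIDTH A, DOUBLE-STRUCK CAPITAL C
for ch in ("\u2000", "\u037E", "\uFF21", "\u2102"):
    print(f"U+{ord(ch):04X}", mapping_kind(ch))
```

The tagged entries are the "kinds of" cases; the untagged single-target entries are the openly acknowledged duplicates.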

--Ken

>
> Mark
>
> Michael Everson wrote:
>
> >
> > Um, there shall be no duplicate characters. (Except of course for CJK
> > radicals.)
> >
> > ...