L2/00-251

From: Kenneth Whistler [kenw@sybase.com]
Sent: Thursday, August 03, 2000 4:09 PM
Subject: Re: UTC Agenda item: Mathematical Letter Symbols

Mark said:

> I am concerned about the math clone characters.

Aren't we all!

> During the long
> discussions over the years with representatives of the math community,
> these characters were sold to us on the basis that they were required
> in plain-text processing. On that basis, the UTC advanced them to the
> next level, and they are now a part of the current FCD 10646-1. Cf.
> http://www.unicode.org/unicode/members/L2000/n3442/02n34421_pi-38.pdf

(When referring to this document, please make sure to pick up the
4 corrected pages that Michel had posted under n3442, since some of
the fonts with a bearing on the math alphanumerics were incorrect in
the original document 02n34421_pi-38.pdf.)

> For reasons mentioned elsewhere, they have the opportunity to cause
> not only considerable confusion among users, but problems for software
> processes, and security risks in terms of spoofing. They are all
> identical in appearance with normal letters and numbers under some
> choice of style or font, e.g.
> 
> 1D680 MATHEMATICAL MONOWIDTH CAPITAL Q
> 1D7E2 MATHEMATICAL SANS DIGIT 0
> 
> Although intended for math implementations, these characters will
> clearly leak into normal environments. If these characters are to be in
> Unicode, then our goal must be to make sure that they are useful in
> their intended implementation context, but limit the damage that they
> can do elsewhere.

The confusion and security spoofing questions should be considered in
the light of the precedent which Unicode has already set for having
a set of cloned alphabets for some compatibility functionality. It is
incorrect to imply that this is something completely *new* to the
standard that we haven't already had to deal with in some way. What
I am referring to, of course, are the fullwidth and halfwidth clone
alphabets in the FFXX block:

   fullwidth ASCII and digits
   halfwidth katakana and Hangul jamos

The fullwidth ASCII and digits are "all identical in appearance with
normal letters and numbers under some choice of style or font."
And yes, they can and do cause some confusion in users when they are
"let out of the corral" in inappropriate contexts.

The difference is that the cloned fullwidth forms clearly *do* have
normal textual functions in Asian contexts, whereas the new
math alphanumeric alphabets are being proposed for much more limited
use and not for general textual use.

More on this below.

> 
> One of the tools we have to address that is to give them the correct
> properties to reflect their real status as symbols, not as letters or
> numbers. That is, assign them as So (Symbol, Other), with no numeric
> value, no case property, no case mapping.

As in much of the recent discussion about properties, this begs the
question about the status of properties. What is at stake is not
what the *real* properties of these characters are; as I have
noted all along, for all the letterlike symbols (of which these are
clearly more instances), the characters are *both* letters (or digits)
*and* symbols. That is why we called them "letterlike symbols" in the
first place.

Rather, what is at stake is what the value assignments in the General
Category partition (and case mappings) of UnicodeData.txt are, and which 
processes they are aimed to assist (and which not). The General Category 
assignments have taken on high stakes recently precisely because they are 
used normatively(?) to define identifier syntax, and because the Java and
XML communities have come to depend on that identifier syntax, but
are concerned about what should and should not be allowed for it.

So to prevent having to grind round and round and round on this, I
would like it to be possible for the UTC to stipulate that:

1. The math styled alphabet characters *are* letters, *are* cased,
   *do* have case pairings, and *do* have script identities as
   Latin or Greek.

2. The math styled digit characters *are* digits, *do* have numeric
   values, and *are* associated with the normal Arabic digits
   (U+0030..U+0039).

3. The math alphanumerics *do* function as symbols, typically as
   independent units, and do not partake of most textual functions
   appropriate to the letters that are strung together to make words
   of normal text.

If we can get past that, we can then perhaps focus on what new
assignments and/or reassignments of General Category field of the
Unicode Character Database will cause the least trouble for Java and
XML while also causing the least disruption to other implementations
or standards.
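
A minimal sketch of how the stipulated facts show up as character data,
for illustration only. It assumes Python's standard unicodedata module,
which reports whatever Unicode version ships with the interpreter and
not the FCD draft under discussion; the helper name describe() is
hypothetical. The sample code points are the two Mark cited.

```python
import unicodedata

def describe(ch):
    """Return (General Category, decimal digit value or None, decomposition)."""
    return (
        unicodedata.category(ch),
        unicodedata.digit(ch, None),      # None when the character is not a digit
        unicodedata.decomposition(ch),    # "" when there is no decomposition
    )

# Assumption: values come from the Unicode data bundled with this Python,
# which may differ from the assignments being debated in this message.
samples = {
    "plain Q": "Q",
    "plain 0": "0",
    "math monospace Q": "\U0001D680",
    "math sans digit 0": "\U0001D7E2",
}
for label, ch in samples.items():
    print(f"U+{ord(ch):04X} {label}: {describe(ch)}")
```

The <font> tag in the decomposition field is what ties each styled
character back to its plain counterpart in U+0030..U+0039 or A-Z.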

> 
> In other words, don't give them properties like letters or digits,
> such as:
> 
> 0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;
> 0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
> etc.
> 
> instead give them properties like other symbols:
> 
> 2118;SCRIPT CAPITAL P;So;0;ON;;;;;N;SCRIPT P;;;;
> 235C;APL FUNCTIONAL SYMBOL CIRCLE UNDERBAR;So;0;L;;;;;N;;;;;
> etc.
> 
> In particular, assigning them the value 'So' will cause them not to be
> included in the recommended programming identifier syntax. I strongly
> feel that this is the correct way to go. We don't want to have these
> clones, with all their possibilities for spoofing, to occur in
> programming identifiers, XML tag names, and Java class file names,
> etc. (Note that Java class names -- identifiers -- are mirrored in the
> file name for both the source and binary.)

I think this is the heart of the problem Mark is concerned about.

However, I think we have to admit that the cat is already out of the
bag here. Why are not:

FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;
FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21

equally good candidates for spoofing as the existing:

2112;SCRIPT CAPITAL L;Lu;0;L;<font> 004C;;;;N;SCRIPT L;;;;
2113;SCRIPT SMALL L;Ll;0;L;<font> 006C;;;;N;;;;;

or the proposed new:

1D504;MATH FRAKTUR CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D51E;MATH FRAKTUR SMALL A;Ll;0;L;<font> 0061;;;;N;;;;;

If you say that well, obviously people can tell the fullwidth
versions apart from the normal versions, because they look different --
that is equally true of the script or fraktur versions. Nobody is
going to confuse a Fraktur class name with a normal class name. It
is only if you run a folding over the alphabets, to eliminate the
font/style differences that you would end up with direct confusability.
But that applies just as strongly to the fullwidth ASCII as it does
to the math style alphanumerics, as best I can tell.

Maybe we would be better off if we minimized the number of instances
in the encoding where a folding could result in a confusion like this.
But I don't think the spoofing problem is anything new here introduced
by the math alphanumerics.

After all, are we proposing to make the Turkish dotless-i a symbol
so no one could spoof a Java file name by replacing an i with a
dotless-i + combining dot above, for example? Or any other of a number
of clever "legal" spoofs that you could contrive without getting into
the math symbols at all.
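
The folding argument can be sketched as follows. This is only an
illustration: it assumes NFKC normalization from Python's standard
unicodedata module as the "folding", and it uses the Unicode data
shipped with the interpreter. The helper name fold() is hypothetical.

```python
import unicodedata

def fold(text):
    """Apply a compatibility folding (NFKC) that drops width and font distinctions."""
    return unicodedata.normalize("NFKC", text)

clones = {
    "fullwidth A": "\uFF21",
    "script L": "\u2112",
    "math fraktur A": "\U0001D504",
}
for label, ch in clones.items():
    # Each clone stays distinct until folded, then collapses to a plain letter.
    print(f"{label}: U+{ord(ch):04X} -> {fold(ch)!r}")

# The dotless-i spoof uses no compatibility character at all, so the
# folding leaves it distinct from a plain "i".
spoof = "\u0131\u0307"  # dotless i + combining dot above
print(fold(spoof) == "i")  # prints False
```

All three clones collapse to a plain letter under the same folding,
while the dotless-i sequence passes through unchanged.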

> 
> Math equations will have their own rules for identifiers; those should
> not be confused with the standard recommendations for normal text
> processing. As Murray points out, "...the characters are separate
> symbols, e.g., they don't get grouped into natural language words"
> (unicode@unicode.org Mon, 17 Jul 2000)

However, in fairness, we should point to Kent's opposite point of
view, where he sees math-type style distinctions being widely used
in computer science for multi-letter *identifiers*, rather than
for variables as usually seen in math. In this instance, I think
the correct approach is to make use of normal styles and/or markup
for the computer science style types, where bolding, font shifts,
etc., are applied to generic words (and not just to A-Z, a-z), while
constraining the math style alphanumerics to use as independent
math variable (and constant, and other) symbols.

> 
> These characters should also not have case mappings -- where
> characters are treated as math symbols, case is not just a minor
> variation, they change meaning when they change case.

On this point, I completely agree. Even though there are clear
case *pairs*, I don't think the data file should list default
case *mappings* for the pairs. This is already the precedent
we have set for other letterlike symbols. See the script l's
listed above.
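
A small sketch of the distinction between being cased and having a
case mapping, for illustration only. It assumes Python's built-in
str.lower() and str.upper() as stand-ins for the default case
mappings, with the Unicode data shipped with the interpreter; the
helper name has_default_case_mapping() is hypothetical.

```python
import unicodedata

def has_default_case_mapping(ch):
    """True if default case conversion changes the character in either direction."""
    return ch.lower() != ch or ch.upper() != ch

# Script L (capital and small) versus an ordinary capital L.
for ch in ("\u2112", "\u2113", "L"):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        has_default_case_mapping(ch),
    )
```

The script l's report a cased General Category but no mapping, which
is the pattern proposed here for the math alphanumerics.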

> 
> I realize quite well that this approach changes the direction that we
> had been following with regard to the letter-like symbols,

Mark's suggested approach would change the direction with respect to
General Category assignments, but would *not* change the direction
already established for case mappings.

> but we have
> *not* had complete copies of alphabets before, 

Fullwidth ASCII.

> so what was a small
> cyst has the prospect of becoming a malignant tumor. (Ok, the language
> is a bit overblown, but you get my point).
> 
> 
> Now there is a complication: what to do about the current letter-like
> symbols, such as:
> 
> 2112;SCRIPT CAPITAL L;Lu;0;L;<font> 004C;;;;N;SCRIPT L;;;;
> 2118;SCRIPT CAPITAL P;So;0;ON;;;;;N;SCRIPT P;;;;
> 
> This issue is important, because these letters are used to 'fill in'
> holes in the new allocations.
> 
> 1D454 MATHEMATICAL ITALIC SMALL G
> 1D455 (This position shall not be used)
> 1D456 MATHEMATICAL ITALIC SMALL I
> 
> Instead of 1D455, one is to use (I believe) the currently letterlike
> italic small h:
> 
> 210E;PLANCK CONSTANT;Ll;0;L;<font> 0068;;;;N;;;;;

Yes.

> 
> Luckily, these characters are not in frequent use, so if we need to
> change their properties at this point for consistency, we have a
> certain degree of freedom. 

Less of a degree of freedom than Mark may be implying, however. As
my previous discussion on this topic pointed out, monkeying with
the General Category at this point impacts collation and would change
our definition of identifier in such a way as to once again disconnect
it from TR 10176, which we just amended to *synch* with our definition
of identifier. Changing letterlike symbols from Lu/Ll to So would
also be *introducing* more inconsistencies of the type where application
of a compatibility folding on an otherwise non-composite character
changes its category. If we are looking for consistency in our
application of properties, we shouldn't neglect unintended consequences
that *increase* character set entropy.

Perhaps we need to go that route, but the waves of interlocking
implications are substantial.
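
To make the "folding changes its category" point concrete, here is a
minimal sketch. It assumes NFKC from Python's standard unicodedata
module as the compatibility folding and reads the Unicode data shipped
with the interpreter; the helper name category_shift() is hypothetical.

```python
import unicodedata

def category_shift(ch):
    """Return (own General Category, categories of the NFKC-folded result)."""
    folded = unicodedata.normalize("NFKC", ch)
    return unicodedata.category(ch), [unicodedata.category(c) for c in folded]

# SUPERSCRIPT TWO is an existing case where folding changes the category;
# SCRIPT CAPITAL L is one where it does not.
for ch in ("\u00B2", "\u2112"):
    print(f"U+{ord(ch):04X}", category_shift(ch))
```

Moving the letterlike symbols from Lu/Ll to So would shift them from
the second pattern to the first.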

> (This would also help to resolve some
> anomalies in having characters with case, but no case mappings:
> http://www.unicode.org/unicode/reports/tr21/charts/CaseChart7.html.)

See my comments above regarding stipulation of the facts. I don't
think we can gain enforced consistency in this area by trying to
manipulate the poor, overused General Category value.

> 
>      I am sympathetic to Ken's call to arms to more closely
>      control the properties for Unicode characters, and in
>      particular to make all the general category properties
>      normative. (Cf.
>      http://www.unicode.org/Public/UNIDATA/UnicodeData.html).
> 
>      Were it not for the looming prospect of the full set of math
>      clones, I would say just let sleeping dogs lie. However, we
>      are faced with that situation, and need to consider all the
>      ramifications. 

I have no quarrel with that point of view. I am trying to point out
some of the ramifications.

>      We can't lock the barn before making sure
>      that the horses are in their stalls. (ok, mixing metaphors)
>      Once we fix this issue, then I think we are ready to take
>      the step of making all the general category properties
>      normative.
> 
> To recapitulate, we are faced with two main choices for the math
> clones:
> 
> 1. Make the math clones symbols.

Translation: Give them the "So" General Category in the UCD. We
don't have to "make" them symbols -- they already *are* symbols
*and* letters.

> 1.a. and revise the properties for the 'filler' letter-like symbols
> for consistency.
> 1.b. and leave the letter-like symbols as is, accept the
> inconsistency.
> 1.c. and leave the letter-like symbols as is, fill in the holes such
> as 1D455.

Of these, I consider 1.c *completely* unacceptable, as it would constitute
the intentional introduction into the standard of 25 utter duplicates
as bad as the Ohm sign and Angstrom sign. The UTC already decided to
leave those holes, and heading down the path of 1.c would clearly
require a reconsideration vote, as far as I am concerned.

1.b would have the least impact on any *existing* code, tables, or
standards. So of these 3 choices, 1.b is clearly the conservative
choice. It would, however, introduce a principled inconsistency
between the letterlike symbols of 21XX and the letterlike symbols
of 1D400..1D7FF. People need to consider if they can live with the
implications of that inconsistency. Among those implications, all
things else staying the same, is that the letterlike symbols of 21XX
would be valid in identifiers and the new alphanumeric letterlike
symbols on Plane 1 would not.

Option 1.a disrupts the most, by deliberately changing the category
of existing letterlike symbols. It would change the behavior of existing
APIs, change the class assignments of these characters in identifiers
(or categories related to identifier syntax), and would impact the
implementation of the code now generating weight tables for collation.

> 
> 2. Make the math clones like the current letter-like symbols.

Unlike for Mark, this is my own strong preference. It has the combined
virtues of no disruption of current category assignments and
consistency of assignments for characters that are clearly intended
to fill out the complementary set against the existing letterlike
symbols.

If the goal here is to keep 1D400..1D7FF out of Java and XML
identifiers, I have yet to be convinced why this couldn't be handled
by another simple production rule that directly excluded 1D400..1D7FF
from the allowed members of the identifier_start class.
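
A sketch of what such a production rule might look like, under stated
assumptions: the set of General Categories admitted at identifier
start (Lu, Ll, Lt, Lm, Lo, Nl) is assumed here for illustration, the
names START_CATEGORIES, MATH_ALPHANUMERICS, and is_identifier_start()
are hypothetical, and category values come from Python's standard
unicodedata module.

```python
import unicodedata

# Assumed for illustration: General Categories admitted at identifier start.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# The range named in the text, 1D400..1D7FF inclusive.
MATH_ALPHANUMERICS = range(0x1D400, 0x1D800)

def is_identifier_start(ch):
    """Category-based rule plus one extra production excluding 1D400..1D7FF."""
    if ord(ch) in MATH_ALPHANUMERICS:
        return False
    return unicodedata.category(ch) in START_CATEGORIES

for ch in ("A", "\u2112", "\U0001D504", "0"):
    print(f"U+{ord(ch):04X}", is_identifier_start(ch))
```

The exclusion is independent of the General Category, so the math
alphanumerics could keep Lu/Ll and still be barred from identifiers.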

If the concern is that *no* letterlike symbol should be allowed in
an identifier, then adjust the identifier syntax accordingly. This
would require revisiting TR 10176 and would require people to adjust
their implementations to update against the revised statement of
identifier syntax, but would have less significant ramifications than
Option 1.a above.

What am I missing here? What other significant processes would be
benefited so much by changing the existing letterlike symbols from Lu/Ll
to So, or would be significantly harmed by the assignment of Lu/Ll
to the letterlike symbols of the math alphanumerics?

--Ken

> 
> To limit the damage that these characters do, I strongly feel that we
> should choose #1. I have my favorite among 1a, 1b, and 1c, but any
> would be better than #2.
> 
> Mark
> 
>