L2/00-251

From: Kenneth Whistler [kenw@sybase.com]
Sent: Thursday, August 03, 2000 4:09 PM
Subject: Re: UTC Agenda item: Mathematical Letter Symbols

Mark said:

> I am concerned about the math clone characters.

Aren't we all!

> During the long
> discussions over the years with representatives of the math community,
> these characters were sold to us on the basis that they were required
> in plain-text processing. On that basis, the UTC advanced them to the
> next level, and they are now a part of the current FCD 10646-1. Cf.
> http://www.unicode.org/unicode/members/L2000/n3442/02n34421_pi-38.pdf

(When referring to this document, please make sure to pick up the
4 corrected pages that Michel had posted under n3442, since some of
the fonts with a bearing on the math alphanumerics were incorrect in
the original document 02n34421_pi-38.pdf.)

> For reasons mentioned elsewhere, they have the opportunity to cause
> not only considerable confusion among users, but problems for software
> processes, and security risks in terms of spoofing. They are all
> identical in appearance with normal letters and numbers under some
> choice of style or font, e.g.
>
> 1D680 MATHEMATICAL MONOWIDTH CAPITAL Q
> 1D7E2 MATHEMATICAL SANS DIGIT 0
>
> Although intended for math implementations, these characters will
> clearly leak into normal environments. If these characters are to be in
> Unicode, then our goal must be to make sure that they are useful in
> their intended implementation context, but limit the damage that they
> can do elsewhere.

The confusion and security spoofing questions should be considered
in the light of the precedent which Unicode has already set for
having a set of cloned alphabets for some compatibility functionality.
It is incorrect to imply that this is something completely *new* to
the standard that we haven't already had to deal with in some way.
What I am referring to, of course, are the fullwidth and halfwidth
clone alphabets in the FFXX block:

   fullwidth ASCII and digits
   halfwidth katakana and Hangul jamos

The fullwidth ASCII and digits are "all identical in appearance with
normal letters and numbers under some choice of style or font." And
yes, they can and do cause some confusion in users when they are
"let out of the corral" in inappropriate contexts. The difference is
that the cloned fullwidth forms clearly *do* have normal textual
functions in Asian contexts, whereas the new math alphanumeric
alphabets are being proposed for much more limited use and not for
general textual use. More on this below.

>
> One of the tools we have to address that is to give them the correct
> properties to reflect their real status as symbols, not as letters or
> numbers. That is, assign them as So (Symbol, Other), with no numeric
> value, no case property, no case mapping.

As in much of the recent discussion about properties, this begs the
question about the status of properties. What is at stake is not
what the *real* properties of these characters are; as I have noted
all along, for all the letterlike symbols (of which these are clearly
more instances), the characters are *both* letters (or digits) *and*
symbols. That is why we called them "letterlike symbols" in the first
place.

Rather, what is at stake is what the value assignments in the
General Category partition (and case mappings) of UnicodeData.txt are,
and which processes they are aimed to assist (and which not). The
General Category assignments have taken on high stakes recently
precisely because they are used normatively(?) to define identifier
syntax, and because the Java and XML communities have come to depend
on that identifier syntax, but are concerned about what should and
should not be allowed for it.

So to prevent having to grind round and round and round on this, I
would like it to be possible for the UTC to stipulate that:

1.
The math styled alphabet characters *are* letters, *are* cased, *do*
have case pairings, and *do* have script identities as Latin or Greek.

2. The math styled digit characters *are* digits, *do* have numeric
values, and *are* associated with the normal Arabic digits
(U+0030..U+0039).

3. The math alphanumerics *do* function as symbols, typically as
independent units, and do not partake of most textual functions
appropriate to the letters that are strung together to make words of
normal text.

If we can get past that, we can then perhaps focus on what new
assignments and/or reassignments of General Category field of the
Unicode Character Database will cause the least trouble for Java and
XML while also causing the least disruption to other implementations
or standards.

>
> In other words, don't give them properties like letters or digits,
> such as:
>
> 0051;LATIN CAPITAL LETTER Q;Lu;0;L;;;;;N;;;;0071;
> 0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
> etc.
>
> instead give them properties like other symbols:
>
> 2118;SCRIPT CAPITAL P;So;0;ON;;;;;N;SCRIPT P;;;;
> 235C;APL FUNCTIONAL SYMBOL CIRCLE UNDERBAR;So;0;L;;;;;N;;;;;
> etc.
>
> In particular, assigning them the value 'So' will cause them not to be
> included in the recommended programming identifier syntax. I strongly
> feel that this is the correct way to go. We don't want to have these
> clones, with all their possibilities for spoofing, to occur in
> programming identifiers, XML tag names, and Java class file names,
> etc. (Note that Java class names -- identifiers -- are mirrored in the
> file name for both the source and binary.)

I think this is the heart of the problem Mark is concerned about.
However, I think we have to admit that the cat is already out of the
bag here.
Why are not:

FF21;FULLWIDTH LATIN CAPITAL LETTER A;Lu;0;L;<wide> 0041;;;;N;;;;FF41;
FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21

equally good candidates for spoofing as the existing:

2112;SCRIPT CAPITAL L;Lu;0;L;<font> 004C;;;;N;SCRIPT L;;;;
2113;SCRIPT SMALL L;Ll;0;L;<font> 006C;;;;N;;;;;

or the proposed new:

1D504;MATH FRAKTUR CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;
1D51E;MATH FRAKTUR SMALL A;Ll;0;L;<font> 0061;;;;N;;;;;

If you say that well, obviously people can tell the fullwidth versions
apart from the normal versions, because they look different -- that
is equally true of the script or fraktur versions. Nobody is going to
confuse a Fraktur class name with a normal class name.

It is only if you run a folding over the alphabets, to eliminate the
font/style differences that you would end up with direct
confusability. But that applies just as strongly to the fullwidth
ASCII as it does to the math style alphanumerics, as best I can tell.

Maybe we would be better off if we minimized the number of instances
in the encoding where a folding could result in a confusion like this.
But I don't think the spoofing problem is anything new here introduced
by the math alphanumerics. After all, are we proposing to make the
Turkish dotless-i a symbol so no one could spoof a Java file name by
replacing an i with a dotless-i + combining dot above, for example?
Or any other of a number of clever "legal" spoofs that you could
contrive without getting into the math symbols at all.

>
> Math equations will have their own rules for identifiers; those should
> not be confused with the standard recommendations for normal text
> processing.
> As Murray points out, "...the characters are separate
> symbols, e.g., they don't get grouped into natural language words"
> (unicode@unicode.org Mon, 17 Jul 2000)

However, in fairness, we should point to Kent's opposite point of
view, where he sees math-type style distinctions being widely used in
computer science for multi-letter *identifiers*, rather than for
variables as usually seen in math.

In this instance, I think the correct approach is to make use of
normal styles and/or markup for the computer science style types,
where bolding, font shifts, etc., are applied to generic words (and
not just to A-Z, a-z), while constraining the math style alphanumerics
to usage as independent math variable (and constant, and other)
symbolic usage.

>
> These characters should also not have case mappings -- where
> characters are treated as math symbols, case is not just a minor
> variation, they change meaning when they change case.

On this point, I completely agree. Even though there are clear case
*pairs*, I don't think the data file should list default case
*mappings* for the pairs. This is already the precedent we have set
for other letterlike symbols. See the script l's listed above.

>
> I realize quite well that this approach changes the direction that we
> had been following with regard to the letter-like symbols,

Mark's suggested approach would change the direction with respect to
General Category assignments, but would *not* change the direction
already established for case mappings.

> but we have
> *not* had complete copies of alphabets before,

Fullwidth ASCII.

> so what was a small
> cyst has the prospect of becoming a malignant tumor. (Ok, the language
> is a bit overblown, but you get my point).
>
>
> Now there is a complication: what to do about the current letter-like
> symbols, such as:
>
> 2112;SCRIPT CAPITAL L;Lu;0;L;<font> 004C;;;;N;SCRIPT L;;;;
> 2118;SCRIPT CAPITAL P;So;0;ON;;;;;N;SCRIPT P;;;;
>
> This issue is important, because these letters are used to 'fill in'
> holes in the new allocations.
>
> 1D454 MATHEMATICAL ITALIC SMALL G
> 1D455 (This position shall not be used)
> 1D456 MATHEMATICAL ITALIC SMALL I
>
> Instead of 1D455, one is to use (I believe) the currently letterlike
> italic small h:
>
> 210E;PLANCK CONSTANT;Ll;0;L;<font> 0068;;;;N;;;;;

Yes.

>
> Luckily, these characters are not in frequent use, so if we need to
> change their properties at this point for consistency, we have a
> certain degree of freedom.

Less of a degree of freedom than Mark may be implying, however. As my
previous discussion on this topic pointed out, monkeying with the
General Category at this point impacts collation and would change our
definition of identifier in such a way as to once again disconnect it
from TR 10176, which we just amended to *synch* with our definition of
identifier.

Changing letterlike symbols from Lu/Ll to So would also be
*introducing* more inconsistencies of the type where application of a
compatibility folding on an otherwise non-composite character changes
its category. If we are looking for consistency in our application of
properties, we shouldn't neglect unintended consequences that
*increase* character set entropy.

Perhaps we need to go that route, but the waves of interlocking
implications are substantial.

> (This would also help to resolve some
> anomalies in having characters with case, but no case mappings:
> http://www.unicode.org/unicode/reports/tr21/charts/CaseChart7.html.)

See my comments above regarding stipulation of the facts. I don't
think we can gain enforced consistency in this area by trying to
manipulate the poor, overused General Category value.
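
The point above about a compatibility folding changing a character's
category can be checked mechanically. What follows is only a minimal
sketch, assuming Python's standard unicodedata module (whose tables
reflect a much later version of the Unicode Character Database than
the one under discussion here) and using NFKC normalization as a
stand-in for "compatibility folding". The function name
folding_changes_category is hypothetical.

```python
import unicodedata


def folding_changes_category(ch: str) -> bool:
    """Return True if NFKC-folding one character changes its General Category.

    Assumption: NFKC stands in for the "compatibility folding" in the text.
    A folding may expand to several characters, so each one is compared
    against the category of the original character.
    """
    original = unicodedata.category(ch)
    folded = unicodedata.normalize("NFKC", ch)
    return any(unicodedata.category(c) != original for c in folded)


if __name__ == "__main__":
    # SCRIPT CAPITAL L, FULLWIDTH LATIN CAPITAL LETTER A, TRADE MARK SIGN
    for ch in ("\u2112", "\uff21", "\u2122"):
        folded = unicodedata.normalize("NFKC", ch)
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            "->",
            folded,
            folding_changes_category(ch),
        )
```

With the letterlike symbols categorized as Lu/Ll, folding them to the
plain letters leaves the category alone; a character categorized as So
whose folding yields letters (TRADE MARK SIGN is one existing case)
is the kind of inconsistency that a move to So would multiply.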
>
> I am sympathetic for Ken's call to arms to more closely
> control the properties for Unicode characters, and in
> particular to make all the general category properties
> normative. (Cf.
> http://www.unicode.org/Public/UNIDATA/UnicodeData.html).
>
> Were it not for the looming prospect of the full set of math
> clones, I would say just let sleeping dogs lie. However, we
> are faced with that situation, and need to consider all the
> ramifications.

I have no quarrel with that point of view. I am trying to point out
some of the ramifications.

> We can't lock the barn before making sure
> that the horses are in their stalls. (ok, mixing metaphors)
> Once we fix this issue, then I think we are ready to take
> the step of making all the general category properties
> normative.
>
> To recapitulate, we are faced with two main choices for the math
> clones:
>
> 1. Make the math clones symbols.

Translation: Give them the "So" General Category in the UCD. We don't
have to "make" them symbols -- they already *are* symbols *and*
letters.

> 1.a. and revise the properties for the 'filler' letter-like symbols
> for consistency.
> 1.b. and leave the letter-like symbols as is, accept the
> inconsistency.
> 1.c. and leave the letter-like symbols as is, fill in the holes such
> as 1D455.

Of these, I consider 1.c *completely* unacceptable, as it would
constitute the intentional introduction into the standard of 25 utter
duplicates as bad as the Ohm sign and Angstrom sign. The UTC already
decided to leave those holes, and heading down the path of 1.c would
clearly require a reconsideration vote, as far as I am concerned.

1.b would have the least impact on any *existing* code, tables, or
standards. So of these 3 choices, 1.b is clearly the conservative
choice. It would, however, introduce a principled inconsistency
between the letterlike symbols of 21XX and the letterlike symbols of
1D400..1D7FF. People need to consider if they can live with the
implications of that inconsistency.
Among those implications, all things else staying the same, is that
the letterlike symbols of 21XX would be valid in identifiers and the
new alphanumeric letterlike symbols on Plane 1 would not.

Option 1.a disrupts the most, by deliberately changing the category of
existing letterlike symbols. It would change the behavior of existing
APIs, change the class assignments of these characters in identifiers
(or categories related to identifier syntax), and would impact the
implementation of the code now generating weight tables for collation.

>
> 2. Make the math clones like the current letter-like symbols.

Unlike for Mark, this is my own strong preference. It has the combined
virtues of no disruption of current category assignments and
consistency of assignments for characters that are clearly intended to
fill out the complementary set against the existing letterlike
symbols.

If the goal here is to keep 1D400..1D7FF out of Java and XML
identifiers, I have yet to be convinced why this couldn't be handled
by another simple production rule that directly excluded 1D400..1D7FF
from the allowed members of the identifier_start class.

If the concern is that *no* letterlike symbol should be allowed in an
identifier, then adjust the identifier syntax accordingly. This would
require revisiting TR 10176 and would require people to adjust their
implementations to update against the revised statement of identifier
syntax, but would have less significant ramifications than Option 1.a
above.

What am I missing here? What other significant processes would be
benefited so much by changing the existing letterlike symbols from
Lu/Ll to So, or would be significantly harmed by the assignment of
Lu/Ll to the letterlike symbols of the math alphanumerics?

--Ken

>
> To limit the damage that these characters do, I strongly feel that we
> should choose #1. I have my favorite among 1a, 1b, and 1c, but any
> would be better than #2.
>
> Mark
>
>
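
The production rule suggested above under option 2, excluding
1D400..1D7FF directly from the identifier start class, might look
roughly like the following. This is only an illustrative sketch, not
the identifier definition of the Unicode Standard or of TR 10176: the
set of letter categories is an assumption made for the example, the
names MATH_ALPHANUMERICS, LETTER_CATEGORIES and is_identifier_start
are hypothetical, and Python's unicodedata module reflects a later
version of the character database than the one discussed in this
message.

```python
import unicodedata

# Assumption: the range to exclude is exactly the one named in the message.
MATH_ALPHANUMERICS = range(0x1D400, 0x1D7FF + 1)

# Assumption: letter-type General Category values taken as identifier starts.
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


def is_identifier_start(ch: str) -> bool:
    """Category-based identifier start test with one explicit exclusion.

    The math alphanumerics keep whatever General Category they are given
    (for example Lu/Ll); they are kept out of identifiers by the range
    check alone, without touching their category.
    """
    if ord(ch) in MATH_ALPHANUMERICS:
        return False
    return unicodedata.category(ch) in LETTER_CATEGORIES


if __name__ == "__main__":
    samples = {
        "LATIN CAPITAL LETTER A": "A",
        "FULLWIDTH LATIN CAPITAL LETTER A": "\uff21",
        "SCRIPT CAPITAL L": "\u2112",
        "MATHEMATICAL FRAKTUR CAPITAL A": "\U0001D504",
    }
    for name, ch in samples.items():
        print(f"U+{ord(ch):04X} {name}: {is_identifier_start(ch)}")
```

Under this sketch the existing letterlike symbols of 21XX and the
fullwidth letters stay valid as identifier starts, so current
behavior is undisturbed, while everything in 1D400..1D7FF is rejected
regardless of its category. Excluding *all* letterlike symbols would
mean adding further ranges or code points to the exclusion test.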