L2/00-240

Kenneth Whistler on 07/07/2000 12:13:38 PM
Please respond to unicore@unicode.org
Subject: Re: UTC Agenda item: Mathematical Letter Symbols

Mark suggested:

> There are two topics that we need to cover at the next meeting having
> to do with the Mathematical Letter Symbols
>
> 1. Do we recommend the use of these characters in rich-text
> environments: in environments like MathML that have rich enough
> structure to encode the proper information (and more, of course)?
>
> 2. Do we categorize these characters as Letters or as Symbols?

I have no quarrel with the need to take up this discussion at the UTC meeting. We do need to make a determination on these. However, I disagree with Mark's conclusions here.

>
> Here are my thoughts on them.
>
> 1. Markup.
>
> Fundamentally, once the characters are encoded in Unicode 3.1, and are
> used in accordance with their plaintext semantics, their use is
> conformant even in environments where they would be better replaced by
> markup or other out-of-band information. So in some sense, the only
> thing the UTC can do is make a recommendation. However, we should try
> to give guidance on the use of these characters and their interaction
> with markup. Since mathematics (except for fragments) has
> fundamentally a non-linear structure, thus requiring markup or
> equivalent for correct representation, and since mathematics is
> fundamentally generative (with some inventive mathematicians somewhere
> using some interesting glyphs to convey some distinction), I think our
> recommendation should be to replace the clones with markup in
> interchange.

It is my understanding that both the MathML community and the math layout software companies want these encoded as characters precisely so they don't need to do markup (or apply styles) to these. They want to be able to have these as primitive entities in the underlying representation -- i.e. as characters.
Making the recommendation to replace the alphanumeric symbols with markup in interchange is contra to the reason for encoding them in the first place.

Note that your recommendation would also be tantamount to a recommendation to replace the letterlike symbols (in block 21XX) with markup in interchange -- and that is starting down the very slippery slope of trying to get people to replace the use of compatibility characters with marked-up Cleanicode.

I realize that UTR #20 is having to confront these issues, in making recommendations for use of Unicode in XML and other markup languages. However, it is one thing to recommend the non-use of certain characters in a context *when you are using* a markup language for interchange. It is another thing to generically recommend the non-use of certain characters and their replacement by markup in interchange.

Further, I think we may need to distinguish different situations even when markup languages are being used. Recommending the non-use of the alphanumeric symbols in MathML, when the MathML designers *want* to use them instead of markup, seems perverse to me.

>
> 2. Symbols
> The only basis for adding these characters are that they are NOT
> treated as letters -- that they are treated as symbols.

This is manifestly not the case. Yes, they are treated as symbols, but that is not the "only basis" for adding them as characters. They are encoded as characters for compatibility with existing practice in Mathematica. They are encoded as characters to avoid having to encode combining math style marks as characters. They are encoded as characters to have basic entities to make textual distinctions used by mathematicians without having to introduce style markup to maintain those distinctions.

> Categorizing
> them as Sm -- mathematical symbols -- will result in more applications
> correctly handing them, and distinguishing them from the true letters.
>
> For consistency, we should revisit the few scattered characters in the
> BMP that are filling holes in the math characters, as listed in
> http://www.unicode.org/unicode/reports/tr24/charts/ScriptChart0.html.

As is often the case, this is a good idea just begging for trouble.

Effectively, Mark is arguing here to give up on the category assignments for letterlike symbols that have been in the standard since Unicode 2.0, to change the Lu's and Ll's in that set to Sm's. Here are examples from UnicodeData-2.0.14.txt (the release version for Unicode 2.0):

2102;DOUBLE-STRUCK CAPITAL C;Lu;0;ON;<font> 0043;;;;N;DOUBLE-STRUCK C;;;;
2108;SCRUPLE;So;0;ON;;;;;N;;;;;

Note that way back when, for Unicode update version 2.1.5, the bidi categories for these anomalous Lu and Ll letterlike symbols were corrected, to make them consistent with other letters. This was a decision that UTC made explicitly. Here is UnicodeData-2.1.5.txt:

2102;DOUBLE-STRUCK CAPITAL C;Lu;0;L;<font> 0043;;;;N;DOUBLE-STRUCK C;;;;
2108;SCRUPLE;So;0;ON;;;;;N;;;;;

So Lurking Problem #1 is that changing the general category from Lu or Ll for these will reintroduce the problem of inconsistency in their bidirectional handling. Should the L's all be changed back to ON's if these are changed to Sm's? Any change in bidi properties of the existing repertoire now would have consequences for existing implementations.

Lurking Problem #2 is that the issue extends beyond the set of characters that were omitted from the repertoire of new alphanumeric symbols because they were already encoded as letterlike symbols. In particular, U+2107 EULER CONSTANT, U+210F PLANCK CONSTANT OVER TWO PI, and the 4 Hebrew symbols U+2135 ALEF SYMBOL .. U+2138 DALET SYMBOL also have Lu, Ll, or Lo general categories (and bidi category L). I am presuming that Mark, for consistency, would want to switch these also to Sm, even though there is no complementarity issue here with the new alphanumeric alphabets on Plane 1.
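
The difference between the two quoted releases can be checked mechanically. The following is only a rough sketch in Python, not anything from the UTC discussion: the field labels in FIELDS and the helper names are made up for illustration, and the records are the ones quoted above, cut down to their first five semicolon-separated fields.

```python
# Illustrative labels for the first five UnicodeData.txt fields
# (assumed names for this sketch, not official ones).
FIELDS = ("code", "name", "general_category", "combining_class", "bidi_category")

# Records quoted above, abbreviated to their first five fields.
UNICODEDATA_2_0_14 = [
    "2102;DOUBLE-STRUCK CAPITAL C;Lu;0;ON",
    "2108;SCRUPLE;So;0;ON",
]
UNICODEDATA_2_1_5 = [
    "2102;DOUBLE-STRUCK CAPITAL C;Lu;0;L",
    "2108;SCRUPLE;So;0;ON",
]


def parse_records(lines):
    """Map each code point (hex string) to a dict of its named fields."""
    table = {}
    for line in lines:
        record = dict(zip(FIELDS, line.split(";")))
        table[record["code"]] = record
    return table


def changed_fields(old_lines, new_lines):
    """Return {code: {field: (old, new)}} for code points in both tables."""
    old, new = parse_records(old_lines), parse_records(new_lines)
    changes = {}
    for code in old.keys() & new.keys():
        diff = {
            field: (old[code][field], new[code][field])
            for field in FIELDS
            if old[code][field] != new[code][field]
        }
        if diff:
            changes[code] = diff
    return changes


CHANGES = changed_fields(UNICODEDATA_2_0_14, UNICODEDATA_2_1_5)
print(CHANGES)
```

For these two records the only reported difference is the bidi category of 2102, from ON to L; SCRUPLE is untouched. Reverting that field for every affected Lu or Ll character is the kind of table change Lurking Problem #1 is about.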
Lurking Problem #3 is that this change has ramifications for identifiers. Lu, Ll, and Lo are all among the general category values recommended for inclusion in identifiers. Sm (and So) are *not*. (See page 135 of the Unicode Standard.) So at this point, any change from Lx to Sx for a character in the UCD constitutes a recommendation to change the acceptable repertoire for identifiers. That will impact some implementations -- and it will put the Unicode Standard recommendation out of synch, once again, with Annex A to ISO TR 10176, after we just went through the exercise of pushing through an Amendment to that TR, so that they *would* be in synch.

Lurking Problem #4 is that the proposed change would impact case tables. Changing Lu or Ll to Sm implies that we are recommending that the character no longer be considered upper- or lower-case. True, none of these letterlike symbols have case *mappings* now. But any API which is currently returning True for isuppercase() or islowercase() for these letters should return False after this change -- meaning changes in tables.

Further, I should remind people that the math property for characters cannot be predicted from the general category value in the UCD anyway. So if the issue is consistency of detection and behavior for these characters by *mathematical* applications, then changing their general category from Lu or Ll to Sm is basically moot. Math applications also have to detect regular Greek letters (Lu, Ll) and lots of punctuation with a whole variety of general category assignments.

The only correct handling consistency argument that I see holding any water is for non-mathematical, general text applications to consistently determine that the math alphanumerics and the relevant letterlike symbols are *not* "true letters", as Mark puts it.
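
Lurking Problems #3 and #4 can be seen in any library whose identifier and case tests are derived from the UCD. Here is a small sketch, assuming Python 3 and its standard unicodedata module (neither is part of the original discussion). U+2211 N-ARY SUMMATION is not mentioned in the thread; it is included only as a sample Sm character, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Sample characters; the dictionary keys are illustrative labels only.
SAMPLES = {
    "double_struck_c": "\u2102",  # DOUBLE-STRUCK CAPITAL C
    "scruple": "\u2108",          # SCRUPLE
    "n_ary_summation": "\u2211",  # N-ARY SUMMATION (assumed sample, Sm)
}


def describe(ch):
    """Collect the UCD-derived answers an application might depend on."""
    return {
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "is_upper": ch.isupper(),
        "is_identifier": ch.isidentifier(),
    }


REPORT = {label: describe(ch) for label, ch in SAMPLES.items()}
for label, row in REPORT.items():
    print(label, row)
```

With U+2102 categorized as Lu, the uppercase and identifier tests both succeed for it, while the So and Sm samples fail both. A reassignment from Lu to Sm would plausibly flip those two answers for U+2102 in any implementation built this way, which is the table churn described above.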
To make that determination, rather than jiggering, once again, the sorely overloaded and imprecise General Category values in the UCD, we should be examining the math property for consistency instead. If we simply corrected that listing, so that all of these letterlike symbols that pattern with the math alphanumerics are consistently given the math property (along with the 6 other outliers in the letterlike symbols I noted above), then a general application has a way of telling these things apart by property, as long as the math property is made available to it.

>
> I feel sufficiently strongly about this that if we cannot agree to
> change the few scattered characters, that we should go ahead and fill
> the holes, mark them all as Sm, and discourage the use of the
> scattered ones.

I wish this were stated as more than a sufficiently strong feeling. What are the envisioned implementation problems or textual interpretation catastrophes that make the current property assignments untenable? Why would a review and correction of the math property assignment not be a sufficient solution, without introducing the lurking problems associated with changing a normative Lu or Ll assignment? And what is so wrong with the current situation that it would lead you to countenance:

A. Encoding 25 characters that would manifestly be duplicates of 25 that we already have -- after both the UTC and WG2 took explicit decisions to omit those 25 from the Plane 1 repertoire for the math alphanumerics because they were already encoded.

B. Deprecating 25 characters that people are already using in their intended usage, and telling them to use 25 others instead.

One of the conclusions that this is leading me (and I suspect Asmus) to is that it is time to stop the fiction that some General Category assignments are not normative. At this point any change to any value of General Category in the UCD can impact many implementations.
It is time for the UTC to simply declare *all* values of that field to be normative, but then to clarify what conformance to the General Category values means.

We have crossed a watershed for UnicodeData.txt, in particular. Just as we once accepted wholesale name changes for characters in the Unicode Standard (when no one was depending on their values, and when there was a good reason to do so), but now refuse to change *any* name, we are approaching the point where making changes to values in UnicodeData.txt is no longer a matter of "fixing" things, because any change breaks as many things as it might fix. The interdependencies have grown.

Secondly, I want to point out yet again the thing I have been harping on for several years now. The General Category field in UnicodeData.txt is overloaded and ill-designed for solving all character property-related problems. It serves a bunch of useful purposes, but we cannot keep loading it with more and more requirements and keep expecting to get all program behavior to be consistent by jiggering the assignments of characters here and there.

The fact is that U+2102 DOUBLE-STRUCK CAPITAL C *is* a letter, and *is* uppercase. It also functions as a symbol, and in particular as a math symbol. It also may be appropriate as an identifier (certainly for an implementation of a mathematical algebra that treated mathematical variables as formal identifiers in the syntax). Trying to squash all those together into a single category assignment in the General Category partition is just the wrong thing to do.

--Ken

Kenneth Whistler on 07/10/2000 11:30:49 AM
Please respond to unicore@unicode.org
To: "Multiple Recipients of Unicore"
cc: kenw@sybase.com
Subject: Re: UTC Agenda item: Mathematical Letter Symbols

Mark said in response to Michael Everson:

> Unicode/10646 are peppered with duplicate characters, introduced or
> inherited for one reason or another. The math letter clones just add more,
> so this is not a new phenomenon.
>
> That's one of the reasons we had to come up with compatibility mappings!

There are duplicates, and then there are duplicates. It is not useful to paper over the distinction so glibly here.

The duplicates that the UTC openly claims to be duplicates are those that we give singleton canonical mappings to, e.g. EM QUAD, EN QUAD, the two Vietnamese tone marks, koronis, erotimatiko (Greek question mark), and the duplicated Han characters in F9XX and FAXX, among others.

The characters getting compatibility mappings are those for which some significant distinction (often a formatting-related one) from the fundamental character or characters they are equated to is being maintained in a source standard or other source. Hence the half-width and full-width characters, for example. These are not truly duplicates in either the sources or in Unicode -- they are *kinds* of, rather than *duplicates* of.

The math alphanumerics clearly fall into the latter category, as is demonstrated by the longstanding treatment of the small group of these already encoded among the letterlike symbols.

--Ken

>
> Mark
>
> Michael Everson wrote:
>
> >
> > Um, there shall be no duplicate characters. (Except of course for CJK
> > radicals.)
> >
> > ...