L2/09-238
Date: Sat, 11 Jul 2009 21:35:40 -0700
From: Asmus Freytag
Subject: Issues with the new casing related properties in the beta DerivedCoreProperties

I noticed that the Is_xxxCase properties in DerivedCoreProperties.txt for Unicode 5.2.0 beta use the _opposite_ convention from all the other binary properties (suddenly the "YES" value is the default, and the "NO" value is listed with two columns).

The UTC should take a hard look at this and change the way these are listed so that all Boolean properties use the same convention in this file. It should be possible to utilize these new values in 5.2.0 without adding new tricks to simple parsers for boolean values.

Beyond parsing issues, the current way of "reverse" listing of these properties in the beta draft also hides the fact that their definition (as expressed in their naming) is anything but intuitive. Most users would be very surprised to see an API return *both* "isUppercase==true" and "isLowercase==true" for all Han, Hangul and Hieroglyphs.

The values of these new properties clearly indicate that the intent is to denote when characters are affected by upper/lower or titlecase *mappings*. However, naming the property "Is_xxxCase" implies something a bit different, namely an *affirmative statement* about the nature of case of the character, which is inappropriate for characters that are not part of casing scripts.

The counterintuitive nature of the draft names for these properties can perhaps best be understood if you realize that, in addition to characters being called both "Is_Uppercase" and "Is_Lowercase", you have characters that are also "Case_Ignorable" on top of that, and that characters that are Uppercase in another section of the same file are not also "Is_Uppercase".

Using intuitive names for properties, or at least names that are not totally counterintuitive, is very important as it increases the chance that these new properties will actually be implemented correctly.
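
The surprise described above for Han and Hangul can be reproduced with a short sketch. It assumes the draft properties are read as "the character is unchanged by the corresponding case mapping", applied to the NFD form. The helper names is_uppercase and is_lowercase are hypothetical, and Python's str.upper() and str.lower() stand in for the Unicode case mappings; this is an illustration under those assumptions, not the definition used to generate the data file.

```python
import unicodedata


def is_uppercase(s: str) -> bool:
    # Hypothetical helper: true when uppercasing the NFD form changes nothing.
    y = unicodedata.normalize("NFD", s)
    return y.upper() == y


def is_lowercase(s: str) -> bool:
    # Hypothetical helper: true when lowercasing the NFD form changes nothing.
    y = unicodedata.normalize("NFD", s)
    return y.lower() == y


# Latin capital A, Latin small a, a Han ideograph, a Hangul syllable.
for ch in ["A", "a", "\u6f22", "\ud55c"]:
    print(
        f"U+{ord(ch):04X}",
        "mapping-based:", is_uppercase(ch), is_lowercase(ch),
        "conventional:", ch.isupper(), ch.islower(),
    )
```

Under these assumptions both helpers return True for U+6F22 and U+D55C, while the conventional str.isupper() and str.islower() tests return False for them, which is the gap between the property name and what a user would expect it to mean.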
As it stands, these properties constitute a potential trap for the unwary.

Simply pointing to the definitions in chapter 3 as a motivation for these strange Booleans is not particularly helpful. What those definitions describe are mapping functions such as IsUppercase(x) that are defined and described in a specialized context: Default Case Detection. In such a well-constrained context, the overly generic naming of these specialized operations was perhaps sloppy but not a major problem. It becomes a serious issue if it starts to contaminate global name spaces (property aliases) where there is no guiding context.

For the DerivedCoreProperties file as drafted for the beta, the context is available, but it is buried in a comment. APIs will be written that simply return an "IsUpperCase" bit. In such a situation, the reuse of the generic name, intuitively implying "this is an uppercase character in the conventional sense", could be quite harmful. Therefore, this should be corrected before these properties become enshrined in a published release.

The simplest fix for both the naming and the parsing issue would be to rename the properties to something like "Not_Uppercase" and change the listed values from NO to YES. The listing should then follow the usual convention for boolean values, which is to use a single column for all values that are "true" and to elide those values that are "false".

This can be trivially related to the operations defined in the Default Case Detection by stating, in both a comment and an annotation in the appropriate places, that "IsUppercase(x) in that definition is true for all characters that have the NO value for Not_Uppercase", and so on.

With this change, and without the need to fiddle with the wording of those definitions, the properties will not be misapplied when implementers encounter them in API documentation or in any other situation where the relation to their definitions isn't readily apparent or available.
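
A minimal sketch of what a simple parser could look like under the proposed listing follows. It assumes the usual "range ; property" layout with "#" comments; the sample lines are illustrative only and are not copied from any data file, "Not_Uppercase" is the hypothetical name suggested above, and the function names are invented for this example.

```python
from typing import Dict, List, Tuple

# Illustrative sample in the single-column style (not real data file content).
SAMPLE = """\
# Derived property: Not_Uppercase (hypothetical name)
0061..007A    ; Not_Uppercase # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00B5          ; Not_Uppercase # L&       MICRO SIGN
"""

Ranges = Dict[str, List[Tuple[int, int]]]


def parse_binary_properties(text: str) -> Ranges:
    """Collect code point ranges per property; unlisted code points are false."""
    ranges: Ranges = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 2:
            # An extra value column is exactly the special case to avoid.
            raise ValueError(f"unexpected field count: {line!r}")
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.setdefault(fields[1], []).append((lo, hi))
    return ranges


def has_property(ranges: Ranges, name: str, cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges.get(name, []))


def is_uppercase(ranges: Ranges, cp: int) -> bool:
    # Relation proposed in the text: IsUppercase(x) holds when Not_Uppercase is NO.
    return not has_property(ranges, "Not_Uppercase", cp)


props = parse_binary_properties(SAMPLE)
print(is_uppercase(props, ord("a")))  # False: listed as Not_Uppercase
print(is_uppercase(props, 0x6F22))    # True: not listed, so Not_Uppercase is NO
```

The point of the sketch is that the parser needs no knowledge of which properties default to YES; the inversion lives in one clearly named function next to the Default Case Detection logic.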
(If the renaming is adopted, there should be no change in the actual character ranges listed.)

This comment does not suggest changing the underlying classification of the characters, only their presentation as surfaced in both the property aliases and the nature of their listing in the data file.

A./