L2/01-482
From: Kenneth Whistler [kenw@sybase.com]
Sent: Wednesday, December 19, 2001 8:33 PM
Subject: UTC Agenda Item: normativity of compatibility tags

The issue is this: Currently, while the decomposition mappings for all Unicode characters in UnicodeData.txt have normative status (and are closely tied in with normalization, among other things), the particular tags provided for compatibility mappings are still documented as informative.

The relevant portions of the standard are:

1. Section 3.6 Decomposition, which makes it clear that the full decompositions are normative, and which distinguishes normatively between canonical and compatibility decomposition, but which says nothing about the actual values of the compatibility tags accompanying the compatibility mappings.

2. The Decompositions section in Chapter 14, which explains the format of the decompositions in the code chart names list, and which states: "Formatting information may be indicated inside angle brackets." It then lists the actually occurring compatibility tags, but says nothing about their normative or informative status.

3. UnicodeData.html, which duplicates the Chapter 14 list, saying merely, "The compatibility formatting tags used are:" and then giving the list, again without saying anything about their normative or informative status.

4. Table 4-1 in Chapter 4, which states that Canonical and Compatibility decompositions are normative, but which says nothing about the compatibility tags.

5. DerivedProperties.html, a recently added file, which currently states, about DerivedDecompositionType.txt, that "The value 'canonical' is normative; the others [i.e. compatibility tag values --kenw] are informative."

6. PropertyValueAliases.txt, which again lists the compatibility tags, and gives aliases for each.
The documentation for this file, however, just points back to UnicodeCharacterDatabase.html for an indication of which properties are normative, and that file, in turn, points to UnicodeData.html for the properties relevant to UnicodeData.txt (see above).

As you can see, the situation is not all that clear about these "compatibility formatting tags". However, we are currently in the situation where these tags are involved in a significant number of fairly strict dependencies.

A. UTS #10 Unicode Collation Algorithm, and ISO/IEC 14651 (both published standards) have strong dependencies on the exact compatibility tags in UnicodeData.txt for the establishment of all the tertiary weights in allkeys.txt (for UCA) and in the Common Tailorable Template table (for 14651).

B. UTR #20 Unicode in XML and other Markup Languages, has an extensive table showing how to handle various characters with compatibility mappings -- a table whose content is dependent on the exact values of the compatibility tags. This UTR is not a standard, but is a TR published jointly with W3C; it is difficult to modify, and clearly has an impact on implementation of other standards beyond our control.

C. The existence of PropertyValueAliases.txt itself is encouraging other standards to make use of the exact list of aliases we provide there in reference to the compatibility tags (among others) -- which is likely to result in this list being used in XML data (and other contexts).

D. The crosschecking of our properties for internal consistency has, among other things, started to make use of some consistency checks against the compatibility tags to assist in determining whether other property assignments (e.g. in LineBreak.txt, EastAsianWidth.txt, PropList.txt, and so on) are reasonable and consistent.

E. Implemented APIs exist that return compatibility tag values for Unicode characters.
And other software has been implemented which parses UnicodeData.txt for various purposes, and which often makes assumptions regarding the stability, both of the number of compatibility tags and of their exact labelled values.

All of this, in my mind, is pointing very strongly towards normative status for the compatibility tags. In 1995/96 these might still have had informative status, and were subject to a fair amount of dickering. (The exact tags and their distribution changed rather drastically between Unicode 1.1 and Unicode 2.0, for example.) However, currently, I believe that the compatibility tags can no longer be subject to arbitrary changes, and all new characters that have compatibility decompositions have to be assigned tags from the current set, to avoid breaking existing expectations and software.

Therefore, after whatever appropriate discussion ensues, I would propose that the "compatibility formatting tags" and their exact labels be formally considered normative properties in the Unicode Character Database, and that all the relevant places in the standard be updated to make this status very clear.
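
As an illustration of the kind of dependency described in item E, here is a minimal sketch of how software typically reads a compatibility tag. It uses Python's standard unicodedata module, whose decomposition() function returns the decomposition field in the same form as UnicodeData.txt, with any tag in angle brackets. The helper names decomposition_tag and is_known_tag are hypothetical, and the KNOWN_TAGS set is written out here as an assumption about the current tag list, not quoted from this message. Code like this breaks quietly if a tag is renamed or a new one is added.

```python
import unicodedata

# Assumption: the compatibility formatting tags as listed in UnicodeData.html.
KNOWN_TAGS = {
    "font", "noBreak", "initial", "medial", "final", "isolated",
    "circle", "super", "sub", "vertical", "wide", "narrow",
    "small", "square", "fraction", "compat",
}


def decomposition_tag(ch):
    """Return 'canonical', a compatibility tag name, or None.

    Hypothetical helper: None means the character has no decomposition.
    """
    field = unicodedata.decomposition(ch)
    if not field:
        return None
    first = field.split()[0]
    # A compatibility mapping starts with a tag in angle brackets.
    if first.startswith("<") and first.endswith(">"):
        return first[1:-1]
    # An untagged mapping is a canonical decomposition.
    return "canonical"


def is_known_tag(tag):
    """Hypothetical check that a tag belongs to the assumed fixed set."""
    return tag in KNOWN_TAGS


if __name__ == "__main__":
    for ch in ("\u00b2", "\u00e9", "a"):
        print("U+%04X" % ord(ch), decomposition_tag(ch))
```

For example, U+00B2 SUPERSCRIPT TWO yields "super", U+00E9 yields "canonical", and "a" yields None. Software that branches on these strings is relying on exactly the stability that normative status would guarantee.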