L2/01-482

From: Kenneth Whistler [kenw@sybase.com]
Sent: Wednesday, December 19, 2001 8:33 PM

Subject: UTC Agenda Item: normativity of compatibility tags


The issue is this:

Currently, while the decomposition mappings for all Unicode
characters in UnicodeData.txt have normative status (and are
closely tied in with normalization, among other things), the
particular tags provided for compatibility mappings are still
documented as informative.

The relevant portions of the standard are:

1. Section 3.6 Decomposition, which makes it clear that the
full decompositions are normative, and which distinguishes
normatively between canonical and compatibility decomposition,
but which says nothing about the actual values of the compatibility
tags accompanying the compatibility mappings.

2. The Decompositions section in Chapter 14, which explains
the format of the decompositions in the code chart names list, and
which states: "Formatting information may be indicated inside angle
brackets." It then lists the actually occurring compatibility tags,
but says nothing about their normative or informative status.

3. UnicodeData.html, which duplicates the Chapter 14 list, saying
merely, "The compatibility formatting tags used are:" and then
giving the list, again without saying anything about their
normative or informative status.

4. Table 4-1 in Chapter 4, which states that Canonical and
Compatibility decompositions are normative, but which says nothing
about the compatibility tags.

5. DerivedProperties.html, a recently added file, which currently
states, about DerivedDecompositionType.txt, that "The value
'canonical' is normative; the others [i.e. compatibility tag
values --kenw] are informative."

6. PropertyValueAliases.txt, which again lists the compatibility tags,
and gives aliases for each. The documentation for this file, however,
just points back to UnicodeCharacterDatabase.html for an indication
of which properties are normative, and that file, in turn, points to
UnicodeData.html for the properties relevant to UnicodeData.txt (see
above).

As you can see, the status of these "compatibility formatting tags"
is not at all clear.

However, these tags are now involved in a significant number of
fairly strict dependencies:

A. UTS #10 Unicode Collation Algorithm, and ISO/IEC 14651 (both
published standards) have strong dependencies on the exact
compatibility tags in UnicodeData.txt for the establishment of
all the tertiary weights in allkeys.txt (for UCA) and in the
Common Tailorable Template table (for 14651).

B. UTR #20 Unicode in XML and other Markup Languages, has an
extensive table showing how to handle various characters with
compatibility mappings -- a table whose content is dependent on
the exact values of the compatibility tags. This UTR is not a
standard, but is a TR published jointly with W3C; it is difficult
to modify, and clearly has an impact on implementation of other
standards beyond our control.

C. The existence of PropertyValueAliases.txt itself is encouraging
other standards to make use of the exact list of aliases we provide
there in reference to the compatibility tags (among others) --
which is likely to result in this list being used in XML data
(and other contexts).

D. The crosschecking of our properties for internal consistency
has, among other things, started to make use of some consistency
checks against the compatibility tags to assist in determining
whether other property assignments (e.g. in LineBreak.txt,
EastAsianWidth.txt, PropList.txt, and so on) are reasonable and
consistent.

E. Implemented APIs exist that return compatibility tag values
for Unicode characters. Other software has also been implemented
which parses UnicodeData.txt for various purposes, and which often
makes assumptions regarding the stability both of the number of
compatibility tags and of their exact labelled values.
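
To make the dependency in item E concrete, here is a minimal Python
sketch of the kind of parser described there. It reads the
decomposition field (the sixth semicolon-separated field) of a
UnicodeData.txt line and splits off the compatibility tag. The names
parse_decomposition, KNOWN_TAGS, and SAMPLE_LINES are hypothetical
and introduced only for illustration; the tag set is assumed to be
the list given in UnicodeData.html, and the sample lines are assumed
to follow the UnicodeData.txt format. The sketch deliberately rejects
any tag outside the known set, which is exactly the sort of stability
assumption that existing software tends to make.

```python
import re

# Assumption: the compatibility formatting tags as listed in UnicodeData.html.
KNOWN_TAGS = {
    "font", "noBreak", "initial", "medial", "final", "isolated",
    "circle", "super", "sub", "vertical", "wide", "narrow",
    "small", "square", "fraction", "compat",
}

# A tagged decomposition looks like "<tag> XXXX XXXX ...".
_TAGGED = re.compile(r"<([A-Za-z]+)>\s*(.*)")

# Illustrative sample lines in UnicodeData.txt format (assumed contents).
SAMPLE_LINES = [
    "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;",
    "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;",
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]


def parse_decomposition(line):
    """Return (code_point, decomposition_type, mapping) for one line.

    decomposition_type is None when the character has no decomposition,
    "canonical" when the mapping carries no tag, and otherwise the tag
    name without its angle brackets.
    """
    fields = line.rstrip("\n").split(";")
    code_point = int(fields[0], 16)
    decomposition = fields[5]

    if not decomposition:
        return code_point, None, []

    match = _TAGGED.fullmatch(decomposition)
    if match:
        tag, rest = match.group(1), match.group(2)
        # Strict check: a parser written this way breaks if a tag is
        # renamed or a new one is introduced.
        if tag not in KNOWN_TAGS:
            raise ValueError(f"unknown compatibility tag: <{tag}>")
    else:
        tag, rest = "canonical", decomposition

    return code_point, tag, [int(cp, 16) for cp in rest.split()]


if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        cp, kind, mapping = parse_decomposition(sample)
        print(f"U+{cp:04X}", kind, [f"{m:04X}" for m in mapping])
```

A parser of this shape fails outright on any tag outside the set it
was written against, which is why changes to the tag labels or to
their number are visible to implementations.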

All of this, in my mind, is pointing very strongly towards
normative status for the compatibility tags. In 1995/96 these might
still have had informative status, and were subject to a fair
amount of dickering. (The exact tags and their distribution changed
rather drastically between Unicode 1.1 and Unicode 2.0, for example.)
However, currently, I believe that the compatibility tags can no
longer be subject to arbitrary changes, and all new characters
that have compatibility decompositions have to be assigned tags from
the current set, to avoid breaking existing expectations and software.

Therefore, after whatever appropriate discussion ensues,
I would propose that the "compatibility formatting tags" and their
exact labels be formally considered normative properties in the
Unicode Character Database, and that all the relevant places in
the standard be updated to make this status very clear.