L2/08-206

Title: Code Point Labels -- Clarification of a Character Name Issue
Source: Ken Whistler
Date: May 2, 2008

**************************************************************

Background

During the run-up to the release of Unicode 5.1, an issue regarding character names -- or more precisely, the values of the Unicode character property Name -- came up and saw some discussion amongst the editorial committee. The issue came to a head precisely because Unicode 5.1 saw the release of the UCD in XML, and the XML data files need an unambiguous determination of the value to be attached to the "na=??" attribute listings for code points, even in those cases where it isn't so obvious whether or not the code point even *has* a name -- namely for control codes, unassigned code points, noncharacters, etc., which we don't typically think of as having Unicode character names.

The question then became just exactly what should be entered in the XML data for the na attribute for a code point like U+0009. Should it be na="" or na="<control>" or na="<control-0009>" or even possibly something else?

The issue is a fraught one, because Name is formally an immutable property and because it is one of the key values that is maintained in synchrony with ISO/IEC 10646 and is additionally subject to longstanding syntactic constraints limiting the allowed characters in names -- which certainly don't include "<" and ">", for example.

In this document, I summarize what existing practice is and then propose adopting the definition of a "Code Point Label" -- as distinct from a Unicode Character Name -- as a way both to encompass current practice and needs for such labels and to avoid confusion and destabilization for the character names per se.

**************************************************************

Existing Practice

As it stands now, both in the Unicode Standard and in ISO/IEC 10646, only graphic and format characters officially have character names. If you refer to Table 2-3, Types of Code Points, in TUS 5.0 (p.
27), Controls (gc=Cc), Private-use (gc=Co), Surrogate code points (gc=Cs), Noncharacters (gc=Cn, Noncharacter_Code_Point=T), and Reserved unassigned (gc=Cn, Noncharacter_Code_Point=F) do not have names.

This fact, however, has not prevented either standard from printing *strings* at code point locations for such code points, in the same slot that one would expect a character name to occur -- and this has occasionally been taken as an indication that those strings are in fact names for those code points.

In the printed versions of the Unicode names list, we have the following conventions:

  Controls (gc=Cc), print the string "<control>"
  Noncharacters (gc=Cn, Noncharacter_Code_Point=T), print the string "<not a character>"
  Reserved (gc=Cn, Noncharacter_Code_Point=F), print the string "<reserved>"

Surrogate code points and Private-use code points are simply never listed, so have no such conventions.

Traditionally, ISO/IEC 10646 has used the following conventions:

  Reserved, print the string "(This position shall not be used)"
  Noncharacters, print the string "(This position is permanently reserved)"

Controls, Surrogate code points, and Private-use code points are never listed, so have no such conventions.

Nobody ever mistook the 10646 strings for "names", in part because they spelled out complete sentences. But the Unicode names list conventions appear more as name-like labels -- and there is a further complication involved.

The issue is this: the Unicode names list, NamesList.txt, is itself a data file that drives the typesetting of the actual code charts for Unicode, but it is also itself derived from other data files -- most importantly the core UCD data file, UnicodeData.txt. UnicodeData.txt has some conventional use of fields to assist in the derivation of NamesList.txt. To wit:

Field 1 is where the normative Unicode name appears for an ordinary (Graphic or Format) character in UnicodeData.txt.
So:

  002C;COMMA;Po;0;CS;;;;;N;;;;;
       ^^^^^

However, in order to carry information about other types of assigned code points, UnicodeData.txt also contains entries for Controls, with values in field 1 corresponding to what gets printed in the code charts. So:

  0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;
       ^^^^^^^^^

It also contains entries for Surrogate code points and for Private-use code points, with values in field 1 corresponding to yet another set of conventions, including strings which are *not* printed in the code charts. So:

  DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
       ^^^^^^^^^^^^^^^^^^^^^^
  E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
       ^^^^^^^^^^^^^^^^^^^^

There are no entries whatsoever in UnicodeData.txt for Noncharacters or Reserved unassigned code points.

In the derivation of NamesList.txt, the string "<not a character>" for Noncharacters is inserted by the program that is used to generate NamesList.txt, rather than being parsed from UnicodeData.txt. The string "<reserved>" for Reserved unassigned code points is also inserted by that program -- but only for the few code points that actually require explicit listing because they need to be there as placeholders for cross-reference annotations. Any other "<reserved>" strings that appear printed in the actual code charts are inserted by an entirely distinct program, unibook, which is used for chart formatting, based on logic that handles the formatting for ranges of unassigned code points within printed blocks.

O.k., to this point what should be clear is that field 1 values in UnicodeData.txt cannot be taken verbatim as being equivalent to the values of the Name property. Values in field 1 using angle brackets are not actually names, but serve as kinds of meta-labels for use in other conventions.

**************************************************************

Unique Code Point Labels

Over the years, there actually has developed yet another set of conventions for labelling code points.
These are the conventions used by Mark Davis' suite of tools that generate derived property files and related data files. The pattern Mark's tools follow is to use a unique label for *every* code point, so that any listing of properties can include identical format comments for listing "names" for each code point or range of code points, whether or not it involves an ordinary character with an ordinary Unicode character name.

The conventions Mark uses derive from the meta-labels long used in the names list, namely "<control>", "<not a character>", and "<reserved>", but he extended and modified them to have labels for each different code point type, including surrogates and private-use, and added the code point as part of the label, to make them unique. Here is a summary of the actual current usage:

  Graphic & Format: Use Unicode character name
  Control:          <control-NNNN>
  Reserved:         <reserved-NNNN>
  Noncharacters:    <noncharacter-NNNN>
  Private-Use:      <private-use-NNNN>
  Surrogates:       <surrogate-NNNN>

Here NNNN gets turned into the code point in hex, using our usual 4 - 6 digit convention for code point values.

This style of labelling code points uniquely has proven useful, and nobody has really objected to it. But as yet it has no official status other than simply praxis in comment fields in the UCD.

A problem arose, however, when it was asserted that these unique code point labels should then be taken as the values of the na (Name) attribute in the XML data files for the UCD.

**************************************************************

Summary

It seems clear that at least some implementations have seen a need for having unique, identifier-like labels for *all* Unicode code points, not merely assigned Unicode characters with the familiar unique Unicode character names. I don't see any reason not to accommodate this need, but feel it is important not to have these labels confused with the actual Unicode character names.

Accordingly, I'd like the UTC to nail these issues down, formally distinguish between Unicode character names and code point labels, and fully specify both.
To this end, the following consists of a proposal for discussion and (I hope) adoption.

**************************************************************

Proposal

1. Unicode Character Name

The UCD Name property (short alias: "na") is a string property. Its value for all Graphic and Format characters is the Unicode character name as generally understood.

For Graphic and Format characters other than ideographs and Hangul syllables, the name is as listed in field 1 of UnicodeData.txt.

For Hangul syllables, the name is derived by rule, as specified in Section 3.12, under "Hangul Syllable Name Generation", making use of the values of the Jamo_Short_Name property.

For ideographs, the name is derived by rule, by concatenating the string "CJK UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-" (or other as specified, e.g. "TANGUT IDEOGRAPH-") to the code point, expressed in hexadecimal, with the usual 4 to 6 digit convention.

For all *other* Unicode code points of all types, the value of the UCD Name property is the null string. I.e., na="".

Note that the Unicode Name property values are unique for all non-null values, but not every Unicode code point has a unique Unicode Name property value. Furthermore, the Name property value uniqueness requirement interacts with name assignment rules for formal aliases and for named character sequences: Unicode character names, formal aliases, and named character sequences constitute a single, unique namespace.

As a corollary to this specification, it should be noted that the value of field 1 (the string of characters between the semicolon separators) is to be taken as the normative specification of the UCD Name property only for Graphic and Format characters other than ideographs and Hangul syllables.
All other values which occur in field 1 are to be understood as meta-labels that serve other functions in the generation of names lists and charts, or to label abbreviated ranges of property definitions, but do *not* constitute values of the UCD Name property.

2. Unicode Code Point Type Label

For each of the seven major types of Unicode code points, there is a unique string label, as follows:

  Graphic:       graphic
  Format:        format
  Control:       control
  Reserved:      reserved
  Noncharacters: noncharacter
  Private-Use:   private-use
  Surrogates:    surrogate

3. Unicode Code Point Label

The Unicode code point label is a unique label for *every* Unicode code point in the entire range: U+0000..U+10FFFF. The code point label is distinguished from the expression of the code point per se (i.e. "U+0000" or "U+0061"), which itself is also a unique identifier, as described in Appendix A, Notational Conventions. (Or see also Clause 6.5 Short identifiers for code positions (UIDs) in ISO/IEC 10646.)

The Unicode code point label is a unique string value defined as follows:

For any Unicode code point for which the value of the UCD Name property is non-null, the code point label is identical to the Unicode character name. This will be the case for all Graphic and Format code points.

Otherwise, the code point label is constructed as follows: Concatenate the code point type label for the code point, "-", and the 4 to 6 digit representation of the code point.

More specifically, the code point labels are as follows:

  Control:       control-NNNN
  Reserved:      reserved-NNNN
  Noncharacters: noncharacter-NNNN
  Private-Use:   private-use-NNNN
  Surrogates:    surrogate-NNNN

When displayed in mixed contexts with Unicode character name values, to avoid any possible confusion with actual, non-null Unicode Name values, constructed Unicode code point labels are displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, etc.
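
As an illustration only, the construction in item 3 of the proposal might be sketched in Python along the following lines. This is not part of the proposal. The function names are hypothetical, the mapping from General_Category values to type labels is my reading of Table 2-3 (Format taken as Cf, Zl, Zp), and the results depend on whichever version of the UCD is bundled with the Python interpreter in use.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type_label(cp: int) -> str:
    """Return one of the seven type labels from item 2 of the proposal.

    Assumption: the gc-to-type mapping follows Table 2-3 of TUS 5.0.
    """
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "control"
    if gc == "Co":
        return "private-use"
    if gc == "Cs":
        return "surrogate"
    if gc == "Cn":
        return "noncharacter" if is_noncharacter(cp) else "reserved"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    return "graphic"


def code_point_label(cp: int) -> str:
    """Return the Name value if non-null, else '<type label>-NNNN'."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # unicodedata.name() returns the supplied default ("") when the
    # interpreter's UCD has no name for the code point.
    name = unicodedata.name(chr(cp), "")
    if name:
        return name
    # {:04X} pads to at least four hex digits and grows to five or six
    # as needed, matching the usual 4 to 6 digit convention.
    return "{}-{:04X}".format(code_point_type_label(cp), cp)


def display_label(cp: int) -> str:
    """Wrap constructed labels in angle brackets for mixed contexts."""
    label = code_point_label(cp)
    if unicodedata.name(chr(cp), ""):
        return label
    return "<" + label + ">"


if __name__ == "__main__":
    for cp in (0x002C, 0x0009, 0xDC00, 0xE000, 0xFFFF, 0x40000):
        print("U+{:04X}  {}".format(cp, display_label(cp)))
```

Note that the sketch leans on the interpreter's own name lookup for the rule-derived names (ideographs, Hangul syllables); if that lookup returns no name for some assigned Graphic or Format character, the function falls through to a constructed label, which the proposal itself would not permit. A production implementation would need to apply the derivation rules of item 1 directly.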