L2/08-263

Title: Code Point Labels -- Suggested Wording Details
Source: Ken Whistler
Date: July 25, 2008
Ref: L2/08-206, L2/08-249

**************************************************************

Background

At UTC #115, I presented the issues regarding the need for a definition of code point labels (and clarifications regarding the exact values of Unicode character names). I won't repeat all the background here. For that see L2/08-206.

The UTC approved the general proposal and tasked me in AI 115-A036 to "Draft documentation of the concept of a code point label distinct from Unicode character names, for a future version of the standard."

I have drafted part of that documentation for the proposed update for UAX #44, "Unicode Character Database," which is posted for separate discussion and review. See L2/08-249.

The remainder of this document represents my suggested disposition of the rest of the text, adapted from the L2/08-206 proposal text, with some indications of positions in the text of the standard.

What I am suggesting now is that, if this generic disposition meets with committee approval, the text be remanded to the editorial committee for the detailed editorial work for eventual insertion into the text of Unicode 5.2 (or Unicode 6.0).

**************************************************************

[[ As a 4th bullet under definition D4 Character Name in Chapter 3, insert ]]

* The detailed specification of the Unicode character names, including rules for derivation of some ranges of characters, is given in Section 4.8, "Name -- Normative". That section also describes the relationship between the normative value of the Name property and the contents of the corresponding data field in UnicodeData.txt in the Unicode Character Database.

[[ Incorporate the following text in Section 4.8, "Name -- Normative", as a subsection, with appropriate editorial adjustments to other existing text in that section.
]]

Unicode Character Name

The Name property (short alias: "na") is a string property. Its value for all Graphic and Format characters is the Unicode character name as generally understood.

For Graphic and Format characters other than ideographs and Hangul syllables, the name is as listed in field 1 of UnicodeData.txt.

For Hangul syllables, the name is derived by rule, as specified in Section 3.12, under "Hangul Syllable Name Generation", making use of the values of the Jamo_Short_Name property.

For ideographs, the name is derived by rule, by concatenating the string "CJK UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-" (or another prefix as specified, e.g. "TANGUT IDEOGRAPH-") with the code point, expressed in hexadecimal, with the usual 4 to 6 digit convention.

The exact ranges subject to these name derivations are specified by a name range convention used in field 1 of UnicodeData.txt.

For all *other* Unicode code points of all types, the value of the UCD Name property is the null string. In other words, na="".

Note that the Unicode Name property values are unique for all non-null values, but not every Unicode code point has a unique Unicode Name property value.

Furthermore, the Name property value uniqueness requirement interacts with name assignment rules for formal aliases and for named character sequences: Unicode character names, formal aliases, and named character sequences constitute a single, unique namespace.

As a corollary to this specification, it should be noted that the value of field 1 (the string of characters between the semicolon separators) is to be taken as the normative specification of the UCD Name property only for Graphic and Format characters other than ideographs and Hangul syllables.
All other values which occur in field 1 are to be understood as meta-labels that serve other functions in the generation of names lists and charts, or to label abbreviated ranges of property definitions, but do *not* constitute values of the UCD Name property per se.

[[ In TUS 5.0, on page 79, after the existing definition D10 Code Point, insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

* See Table 2-3, "Types of Code Points", for a summary of the meaning and use of each class.
* For Noncharacter, see also D14 Noncharacter.
* For Reserved, see also D15 Reserved code point.
* For Private-Use, see also D49 Private-use code point.
* For Surrogate, see also D71 High-surrogate code point and D73 Low-surrogate code point.

D10b Code Point Type Label: A unique label for each code point type.

* Each code point type label is a lowercase string, defined according to the following table.

[[ Insert as table. Caption: Code Point Type Labels ]]

Type            Label
==========================
Graphic         graphic
Format          format
Control         control
Reserved        reserved
Noncharacter    noncharacter
Private-Use     private-use
Surrogate       surrogate

D10c Code Point Label: A unique label for each code point in the Unicode codespace.

[[ Edit the following specification for the code point label to an appropriate set of bullets and/or body text, to fill out the definition. ]]

The code point label is distinguished from the expression of the code point per se (for example, "U+0000" or "U+0061"), which itself is also a unique identifier, as described in Appendix A, Notational Conventions. (See also Clause 6.5 Short identifiers for code positions (UIDs) in ISO/IEC 10646.)
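
For illustration only, the assignment of a code point type label (D10b) to a code point might be sketched as follows in Python. This is not part of the proposed text. The mapping from General_Category values to code point types is an assumption based on my reading of Table 2-3 (Cc for Control; Cf, Zl, and Zp for Format; Cs for Surrogate; Co for Private-Use; Cn split between Noncharacter and Reserved), and the function names are hypothetical. The results depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type_label(cp: int) -> str:
    """Return the lowercase code point type label for a code point.

    Hypothetical helper: the General_Category mapping below is an
    assumption taken from Table 2-3, not from the proposed wording.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    if gc == "Cs":
        return "surrogate"
    if gc == "Co":
        return "private-use"
    if gc == "Cn":
        # Unassigned code points are either noncharacters or reserved.
        return "noncharacter" if is_noncharacter(cp) else "reserved"
    # Everything else (L, M, N, P, S, Zs) is treated as Graphic.
    return "graphic"
```

For example, code_point_type_label(0x0009) yields "control" and code_point_type_label(0xFFFF) yields "noncharacter".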
The Unicode code point label is a unique string value defined as follows:

For any Unicode code point for which the value of the UCD Name property is non-null, the code point label is identical to the Unicode character name. This will be the case for all Graphic and Format code points.

Otherwise, the code point label is constructed as follows: Concatenate the code point type label for the code point, "-", and the 4 to 6 digit hexadecimal representation of the code point.

[[ Insert as table. Caption: Construction of Code Point Labels ]]

Type            Label
=================================
Control         control-NNNN
Reserved        reserved-NNNN
Noncharacter    noncharacter-NNNN
Private-Use     private-use-NNNN
Surrogate       surrogate-NNNN

When displayed in mixed contexts with Unicode character name values, to avoid any possible confusion with actual, non-null Unicode Name values, constructed Unicode code point labels are displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, etc.
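
As a rough illustration of the construction rule above, and not as part of the proposed text, a code point label could be computed along the following lines in Python. The function name and the bracketed parameter are hypothetical, and the caller is assumed to supply the code point type label. The sketch relies on unicodedata.name(), which returns the character name (including derived ideograph and Hangul syllable names) or the given default when a code point has no name in the UCD version bundled with the interpreter.

```python
import unicodedata


def code_point_label(cp: int, type_label: str, bracketed: bool = False) -> str:
    """Return the code point label for cp.

    type_label is the caller-supplied code point type label (for example
    "control"); it is only used when the Name property value is null.
    bracketed=True wraps constructed labels in angle brackets, for display
    alongside real character names.
    """
    name = unicodedata.name(chr(cp), "")
    if name:
        # Non-null Name value: the label is identical to the character name.
        return name
    # Constructed label: type label, "-", then 4 to 6 uppercase hex digits.
    label = f"{type_label}-{cp:04X}"
    return f"<{label}>" if bracketed else label


print(code_point_label(0x0061, "graphic"))
print(code_point_label(0x0009, "control"))
print(code_point_label(0xFFFF, "noncharacter", bracketed=True))
```

The "04X" format pads to at least four digits and lets supplementary code points expand to five or six, which matches the usual 4 to 6 digit convention.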