Public Review Issue #129
Code Point Labels -- Suggested Wording Details


At UTC #116, a decision was taken to prepare a Public Review Issue on the topic of Code Point Labels.

The following material consists of the suggested edits to the Unicode Standard to accomplish the formal introduction of Code Point Labels. This text is taken from L2/08-263, with minor additions as suggested during discussion.

If approved, the text would then be remanded to the editorial committee for the detailed editorial work for eventual insertion into the text of Unicode 5.2 (or Unicode 6.0). Additional clarifying text would be inserted into the text of UAX #44, for which there is a separate PRI.

Code Point Labels are suggested as a means of clarifying what exactly is meant by the normative Unicode Name property (the "na" attribute, as recorded in the XML version of the UCD), as opposed to strings constructed to label code points that don't actually have assigned Unicode characters. They would then also formally define the conventions already widely used in the UCD (and elsewhere) for referring to Unicode code points without assigned Unicode characters.

The UTC is seeking feedback from the public regarding the general approach here, as well as any detailed suggestions on the wording proposed.

Outstanding issue: The UTC will need to determine whether Code Point Labels, as defined here, will be considered immutable. That is, would such labels formally be considered a Unicode code point property and, if so, be unchangeable once assigned? This would parallel the way Unicode character names, per se, are handled. (Note that there would need to be an obvious exception for reserved code points, which can get new characters assigned to them, and thus acquire an actual Unicode character name.)

[[ As a 4th bullet under definition D4 Character Name in Chapter 3, insert ]]

[[Incorporate the following text in Section 4.8, "Name -- Normative", as a subsection, with appropriate editorial adjustments to other existing text in that section. ]]

Unicode Character Name

The Name property (short alias: "na") is a string property. Its value for all Graphic and Format characters is the Unicode character name as generally understood.

For Graphic and Format characters other than ideographs and Hangul syllables, the name is as listed in field 1 of UnicodeData.txt.

For Hangul syllables, the name is derived by rule, as specified in Section 3.12, under "Hangul Syllable Name Generation", making use of the values of the Jamo_Short_Name property.
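As an illustration of the rule-based derivation for Hangul syllables, the following is a minimal sketch in Python. It assumes the usual arithmetic decomposition of a syllable into leading consonant, vowel, and trailing consonant indices, and it hard-codes the Jamo_Short_Name values in three lists; the function and constant names are hypothetical and chosen only for this example.

```python
# Sketch of Hangul syllable name generation (names below are illustrative).
S_BASE = 0xAC00                      # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # 11172 syllables in total

# Jamo_Short_Name values, indexed by leading consonant, vowel, trailing consonant.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(cp: int) -> str:
    """Derive the name of a precomposed Hangul syllable from its code point."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    l_index = s_index // N_COUNT
    v_index = (s_index % N_COUNT) // T_COUNT
    t_index = s_index % T_COUNT
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


print(hangul_syllable_name(0xD4DB))  # HANGUL SYLLABLE PWILH
```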

For ideographs, the name is derived by rule, by concatenating the string "CJK UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-" (or other as specified, e.g. "TANGUT IDEOGRAPH-") to the code point, expressed in hexadecimal, with the usual 4 to 6 digit convention. The exact ranges subject to these name derivations are specified by a name range convention used in field 1 of UnicodeData.txt.
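The ideograph derivation can be sketched as follows. This is an illustration only: the range table is a small, hypothetical stand-in (block boundaries for a few well-known ranges), whereas the authoritative ranges are those marked by the name range convention in UnicodeData.txt. The names `IDEOGRAPH_RANGES` and `ideograph_name` are assumptions made for this example.

```python
from typing import Optional

# Illustrative subset only; the real ranges come from UnicodeData.txt.
IDEOGRAPH_RANGES = [
    (0x3400, 0x4DBF, "CJK UNIFIED IDEOGRAPH-"),
    (0x4E00, 0x9FFF, "CJK UNIFIED IDEOGRAPH-"),
    (0xF900, 0xFAFF, "CJK COMPATIBILITY IDEOGRAPH-"),
    (0x20000, 0x2A6DF, "CJK UNIFIED IDEOGRAPH-"),
]


def ideograph_name(cp: int) -> Optional[str]:
    """Return the derived name, or None if cp is outside the sample ranges."""
    for first, last, prefix in IDEOGRAPH_RANGES:
        if first <= cp <= last:
            # At least 4 uppercase hex digits; longer values are not padded.
            return f"{prefix}{cp:04X}"
    return None


print(ideograph_name(0x4E00))   # CJK UNIFIED IDEOGRAPH-4E00
print(ideograph_name(0x20000))  # CJK UNIFIED IDEOGRAPH-20000
```

Because the sample table uses block boundaries, it would also produce a name for code points in those blocks that are not actually assigned; an implementation driven by UnicodeData.txt would not.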

For all other Unicode code points of all types, the value of the UCD Name property is the null string. In other words, na="".

Note that the Unicode Name property values are unique for all non-null values, but not every Unicode code point has a unique Unicode Name property value. Furthermore, the Name property value uniqueness requirement interacts with name assignment rules for formal aliases and for named character sequences: Unicode character names, formal aliases, and named character sequences constitute a single, unique namespace.

As a corollary to this specification, it should be noted that the value of field 1 (the string of characters between the semicolon separators) is to be taken as the normative specification of the UCD Name property only for Graphic and Format characters other than ideographs and Hangul syllables. All other values which occur in field 1 are to be understood as meta-labels that serve other functions in the generation of names lists and charts, or to label abbreviated ranges of property definitions, but do not constitute values of the UCD Name property per se.
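The distinction between Name values and meta-labels in field 1 might be handled by a parser along the following lines. This is a sketch under stated assumptions: the three sample lines are written in the format of UnicodeData.txt for illustration, meta-labels are assumed to be exactly the values enclosed in angle brackets, and the function name `name_from_field1` is hypothetical.

```python
from typing import Tuple

# Sample lines in the semicolon-separated format of UnicodeData.txt.
SAMPLE_LINES = [
    "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;",
]


def name_from_field1(line: str) -> Tuple[int, str]:
    """Return (code point, Name value read literally from field 1).

    Angle-bracketed values are treated as meta-labels, so the literal
    Name value is the null string. For range markers such as
    "<CJK Ideograph, First>", the actual names are derived by rule
    and are not produced here.
    """
    fields = line.split(";")
    cp = int(fields[0], 16)
    field1 = fields[1]
    if field1.startswith("<") and field1.endswith(">"):
        return cp, ""
    return cp, field1


for line in SAMPLE_LINES:
    cp, name = name_from_field1(line)
    print(f"U+{cp:04X} na={name!r}")
```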

[[ In TUS 5.0, on page 79, after the existing definition D10 Code Point, insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

Table 1: Code Point Type Labels

Type            Label
Graphic         graphic
Format          format
Control         control
Reserved        reserved
Noncharacter    noncharacter
Private-Use     private-use
Surrogate       surrogate

D10c Code Point Label: A unique label for each code point in the Unicode codespace.

[[ Edit the following specification for the code point label to an appropriate set of bullets and/or body text, to fill out the definition. ]]

The code point label is distinguished from the expression of the code point per se (for example, "U+0000" or "U+0061"), which itself is also a unique identifier, as described in Appendix A, Notational Conventions. (See also Clause 6.5 Short identifiers for code positions (UIDs) in ISO/IEC 10646.)

The Unicode code point label is a unique string value defined as follows:

For any Unicode code point for which the value of the UCD Name property is non-null, the code point label is identical to the Unicode character name. This will be the case for all Graphic and Format code points.

Otherwise, the code point label is constructed as follows:

Concatenate the code point type label for the code point, a hyphen ("-"), and the 4 to 6 digit hexadecimal representation of the code point.

Table 2: Construction of Code Point Labels

Type            Label
Control         control-NNNN
Reserved        reserved-NNNN
Noncharacter    noncharacter-NNNN
Private-Use     private-use-NNNN
Surrogate       surrogate-NNNN

When displayed in mixed contexts with Unicode character name values, to avoid any possible confusion with actual, non-null Unicode Name values, constructed Unicode code point labels are displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, etc.
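The construction and display conventions above can be sketched in Python as follows. This is an illustration, not a normative algorithm: it relies on Python's unicodedata module, so results for recently assigned code points depend on the Unicode version bundled with the interpreter. The function names are hypothetical, and the code point type is inferred from General_Category plus a noncharacter test, which is an assumption of this sketch rather than part of the proposed text.

```python
import unicodedata

# Assumed mapping from General_Category to the type labels of Table 2.
TYPE_LABEL_BY_CATEGORY = {
    "Cc": "control",
    "Co": "private-use",
    "Cs": "surrogate",
    "Cn": "reserved",  # unassigned, once noncharacters are excluded
}


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int) -> str:
    """Return the character name if there is one, else a constructed label."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    ch = chr(cp)
    name = unicodedata.name(ch, "")
    if name:
        return name
    if is_noncharacter(cp):
        type_label = "noncharacter"
    else:
        category = unicodedata.category(ch)
        if category not in TYPE_LABEL_BY_CATEGORY:
            # Assigned character whose name the local database does not expose.
            raise LookupError(f"no name available for U+{cp:04X}")
        type_label = TYPE_LABEL_BY_CATEGORY[category]
    return f"{type_label}-{cp:04X}"


def display_label(cp: int) -> str:
    """Wrap constructed labels in angle brackets for mixed contexts."""
    label = code_point_label(cp)
    return label if unicodedata.name(chr(cp), "") else f"<{label}>"


print(display_label(0x0009))  # <control-0009>
print(display_label(0xFFFF))  # <noncharacter-FFFF>
print(display_label(0x0061))  # LATIN SMALL LETTER A
```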

APIs which return the value of a Unicode "name" for a given code point might vary somewhat in their behavior. An API which is defined as strictly returning the value of the Unicode Name property (the "na" attribute) should return a null string for any Unicode code point other than graphic or format characters, as that is the actual value of the property for such code points. On the other hand, an API which returns a name for Unicode code points, but which is expected to provide useful, unique labels for unassigned, reserved code points and other special code point types, should return the Unicode Code Point Label instead. As defined above, this will be the same as the Unicode Name property value for all graphic and format characters.
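As one example of the first kind of API, Python's unicodedata.name raises ValueError for a code point without a name unless a default is supplied. A strict Name-property accessor can be sketched by passing the null string as that default; the wrapper name `strict_name` is hypothetical.

```python
import unicodedata


def strict_name(cp: int) -> str:
    """Return the Name property value, or the null string if there is none."""
    return unicodedata.name(chr(cp), "")


print(repr(strict_name(0x0061)))  # 'LATIN SMALL LETTER A'
print(repr(strict_name(0x0009)))  # ''
```

An API of the second kind would differ only in what it returns where this sketch returns the null string: a constructed code point label rather than "".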