L2/09-055

Public Review Issue #132: Code Point Name/Label Options

After considering the feedback on http://www.unicode.org/review/pr129.html, the UTC discussed the following options:

Option A. Define a Code Point Label property (as given in pr129.html). This is a derived property based on the existing Name property, plus constructed values for what are null Name property values in Unicode 5.1.

Option B. Define a Code Point Name property. This is a derived property defined in the same way as the Code Point Label in Option A (just a change of property name).

Option C. Don't define a new property, but instead expand the existing Name property to also cover code points that had null values in Unicode 5.1. For more details about what this would look like, see below.

Option D. Status Quo: do not define a new property, do not change the existing Name property.

The concern with Options A and B is that Unicode is already baroque: the difference between the Name and Code Point Name/Label properties would seem obscure to users and would cause confusion and errors.

The concern with Option C is that it changes a long-standing property and may cause confusion or difficulties for ISO 10646.

The concern with Option D is the continuing confusion between name values and the comments supplied in the Unicode Character Database.

In each of these options, the property values would be the following.

Construction of Names/Labels

Type	Value (NNNN represents the code point)
Controls	control-NNNN
Reserved	reserved-NNNN
Noncharacter	noncharacter-NNNN
Private-Use	private-use-NNNN
Surrogate	surrogate-NNNN
Others	Field 1 of UnicodeData or constructed values for Hangul Syllables or CJK Ideographs
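
As an illustration only, the table above can be sketched in Python. This is not part of the proposal: the function names code_point_label and is_noncharacter are hypothetical, the mapping from General_Category values (Cc, Cs, Co, Cn) to the types in the table is an assumption of the sketch, and the results for reserved code points depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int) -> str:
    """Return the name/label for a code point, following the table above.

    Hypothetical helper: the General_Category lookup stands in for the
    Code Point Type, which is an assumption of this sketch.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")

    hex_digits = f"{cp:04X}"  # usual 4 to 6 digit convention
    category = unicodedata.category(chr(cp))

    if category == "Cc":
        return "control-" + hex_digits
    if category == "Cs":
        return "surrogate-" + hex_digits
    if category == "Co":
        return "private-use-" + hex_digits
    if category == "Cn":
        prefix = "noncharacter-" if is_noncharacter(cp) else "reserved-"
        return prefix + hex_digits

    # "Others": Field 1 of UnicodeData.txt, or a name derived by rule.
    # Coverage of rule-derived names depends on the Python version.
    name = unicodedata.name(chr(cp), "")
    if not name:
        raise LookupError(f"no name available for U+{hex_digits}")
    return name
```

For example, code_point_label(0x9F) yields "control-009F" and code_point_label(0xD800) yields "surrogate-D800", matching the examples given later in this document.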

Changes if we do option A or B

The changes for A are given in pr129.html, while the changes for B are a straightforward modification of A.

Changes if we do option C

[[ As a 4th bullet under definition D4 Character Name in Chapter 3, insert ]]

The detailed specification of the Unicode character names, including rules for derivation of some ranges of characters, is given in Section 4.8, "Name -- Normative". That section also describes the relationship between the normative value of the Name property and the contents of the corresponding data field in UnicodeData.txt in the Unicode Character Database.

[[Incorporate the following text in Section 4.8, "Name -- Normative", as a subsection, with appropriate editorial adjustments to other existing text in that section. ]]

Unicode Code Point Name

The Name property (short alias: "na") is a string property, defined as follows:

For Hangul syllables, the Name property value is derived by rule, as specified in Section 3.12, under "Hangul Syllable Name Generation", making use of the values of the Jamo_Short_Name property.
For ideographs, the Name property value is derived by rule, by appending the code point, expressed in hexadecimal with the usual 4 to 6 digit convention, to the string "CJK UNIFIED IDEOGRAPH-" or "CJK COMPATIBILITY IDEOGRAPH-" (or another prefix as specified, e.g. "TANGUT IDEOGRAPH-"). The exact ranges subject to these Name derivations are specified by a Name range convention used in Field 1 of UnicodeData.txt.
For other Graphic and Format characters, the Name property value is as listed in Field 1 of UnicodeData.txt.
For all other Unicode code points, the Name property value is constructed by combining a prefix with the code point value, expressed in hexadecimal, with the usual 4 to 6 digit convention. The prefix corresponds to the Code Point Type (control, reserved, noncharacter, private-use, or surrogate) plus "-". For example: "control-009F", "surrogate-D800".
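
As a minimal sketch of the ideograph rule and the hexadecimal convention in the list above: the helper names hex_code_point and ideograph_name are hypothetical, and the caller is assumed to have already determined from the range conventions in UnicodeData.txt which prefix applies to the code point.

```python
def hex_code_point(cp: int) -> str:
    """Format a code point in uppercase hexadecimal, 4 to 6 digits."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return f"{cp:04X}"  # zero-padded to at least 4 digits


def ideograph_name(prefix: str, cp: int) -> str:
    """Derive an ideograph name by appending the hex code point to a prefix.

    The prefix (for example "CJK UNIFIED IDEOGRAPH-") is supplied by the
    caller; this sketch does not check that cp lies in a matching range.
    """
    return prefix + hex_code_point(cp)
```

Under these assumptions, ideograph_name("CJK UNIFIED IDEOGRAPH-", 0x4E00) gives "CJK UNIFIED IDEOGRAPH-4E00", and a supplementary-plane code point such as 0x20000 gives "CJK UNIFIED IDEOGRAPH-20000", with five digits.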

When displayed in mixed contexts, to emphasize the distinction between graphic/format code point names and others, the others are often displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, etc.
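
The angle-bracket display convention could be applied along the following lines. This is a sketch only: the function name display_form is hypothetical, and it assumes that constructed values can be recognized by their lowercase prefixes, since names listed in Field 1 of UnicodeData.txt do not use lowercase letters.

```python
# Prefixes of the constructed values for non-graphic, non-format code points.
CONSTRUCTED_PREFIXES = (
    "control-",
    "reserved-",
    "noncharacter-",
    "private-use-",
    "surrogate-",
)


def display_form(name: str) -> str:
    """Wrap constructed values in angle brackets; leave other names as-is."""
    if name.startswith(CONSTRUCTED_PREFIXES):
        return f"<{name}>"
    return name
```

With this helper, display_form("control-0009") returns "<control-0009>", while display_form("LATIN SMALL LETTER A") is returned unchanged.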

Note that the Name property values are unique for all code points. Furthermore, the Name property value uniqueness requirement interacts with name assignment rules for formal aliases and for named character sequences: Unicode character names, formal aliases, and named character sequences constitute a single, unique namespace.

The Name property values for all but reserved code points will not be changed. The Name property values for reserved code points will change if a character is assigned to the code point. For more information, see the Unicode Encoding Stability Policies.

As a corollary to this specification, it should be noted that the value of Field 1 (the string of characters between the semicolon separators) in UnicodeData.txt is the normative specification of the UCD Name property only for Graphic and Format characters other than ideographs and Hangul syllables. All other values which occur in Field 1 are labels that serve other functions in the generation of names lists and charts, or to label abbreviated ranges of property definitions, but do not constitute values of the Name property per se.

The term "character name" refers to the Name property value for an encoded character.

[[ In TUS 5.0, on page 79, after the existing definition D10 Code Point, insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

See Table 2-3, "Types of Code Points" for a summary of the meaning and use of each class.
For Noncharacter, see also D14 Noncharacter.
For Reserved, see also D15 Reserved code point.
For Private-Use, see also D49 Private-use code point.
For Surrogate, see also D71 High-surrogate code point and D73 Low-surrogate code point.

[[The current stability policy is:]]

Once a character is encoded, its character name will not be changed.

[[A request would be made to the officers to change it to be the following:]]

The Unicode Name property value for any non-reserved code point will not be changed. In particular, once a character is encoded its name will not be changed.