Re:    Code Point Name/Label Options

From:    Mark Davis

Date:    2008-11-06

URL:    http://docs.google.com/Doc?id=dfqr8rd5_357ftxjchgb

Here are the three options we discussed in the meeting.

Option A. Code Point Label (defined in [Whistler, L2/08-382]).

Option B. Define a Code Point Name property: Code_Point_Name (short name: CPName). This is a derived property defined in the same way as the Code Point Label in [Whistler, L2/08-382].

Option C. Expand the Name property to also cover the code points that had null values in Unicode 5.1, with values as defined in Ken's document [Whistler, L2/08-382].


In each of these options, the value would be as in [Whistler, L2/08-382], with the exception discussed in the meeting for the C0 controls.

Construction of Code Point Names/Labels

Type           Value (NNNN represents the code point)
C0 Controls    Field 10 of UnicodeData without parentheticals, e.g., FORM FEED
C1 Controls    control-NNNN
Reserved       reserved-NNNN
Noncharacter   noncharacter-NNNN
Private-Use    private-use-NNNN
Surrogate      surrogate-NNNN
Others         Field 1 of UnicodeData, or constructed values for Hangul Syllables or CJK Ideographs
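
The construction rules in the table above can be sketched in Python. This is only an illustration, not part of the proposal. The names code_point_name, is_noncharacter, and SAMPLE_FIELD_10 are hypothetical. Python's unicodedata module does not expose Field 10, so a few values are copied in by hand; C0 controls missing from that excerpt, and U+007F, fall back to control-NNNN, which is an assumption because the table does not cover them. Results for reserved code points depend on the Unicode version bundled with the Python build.

```python
import re
import unicodedata

# Hand-copied excerpt of UnicodeData.txt Field 10 for a few C0 controls.
# Illustrative only; a real implementation would read the full data file.
SAMPLE_FIELD_10 = {
    0x0000: "NULL",
    0x0009: "CHARACTER TABULATION",
    0x000A: "LINE FEED (LF)",
    0x000C: "FORM FEED (FF)",
    0x000D: "CARRIAGE RETURN (CR)",
}


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_name(cp):
    """Build the name/label for a code point following the table's rules."""
    hex_id = f"{cp:04X}"  # NNNN: uppercase hex, at least four digits
    ch = chr(cp)
    category = unicodedata.category(ch)

    if category == "Cc":
        field_10 = SAMPLE_FIELD_10.get(cp)
        if cp <= 0x1F and field_10 is not None:
            # C0 control: Field 10 with any parenthetical removed.
            return re.sub(r"\s*\([^)]*\)", "", field_10).strip()
        # C1 controls, plus (by assumption) anything not in the excerpt.
        return f"control-{hex_id}"
    if category == "Co":
        return f"private-use-{hex_id}"
    if category == "Cs":
        return f"surrogate-{hex_id}"
    if is_noncharacter(cp):
        return f"noncharacter-{hex_id}"
    if category == "Cn":
        return f"reserved-{hex_id}"
    # Others: Field 1, or the constructed Hangul/CJK name, as exposed by
    # unicodedata.name (raises ValueError if the module has no name).
    return unicodedata.name(ch)
```

For example, code_point_name(0x000C) gives FORM FEED, code_point_name(0x0085) gives control-0085, and code_point_name(0xFFFF) gives noncharacter-FFFF.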


Changes if we do Option C.

[[ As a 4th bullet under definition D4 Character Name in Chapter 3, insert ]]

[[Incorporate the following text in Section 4.8, "Name -- Normative", as a subsection, with appropriate editorial adjustments to other existing text in that section. ]]

Unicode Code Point Name

The Name property (short alias: "na") is a string property, defined as follows:

When displayed in mixed contexts, to emphasize the distinction between graphic/format code point names and others, the latter are often displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, etc.
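
The angle-bracket display convention could be applied along the following lines. This is a minimal sketch, assuming that "graphic/format" can be approximated by excluding the General_Category values Cc, Co, Cs, and Cn; the function name display_name is hypothetical, and the name string is supplied by the caller rather than derived here.

```python
import unicodedata

# General_Category values treated here as "not graphic or format".
# This mapping is an assumption made for the sketch.
NON_GRAPHIC_FORMAT = {"Cc", "Co", "Cs", "Cn"}


def display_name(cp, name):
    """Wrap the name in angle brackets unless cp is a graphic/format code point."""
    if unicodedata.category(chr(cp)) in NON_GRAPHIC_FORMAT:
        return f"<{name}>"
    return name
```

With this sketch, display_name(0xFFFF, "noncharacter-FFFF") gives <noncharacter-FFFF>, while display_name(0x0041, "LATIN CAPITAL LETTER A") returns the name unchanged.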

Note that the Unicode Name property values are unique for all code points. Furthermore, the Name property value uniqueness requirement interacts with name assignment rules for formal aliases and for named character sequences: Unicode character names, formal aliases, and named character sequences constitute a single, unique namespace.

The Name property values for all but reserved code points will not be changed. The Name property values for reserved code points will change if a character is assigned to the code point. For more information, see the Unicode Encoding Stability Policies.

As a corollary to this specification, it should be noted that the value of Field 1 (the string of characters between the semicolon separators) is to be taken as the normative specification of the UCD Name property only for Graphic and Format characters other than ideographs and Hangul syllables. All other values that occur in Field 1 are labels that serve other functions in the generation of names lists and charts, or to label abbreviated ranges of property definitions, but do not constitute values of the UCD Name property per se.

For any encoded character, the term "Character Name" refers to the Code Point Name.

[[ In TUS 5.0, on page 79, after the existing definition D10 Code Point, insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.
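
As an illustration of D10a, the seven types might be derived from General_Category roughly as follows. This is a sketch under stated assumptions, not normative text: the names CodePointType and code_point_type are hypothetical, and the mapping (Format as Cf, Zl, Zp; Graphic as every remaining assigned category) follows the standard's table of code point types as best understood here. The Reserved result depends on the Unicode version bundled with the Python build.

```python
from enum import Enum
import unicodedata


class CodePointType(Enum):
    GRAPHIC = "Graphic"
    FORMAT = "Format"
    CONTROL = "Control"
    PRIVATE_USE = "Private-Use"
    SURROGATE = "Surrogate"
    NONCHARACTER = "Noncharacter"
    RESERVED = "Reserved"


def code_point_type(cp):
    """Classify a code point into one of the seven types listed in D10a."""
    category = unicodedata.category(chr(cp))
    if category == "Cc":
        return CodePointType.CONTROL
    if category == "Co":
        return CodePointType.PRIVATE_USE
    if category == "Cs":
        return CodePointType.SURROGATE
    if category == "Cn":
        # Unassigned code points are either noncharacters or reserved.
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return CodePointType.NONCHARACTER
        return CodePointType.RESERVED
    if category in {"Cf", "Zl", "Zp"}:
        return CodePointType.FORMAT
    # Letters, marks, numbers, punctuation, symbols, and space separators.
    return CodePointType.GRAPHIC
```

Every code point from U+0000 to U+10FFFF falls into exactly one branch, which matches the intent that the seven types partition the code space.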

[[The current stability policy is:]]

Once a character is encoded, its character name will not be changed.

[[A request should be made to the officers to extend this to:]]

The Unicode Name Property Value for any non-reserved code point will not be changed. In particular, once a character is encoded, its name will not be changed.