L2/08-263


Title: Code Point Labels -- Suggested Wording Details

Source: Ken Whistler

Date: July 25, 2008

Ref:  L2/08-206, L2/08-249


**************************************************************

Background

At UTC #115, I presented the issues regarding the need for a
definition of code point labels (and clarifications regarding
the exact values of Unicode character names). I won't repeat
all the background here. For that, see L2/08-206.

The UTC approved the general proposal and tasked me in
AI 115-A036 to "Draft documentation of the concept of a
code point label distinct from Unicode character names, for
a future version of the standard."

I have drafted part of that documentation for the proposed
update for UAX #44, "Unicode Character Database," which
is posted for separate discussion and review. See L2/08-249.

The remainder of this document represents my suggested
disposition of the rest of the text, adapted from the proposal
text in L2/08-206, with some indication of where each piece
belongs in the text of the standard.

What I am suggesting now is that, if this generic disposition
meets with committee approval, the text be remanded to
the editorial committee for the detailed editorial work needed
for eventual insertion into the text of Unicode 5.2 (or Unicode 6.0).


**************************************************************

[[ As a 4th bullet under definition D4 Character Name in Chapter
3, insert ]]

* The detailed specification of the Unicode character names,
  including rules for derivation of some ranges of characters,
  is given in Section 4.8, "Name -- Normative". That section
  also describes the relationship between the normative value
  of the Name property and the contents of the corresponding
  data field in UnicodeData.txt in the Unicode Character Database.
  
[[Incorporate the following text in Section 4.8, "Name -- Normative",
as a subsection, with appropriate editorial adjustments to
other existing text in that section. ]] 

Unicode Character Name

The Name property (short alias: "na") is a string property.
Its value for all Graphic and Format characters is the
Unicode character name as generally understood. 

For Graphic and Format characters other than ideographs and Hangul 
syllables, the name is as listed in field 1 of UnicodeData.txt.

For Hangul syllables, the name is derived by rule, as specified
in Section 3.12, under "Hangul Syllable Name Generation",
making use of the values of the Jamo_Short_Name property.
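
[[ Illustration only, not proposed text. ]] The derivation by
rule can be sketched as follows. This is a minimal sketch,
assuming the arithmetic decomposition and the Jamo_Short_Name
values given in Section 3.12. The function name and the table
names are my own, chosen for illustration.

```python
# Sketch of Hangul syllable name generation, after Section 3.12.
# The function and table names are illustrative, not from the standard.

S_BASE = 0xAC00                 # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT     # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT     # 11172 syllables in total

# Jamo_Short_Name values, in index order.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI",
          "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]


def hangul_syllable_name(cp: int) -> str:
    """Return the derived name for a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a Hangul syllable")
    l_index = s_index // N_COUNT               # leading consonant
    v_index = (s_index % N_COUNT) // T_COUNT   # vowel
    t_index = s_index % T_COUNT                # trailing consonant
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])
```

For example, hangul_syllable_name(0xD4DB) decomposes U+D4DB into
the indices 17, 16 and 15, giving "HANGUL SYLLABLE PWILH".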

For ideographs, the name is derived by rule, by appending the
code point, expressed in hexadecimal with the usual 4 to 6 digit
convention, to the string "CJK UNIFIED IDEOGRAPH-" or
"CJK COMPATIBILITY IDEOGRAPH-" (or another prefix as specified,
e.g. "TANGUT IDEOGRAPH-"). The exact ranges subject to these
name derivations are specified by a name range convention used
in field 1 of UnicodeData.txt.
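
[[ Illustration only, not proposed text. ]] A minimal sketch of
the ideograph rule follows. It assumes the caller already knows
which prefix applies to the code point, since the ranges come
from UnicodeData.txt and are not repeated here. The function
name is hypothetical.

```python
# Sketch of the ideograph name derivation. The function name is
# illustrative; the caller supplies the prefix for the range in question.

def ideograph_name(cp: int, prefix: str = "CJK UNIFIED IDEOGRAPH-") -> str:
    """Append the code point in hexadecimal (4 to 6 digits) to the prefix."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    # "04X" pads to at least four uppercase hex digits and never
    # truncates, so supplementary code points get five or six digits.
    return prefix + format(cp, "04X")
```

For example, ideograph_name(0x4E00) gives
"CJK UNIFIED IDEOGRAPH-4E00", and ideograph_name(0x20000) gives
"CJK UNIFIED IDEOGRAPH-20000".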

For all *other* Unicode code points of all types, the
value of the UCD Name property is the null string. In
other words, na="".

Note that the Unicode Name property values are unique for
all non-null values, but not every Unicode code point has
a unique Unicode Name property value. Furthermore, the
Name property value uniqueness requirement interacts with
name assignment rules for formal aliases and for
named character sequences: Unicode character names, formal
aliases, and named character sequences constitute a single,
unique namespace.

As a corollary to this specification, it should be noted that
the value of field 1 (the string of characters between the
semicolon separators) is to be taken as the normative specification
of the UCD Name property only for Graphic and Format
characters other than ideographs and Hangul syllables. All
other values which occur in field 1 are to be understood
as meta-labels that serve other functions in the generation
of names lists and charts, or to label abbreviated ranges of
property definitions, but do *not* constitute values of the
UCD Name property per se.
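
[[ Illustration only, not proposed text. ]] A parser might apply
this corollary roughly as follows. The sketch assumes that the
meta-labels in field 1 are the values enclosed in angle brackets
(for example "<control>" or "<CJK Ideograph, First>"), which is
the convention in the data file but is not stated in the text
above. The function name is hypothetical, and the function only
reports whether field 1 supplies the name directly. It does not
derive ideograph or Hangul syllable names.

```python
# Sketch: read field 1 of a UnicodeData.txt line and decide whether
# it is a Name property value or a meta-label.
# Assumption: meta-labels are the values wrapped in angle brackets.
from typing import Optional


def listed_name(line: str) -> Optional[str]:
    """Return field 1 if it is a normative name, else None."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < 2:
        raise ValueError("not a UnicodeData.txt record")
    field1 = fields[1]
    if field1.startswith("<") and field1.endswith(">"):
        # Meta-label: the Name value is either derived by rule
        # (ideographs, Hangul syllables) or is the null string.
        return None
    return field1
```

Given the record "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
the function returns "LATIN CAPITAL LETTER A". Given
"0000;<control>;Cc;0;BN;;;;;N;NULL;;;;", it returns None.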

[[ In TUS 5.0, on page 79, after the existing definition
D10 Code Point, insert the following new definitions. ]]

D10a Code Point Type: Any of the seven fundamental classes
of code points in the standard: Graphic, Format, Control,
Private-Use, Surrogate, Noncharacter, Reserved.

  * See Table 2-3, "Types of Code Points" for a summary of
    the meaning and use of each class.
    
  * For Noncharacter, see also D14 Noncharacter.
  
  * For Reserved, see also D15 Reserved code point.
  
  * For Private-Use, see also D49 Private-use code point.
  
  * For Surrogate, see also D71 High-surrogate code point
    and D73 Low-surrogate code point.
    
D10b Code Point Type Label: A unique label for each code point
type.

  * Each code point type label is a lowercase string, defined
    according to the following table.

[[ Insert as table. Caption: Code Point Type Labels ]]

Type          Label
==========================
Graphic       graphic
Format        format
Control       control
Reserved      reserved
Noncharacter  noncharacter
Private-Use   private-use
Surrogate     surrogate
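
[[ Illustration only, not proposed text. ]] One way an
implementation might assign the type label is sketched below.
The mapping from General_Category values to code point types is
my reading of Table 2-3 and is an assumption, not part of the
proposed definitions. The function names are hypothetical, and
the result for unassigned code points depends on the version of
the UCD bundled with the Python runtime.

```python
# Sketch: assign a code point type label from the General_Category.
# The gc-to-type mapping is an assumption based on Table 2-3.
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_type_label(cp: int) -> str:
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the Unicode codespace")
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "control"
    if gc == "Cs":
        return "surrogate"
    if gc == "Co":
        return "private-use"
    if gc == "Cn":
        # Unassigned: either a noncharacter or a reserved code point.
        return "noncharacter" if is_noncharacter(cp) else "reserved"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    # Remaining categories: letters, marks, numbers, punctuation,
    # symbols, and space separators.
    return "graphic"
```

For example, code_point_type_label(0x0009) gives "control" and
code_point_type_label(0xFFFF) gives "noncharacter".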


D10c Code Point Label: A unique label for each code point in
the Unicode codespace.

[[ Edit the following specification for the code point
label to an appropriate set of bullets and/or body
text, to fill out the definition. ]]

The code point label is distinguished from the 
expression of the code point per se (for example, "U+0000"
or "U+0061"), which itself is also a unique identifier,
as described in Appendix A, Notational Conventions.
(See also Clause 6.5 Short identifiers for code positions
(UIDs) in ISO/IEC 10646.)

The Unicode code point label is a unique string value
defined as follows:

For any Unicode code point for which the value of the
UCD Name property is non-null, the code point label
is identical to the Unicode character name. This will
be the case for all Graphic and Format code points.

Otherwise, the code point label is constructed as follows:

Concatenate the code point type label for the code
point, a hyphen ("-"), and the 4 to 6 digit hexadecimal
representation of the code point.

[[ Insert as table. Caption: Construction of Code Point Labels ]]

Type          Label
=================================
Control       control-NNNN
Reserved      reserved-NNNN
Noncharacter  noncharacter-NNNN
Private-Use   private-use-NNNN
Surrogate     surrogate-NNNN

When displayed in mixed contexts with Unicode character
name values, to avoid any possible confusion with actual,
non-null Unicode Name values, constructed Unicode code point labels
are displayed between angle brackets: <control-0009>,
<noncharacter-FFFF>, etc.
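
[[ Illustration only, not proposed text. ]] The construction and
the display convention can be sketched together as follows. This
is a minimal sketch, assuming the caller supplies the Name
property value (the null string if there is none) and the code
point type label. Both function names are hypothetical.

```python
# Sketch of code point label construction and display.
# Function names are illustrative; name and type_label are inputs
# that the caller is assumed to have obtained from the UCD.

def code_point_label(cp: int, name: str, type_label: str) -> str:
    """Return the Name value if non-null, else a constructed label."""
    if name:
        return name
    # Type label, hyphen, then the code point in 4 to 6 hex digits.
    return f"{type_label}-{cp:04X}"


def display_label(cp: int, name: str, type_label: str) -> str:
    """Bracket constructed labels so they cannot be taken for names."""
    label = code_point_label(cp, name, type_label)
    return label if name else f"<{label}>"
```

For example, display_label(0x0009, "", "control") gives
"<control-0009>", while
display_label(0x0061, "LATIN SMALL LETTER A", "graphic") gives
the character name unchanged.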