L2/07-026

From:	Mark Davis
Date:	2007-01-14
Re:	Property and Value Alias Issues

Eric and I have been looking at properties in connection with the XML work that Eric has been doing. In doing so, a number of items have come up. I've captured these below for discussion in the UTC.

The regexes in the Unihan property descriptions are useful for testing. Eric has noted, however, that some of them need fixing; see Table 1 below. Two other items:
1. The regex notation in Unihan.xml should use a standard regex notation for code point literals, such as Perl: \x{...}
2. There is an error with kFourCornerCode for U+6F5E, "3716. 3716.4"
It would be useful to have such regexes for the non-enumerated regular properties as well, perhaps in PropertyValueAliases.txt. Table 2 has a draft set for discussion.
We need to be more explicit about some of the string values, since reasonable people can differ in interpretation currently. One issue is what the value of the property is for code points not listed, such as unassigned code points. For other properties, we now document that in the data files, such as in DerivedAge.txt:
```
#  All code points not explicitly listed for Age
#  have the value unassigned.

# @missing: 0000..10FFFF; unassigned
```
But we don't do that for the string values. Recommendations are in Table 2 below. Generally the result should be some name if it is a catalog-like property, "" (empty) if it is information about a string (such as bmg), and # (the source character itself) if it is a folding (since unaffected characters should be left alone). This also needs to be applied to the Unihan provisional properties.
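As a rough illustration of how a consumer might use such a default, here is a minimal Python sketch. It assumes the `@missing` line has exactly the shape shown in the DerivedAge.txt excerpt, and it treats `#` as "the source character itself", as recommended above. The function names `parse_missing` and `value_for` are made up for this example.

```python
import re

# Matches a default-value comment of the form shown in DerivedAge.txt:
#   # @missing: 0000..10FFFF; unassigned
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(.*?)\s*$"
)

def parse_missing(line):
    # Returns (first, last, default) for an @missing line, or None otherwise.
    m = MISSING_RE.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)

def value_for(cp, listed, missing):
    # Explicitly listed values win; otherwise fall back to the default.
    if cp in listed:
        return listed[cp]
    first, last, default = missing
    if not first <= cp <= last:
        raise ValueError("no value and no default for U+%04X" % cp)
    # Assumption from this memo: "#" stands for the source character itself.
    return chr(cp) if default == "#" else default

age_default = parse_missing("# @missing: 0000..10FFFF; unassigned")
print(value_for(0x0378, {0x0041: "1.1"}, age_default))
```

The same lookup would serve a folding-like property if its file carried a line such as `# @missing: 0000..10FFFF; #`, which is a hypothetical line, not one taken from an existing data file.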
We do not make clear in PropertyValueAliases.txt what the default notation is for booleans. Eric chose N/Y on the pattern of NFD_Quick_Check, while I'd been using F/T. We should document whatever we choose in PropertyValueAliases.txt (and probably Eric's choice is the better one). Note that this places no requirement on APIs; it is just the format we choose for relaying information.
The Jamo property was not done for 5.0, as per the following action. It should be fixed in the next version. This needs no action from the UTC, since we already have an action to do it.
- [106-C20] Consensus: Document the Jamo_Short_Name property as a "contributory" property for Unicode 5.0 in UCD.html, PropertyAliases.txt and PropertyValueAliases.txt. Ref L2/05-379R.
We need to document that the algorithmic decomposition mapping values for Hangul syllables are not the full ones but the pair-wise ones. These correspond to all the other decomposition mappings for NFC. Example:
cp=CE31, dm=<CE20 11B8>, not <110E 1173 11B8>
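The difference between the two forms can be sketched in Python using the standard Hangul syllable arithmetic. The function names `pairwise_dm` and `full_dm` are illustrative only; the constants are the usual SBase/LBase/VBase/TBase values.

```python
# Constants of the Hangul syllable arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT           # 11172 precomposed syllables

def _s_index(cp):
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("U+%04X is not a Hangul syllable" % cp)
    return s_index

def pairwise_dm(cp):
    # Pair-wise mapping: an LVT syllable maps to <LV, T>, an LV syllable to <L, V>.
    s_index = _s_index(cp)
    t_index = s_index % T_COUNT
    if t_index:
        return [cp - t_index, T_BASE + t_index]
    return [L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT]

def full_dm(cp):
    # Full decomposition: always down to the individual jamo.
    s_index = _s_index(cp)
    jamo = [L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT]
    t_index = s_index % T_COUNT
    if t_index:
        jamo.append(T_BASE + t_index)
    return jamo

print(["%04X" % c for c in pairwise_dm(0xCE31)])  # ['CE20', '11B8']
print(["%04X" % c for c in full_dm(0xCE31)])      # ['110E', '1173', '11B8']
```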
Eric found a problem in CompositionExclusions in a comment: "if you look at the character count for pile #3, it says 924. I believe it should be 1030. If you just add the four largest ranges, you already get more than 924: 542+106+59+270 = 977."
The intention for the canonicalized block names in http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt is for them to be suitable for use as identifiers. But some of them have "-" in the name. The proposal is to make the old value an alias and add the fixed new value. Here is an example of the change.
```
blk; n/a       ; Arabic_Presentation_Forms-A
=>
blk; n/a       ; Arabic_Presentation_Forms_A; Arabic_Presentation_Forms-A
```
Note: an alternative is to just replace them, since we specify that name matching ignores those characters.
The canonical names in http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt are all titlecase or uppercase, except those for Decomposition_Type (dt). It would be more uniform if we fixed them. Here is an example of the change.
```
dt ; can       ; Canonical
=>
dt ; Can       ; Canonical   ; can
```
Note: an alternative is to just replace them, since we specify that name matching ignores case differences.
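Both notes rest on loose matching of names. A minimal Python sketch of such matching follows. It assumes the ignored characters are space, underscore, and hyphen, together with case; `loose_key` and `same_name` are hypothetical helpers, not part of any data file or API. Python's `str.isidentifier` stands in for "suitable for use as identifiers", which is only an approximation of what other languages accept.

```python
def loose_key(name):
    # Fold a property or value name for comparison.
    ignored = " _-"  # assumption: space, underscore, and hyphen are ignored
    return "".join(ch for ch in name.lower() if ch not in ignored)

def same_name(a, b):
    # Two names match if they are equal after folding.
    return loose_key(a) == loose_key(b)

old_blk = "Arabic_Presentation_Forms-A"
new_blk = "Arabic_Presentation_Forms_A"

print(same_name(old_blk, new_blk))   # True: the hyphen and underscore are both ignored
print(same_name("can", "Can"))       # True: case is ignored
print(old_blk.isidentifier(), new_blk.isidentifier())  # False True
```

Under this kind of matching, simply replacing the old spellings would not break any lookup that already folds names. Keeping the old spelling as an alias only matters to consumers that compare names exactly.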
The values of the String properties need to be better documented regarding blank values in the source files. Where the field in UnicodeData.txt is blank, that indicates that the code point maps to itself. Thus Lowercase_Mapping("a") = "a", not the empty string. We document this for the case of the simple lower/title/uppercase mappings (as below, from UCD.html):
Note: The simple lowercase may be omitted in the data file if the lowercase is the same as the code point itself.

We need to document this for the other foldings:
```
cf        ; Case_Folding
dm        ; Decomposition_Mapping
FC_NFKC   ; FC_NFKC_Closure
lc        ; Lowercase_Mapping
scc       ; Special_Case_Condition
sfc       ; Simple_Case_Folding
tc        ; Titlecase_Mapping
uc        ; Uppercase_Mapping
```
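The blank-means-identity rule can be sketched in Python for the simple lowercase mapping, which is the case UCD.html already documents. The sketch assumes the usual semicolon-separated UnicodeData.txt layout, with the simple lowercase mapping in field 13 (counting from 0). `simple_lowercase` is a name invented for this example.

```python
def simple_lowercase(line):
    # Parse one UnicodeData.txt record and return its simple lowercase mapping.
    fields = line.split(";")
    cp = int(fields[0], 16)
    mapped = fields[13]  # simple lowercase mapping; blank if unchanged
    # A blank field means the code point maps to itself, not to "".
    return chr(int(mapped, 16)) if mapped else chr(cp)

upper_a = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
lower_a = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"

print(simple_lowercase(upper_a))  # a
print(simple_lowercase(lower_a))  # a (the field is blank, so "a" maps to itself)
```

The other foldings listed above would presumably follow the same pattern once documented: absence of an entry yields the source character rather than an empty string.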
The abbreviation sfc for Simple_Case_Folding has two letters reversed. Thus it should be fixed to:
```
sfc ; Simple_Case_Folding
=>
scf ; Simple_Case_Folding ; sfc
```
The numeric values given in http://unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt are in decimal format (e.g., for U+00BD, nv="0.5"), while the format in UCD.html is rational numbers (e.g., "1/2"). We should consider fixing this lack of synchrony, probably by changing the DerivedNumericValues.txt format.
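To illustrate why the rational form is the safer interchange format, here is a small Python sketch using the standard `fractions` module. The helper names are hypothetical, and the truncated decimal `0.333333` is an illustrative value, not a quotation from the data file.

```python
from fractions import Fraction

def parse_nv(text):
    # Fraction accepts both "1/2" and "0.5" style strings.
    return Fraction(text)

def to_rational(text):
    # Re-express a value in the rational notation used by UCD.html.
    return str(parse_nv(text))

print(parse_nv("0.5") == parse_nv("1/2"))      # True: 0.5 is exactly 1/2
print(to_rational("0.5"))                      # 1/2
print(parse_nv("0.333333") == Fraction(1, 3))  # False: a truncated decimal is not 1/3
```

Values such as 1/2 round-trip cleanly, but any fraction whose denominator has a prime factor other than 2 or 5 cannot be written exactly as a finite decimal.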
The scc / Special_Case_Condition property is not really well defined in terms of its values. The overall recommendation from the ed committee is that this be retracted as a property, and that the following information be characterized in UCD.html and in the XML version as "conditional casing data" instead of a formal property:
CaseFolding.txt
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0049; T; 0131; # LATIN CAPITAL LETTER I

SpecialCasing.txt
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
...
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I

**Table 1. Proposed Modified Regex for Unihan**

| Name | Rec. Regex for Allowable Values |
| --- | --- |
| kCheungBauerIndex | `/[0-9]{3}\.[0-9]{2}/` |
| kFennIndex | `/[1-9][0-9]{0,2}\.[01][0-9]/` |
| kGSR | `/[0-9]{4}[a-vx-z]'?/` |
| kHDZRadBreak | `/[\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]/` |
| kIRGDaeJaweon | `/([0-9]{4}\.[0-9]{2}[01])\|(0000\.555)/` |
| kPhonetic | `/[1-9][0-9]{0,3}[A-D]?\*?/` |
| kTang | `/\*?[A-Za-z\(\)\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+/` |
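
Since these regexes are meant for testing, here is a minimal Python sketch of how two of them could be applied. Python's `re` module does not accept Perl's `\x{...}` notation, so the sketch rewrites those literals first. It also assumes a value must match the whole pattern, not just part of it. The helper names and the sample values are illustrative only.

```python
import re

# Perl-style code point literal, e.g. \x{2F00}
_PERL_HEX = re.compile(r"\\x\{([0-9A-Fa-f]+)\}")

def perl_to_python(pattern):
    # Rewrite each \x{H...} as an eight-digit \UXXXXXXXX escape for Python's re.
    return _PERL_HEX.sub(lambda m: "\\U%08X" % int(m.group(1), 16), pattern)

def matches(table_entry, value):
    # Table entries are written /.../, so strip the delimiters first.
    body = table_entry.strip()
    if body.startswith("/") and body.endswith("/"):
        body = body[1:-1]
    # Assumption: the whole value must match.
    return re.fullmatch(perl_to_python(body), value) is not None

K_FENN = r"/[1-9][0-9]{0,2}\.[01][0-9]/"
K_HDZ = r"/[\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]/"

print(matches(K_FENN, "123.05"))                     # True
print(matches(K_FENN, "023.05"))                     # False: leading zero not allowed
print(matches(K_HDZ, "\u2f00[U+4E00]:10001.010"))    # True
```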

**Table 2. Proposed Regex and Unlisted values**

| Abbr | Name | Rec. Regex for Allowable Values | Rec. Value for Unlisted |
| --- | --- | --- | --- |
| age | Age | `/([0-9]+\.[0-9]\|unassigned)/` | unassigned (already defined) |
| nv | Numeric_Value | `/-?[0-9]+\.[0-9]+/` | NaN |
| blk | Block | `/[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/` | No_Block |
| sc | Script | `/[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/` | # |
| dm | Decomposition_Mapping | `/[\x{0}-\x{10FFFF}]+/` | |
| FC_NFKC | FC_NFKC_Closure | `/[\x{0}-\x{10FFFF}]+/` | |
| cf | Case_Folding | `/[\x{0}-\x{10FFFF}]+/` | |
| lc | Lowercase_Mapping | | |
| tc | Titlecase_Mapping | | |
| uc | Uppercase_Mapping | | |
| sfc | Simple_Case_Folding | `/[\x{0}-\x{10FFFF}]/` | |
| slc | Simple_Lowercase_Mapping | | |
| stc | Simple_Titlecase_Mapping | | |
| suc | Simple_Uppercase_Mapping | | |
| bmg | Bidi_Mirroring_Glyph | `/[\x{0}-\x{10FFFF}]?/` | "" |
| isc | ISO_Comment | `/([A-Z0-9]+(([-\ ]\|\ -\|-\ )[A-Z0-9]+)*\|\)?/` | "" |
| na1 | Unicode_1_Name | `/([A-Z0-9]+(([-\ ]\|\ -\|-\ )[A-Z0-9]+)*(\ \((CR\|FF\|LF\|NEL)\))?)?/` | `<reserved>` |
| na | Name | `/([A-Z0-9]+(([-\ ]\|\ -\|-\ )[A-Z0-9]+)*\|\)?/` | `<reserved>` |