L2/07-026R3
From: | Mark Davis |
Date: | 2007-01-14 (revised 02-07) |
Re: | Property and Value Alias Issues |
Eric and I have been looking at properties in connection with the XML work that Eric has been doing. In doing so, a number of items have come up. I've captured these below for discussion in the UTC.
# All code points not explicitly listed for Age
# have the value unassigned.
# @missing: 0000..10FFFF; unassigned
But we don't do that for the string values. Recommendations are in Table 2 below; the proposal is to document them in UCD.html and PropertyAliases.txt. In general, the default should be some name if the property is catalog-like, "" (empty) if the property gives information about a string (such as bmg), and # (the source character itself) if the property is a folding, since unaffected characters should be left alone. This also needs to be applied to the Unihan provisional properties.
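As a rough illustration of how a consumer of the data files might use such a line, the following Python sketch parses an @missing comment of the shape shown for Age and falls back to its default for unlisted code points. The function names and the `listed` dictionary are hypothetical, and the line syntax is assumed from the single Age example rather than from any formal grammar.

```python
import re

# Assumed line shape, taken from the Age example:
#   # @missing: 0000..10FFFF; unassigned
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;\s*(.*?)\s*$"
)

def parse_missing(line):
    """Return (first, last, default) for an @missing line, else None."""
    m = MISSING_RE.match(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)

def lookup(cp, listed, missing):
    """Property value for cp: the explicit entry if present, else the default."""
    if cp in listed:
        return listed[cp]
    first, last, default = missing
    if first <= cp <= last:
        return default
    raise KeyError(f"U+{cp:04X} is outside the @missing range")

# Hypothetical one-entry excerpt of listed Age values.
age_listed = {0x0041: "1.1"}
age_missing = parse_missing("# @missing: 0000..10FFFF; unassigned")
```

With this sketch, `lookup(0x0041, age_listed, age_missing)` yields the listed value and any unlisted code point yields `unassigned`. An empty default ("") would parse the same way, as an empty string.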
cp=CE31, dm=<CE20 11B8>, not <110E 1173 11B8>
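The difference between the two values can be checked with the standard Hangul syllable arithmetic from the Unicode Standard. The sketch below is illustrative only; the function names are hypothetical. It computes the one-step mapping (an LVT syllable splits into LV plus T) and, separately, the full decomposition.

```python
# Constants from the Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables

def hangul_dm(cp):
    """One-step decomposition of a Hangul syllable; (cp,) for anything else."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return (cp,)
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: split into the LV syllable and the trailing consonant.
        return (cp - t_index, T_BASE + t_index)
    # LV syllable: split into leading consonant and vowel.
    lv_index = s_index // T_COUNT
    return (L_BASE + lv_index // V_COUNT, V_BASE + lv_index % V_COUNT)

def hangul_full(cp):
    """Full decomposition, obtained by applying hangul_dm recursively."""
    parts = hangul_dm(cp)
    if parts == (cp,):
        return parts
    return tuple(c for p in parts for c in hangul_full(p))

print([f"{c:04X}" for c in hangul_dm(0xCE31)])    # ['CE20', '11B8']
print([f"{c:04X}" for c in hangul_full(0xCE31)])  # ['110E', '1173', '11B8']
```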
blk; n/a ; Arabic_Presentation_Forms-A => blk; n/a ; Arabic_Presentation_Forms_A; Arabic_Presentation_Forms-A
dt ; can ; Canonical => dt ; Can ; Canonical ; can
Note: The simple lowercase may be omitted in the data file if the lowercase is the same as the code point itself.
We need to document this for the other foldings:
cf ; Case_Folding (when not listed)
dm ; Decomposition_Mapping
FC_NFKC ; FC_NFKC_Closure (when not listed)
lc ; Lowercase_Mapping
scc ; Special_Case_Condition
sfc ; Simple_Case_Folding
tc ; Titlecase_Mapping
uc ; Uppercase_Mapping
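The "omitted means unchanged" convention for these mappings could be applied by a lookup along the following lines. This is only a sketch: `mapping_or_self` and the two-entry `LOWERCASE` table are hypothetical stand-ins for data read from the files.

```python
def mapping_or_self(cp, listed):
    """Return the listed mapping string for cp, or cp itself when it is omitted."""
    return listed.get(cp, chr(cp))

# Illustrative excerpt of a lowercase table: only code points whose
# mapping differs from the code point itself are listed.
LOWERCASE = {
    0x0041: "\u0061",        # LATIN CAPITAL LETTER A
    0x0130: "\u0069\u0307",  # LATIN CAPITAL LETTER I WITH DOT ABOVE
}

for cp in (0x0041, 0x0031, 0x0130):
    mapped = mapping_or_self(cp, LOWERCASE)
    print(f"U+{cp:04X} ->", " ".join(f"{ord(c):04X}" for c in mapped))
```

Using strings rather than single code points lets the same lookup serve both the simple mappings and the full ones, which may expand to several code points.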
sfc ; Simple_Case_Folding => scf ; Simple_Case_Folding ; sfc
CaseFolding.txt
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0049; T; 0131; # LATIN CAPITAL LETTER I
SpecialCasing.txt
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
...
0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
During the meeting, the question of the rules for these values came up. I looked at them, and here is what I found.
I suggest that we leave the property the way it is, and change the documentation.
Name | Rec. Regex for Allowable Values |
kCheungBauerIndex | /[0-9]{3}\.[0-9]{2}/ |
kFennIndex | /[1-9][0-9]{0,2}\.[01][0-9]/ |
kGSR | /[0-9]{4}[a-vx-z]'?/ |
kHDZRadBreak | /[\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]/ |
kIRGDaeJaweon | /([0-9]{4}\.[0-9]{2}[01])|(0000\.555)/ |
kPhonetic | /[1-9][0-9]{0,3}[A-D]?\*?/ |
kTang | /\*?[A-Za-z\(\)\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+/ |
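One way to try out the recommended patterns is to transcribe them into Python and require a match over the whole value. In the sketch below, the Perl-style `\x{...}` escapes are rewritten as `\uXXXX`, which is an adaptation for Python's `re` module and not part of the proposal. The helper name `is_allowable` is hypothetical, and the sample values in the usage note are made up to exercise the patterns, not taken from Unihan.

```python
import re

UNIHAN_PATTERNS = {
    "kCheungBauerIndex": r"[0-9]{3}\.[0-9]{2}",
    "kFennIndex": r"[1-9][0-9]{0,2}\.[01][0-9]",
    "kGSR": r"[0-9]{4}[a-vx-z]'?",
    "kHDZRadBreak": r"[\u2F00-\u2FD5]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]",
    "kIRGDaeJaweon": r"([0-9]{4}\.[0-9]{2}[01])|(0000\.555)",
    "kPhonetic": r"[1-9][0-9]{0,3}[A-D]?\*?",
    "kTang": r"\*?[A-Za-z\(\)\u00E6\u0251\u0259\u025B\u0300\u030C]+",
}
COMPILED = {name: re.compile(p) for name, p in UNIHAN_PATTERNS.items()}

def is_allowable(prop, value):
    """True if value matches the recommended pattern for prop in its entirety."""
    return COMPILED[prop].fullmatch(value) is not None
```

For instance, `is_allowable("kGSR", "0001a'")` is true under these patterns, while `is_allowable("kGSR", "0001w")` is false because the letter class skips `w`.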
Abbr | Name | Rec. Regex for Allowable Values for the listing of properties in our data files | Rec. Value for Unlisted |
age | Age | /([0-9]+\.[0-9]|unassigned)/ | unassigned (already defined) |
nv | Numeric_Value | /-?[0-9]+\.[0-9]+/ NEEDS fixing for fractions | NaN |
blk | Block | /[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/ | No_Block (add Script - Unknown) |
sc | Script | The code point itself, but # can be used to represent that in certain circumstances. | |
dm | Decomposition_Mapping | /[\x{0}-\x{10FFFF}]+/ | |
FC_NFKC | FC_NFKC_Closure | ||
cf | Case_Folding | /[\x{0}-\x{10FFFF}]+/ | |
lc | Lowercase_Mapping | ||
tc | Titlecase_Mapping | ||
uc | Uppercase_Mapping | ||
sfc | Simple_Case_Folding | /[\x{0}-\x{10FFFF}]/ | |
slc | Simple_Lowercase_Mapping | ||
stc | Simple_Titlecase_Mapping | ||
suc | Simple_Uppercase_Mapping | ||
bmg | Bidi_Mirroring_Glyph | /[\x{0}-\x{10FFFF}]?/ | "" |
isc | ISO_Comment | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\ Asmus/Ken to supply actual value |
|
na1 | Unicode_1_Name | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*(\ \((CR|FF|LF|NEL)\))?)?/ look also at the angle brackets. |
"" for na1. Null or empty should be the default in properties; in display the following can be used: Note: or with the form <private-use-E000>, which we use in the charts. |
na | Name | /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\ |
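The complete patterns in the first rows of this table (age, nv, blk) can be exercised the same way. The following sketch transcribes them into Python unchanged and tests a few values; the sample values are chosen for illustration only. It shows the fraction gap flagged for nv, and that the hyphenated block name Arabic_Presentation_Forms-A does not fit the blk pattern while the proposed Arabic_Presentation_Forms_A does.

```python
import re

AGE = re.compile(r"([0-9]+\.[0-9]|unassigned)")
NUMERIC_VALUE = re.compile(r"-?[0-9]+\.[0-9]+")
BLOCK = re.compile(r"[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*")

def check(pattern, value):
    """True if the whole value matches the pattern."""
    return pattern.fullmatch(value) is not None

samples = [
    (AGE, "5.0"),
    (AGE, "unassigned"),
    (NUMERIC_VALUE, "-0.5"),
    (NUMERIC_VALUE, "1/2"),  # fractions are not yet covered by the pattern
    (BLOCK, "Arabic_Presentation_Forms_A"),
    (BLOCK, "Arabic_Presentation_Forms-A"),  # hyphen is outside the pattern
]
for pattern, value in samples:
    print(f"{value!r}: {check(pattern, value)}")
```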