L2/07-026

From: Mark Davis
Date: 2007-01-14
Re: Property and Value Alias Issues

Eric and I have been looking at properties in connection with the XML work that Eric has been doing. In doing so, a number of items have come up. I've captured these below for discussion in the UTC.

  1. The regexes in the Unihan descriptions are useful for testing. Eric has noted, however, that they need some fixing: see Table 1 below. Two other items:
    1. The regex notation in Unihan.xml should use a standard regex notation for codepoint literals, such as Perl: \x{...}
    2. There is an error with kFourCornerCode for U+6F5E, "3716. 3716.4"
  2. For non-enumerated regular properties, it would be useful to have regexes as well, perhaps in PropertyValueAliases.txt. Table 2 has a draft set for discussion.
  3. We need to be more explicit about some of the string values, since reasonable people can currently interpret them differently. One issue is what value the property has for code points that are not listed, such as unassigned code points. For other properties, we now document that in the data files, such as in DerivedAge.txt:
    #  All code points not explicitly listed for Age
    #  have the value unassigned.
    
    # @missing: 0000..10FFFF; unassigned

    But we don't do that for the string values. Recommendations are in Table 2 below. Generally, the result should be some name if the property is catalog-like, "" (empty) if it is information about a string (such as bmg), and # (the source character itself) if it is a folding (since unaffected characters should be left alone). This also needs to be applied to the Unihan provisional properties.

  4. We do not make clear in PropertyValueAliases.txt what the default notation is for booleans. Eric chose N/Y on the pattern of NFD_Quick_Check, while I'd been using F/T. We should document whatever we choose in PropertyValueAliases.txt (and probably Eric's choice is the better one). Note that this places no requirement on APIs; it is just the format we choose for relaying information.
  5. The Jamo property was not done for 5.0, as per the following action. It should be fixed in the next version. This needs no action from the UTC, since we already have an action to do it.
  6. We need to document that the algorithmic decomposition mapping values for Hangul syllables are not the full ones but the pair-wise ones. These correspond to all the other decomposition mappings for NFC. Example:

    cp=CE31, dm=<CE20 11B8>, not <110E 1173 11B8>

  7. Eric found a problem in CompositionExclusions in a comment: "if you look at the character count for pile #3, it says 924. I believe it should be 1030. If you just add the four largest ranges, you already get more than 924: 542+106+59+270 = 977."
  8. The intention for the canonicalized block names in http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt is for them to be suitable for use as identifiers. But some of them have "-" in the name. The proposal is to make the old value an alias and add the fixed new value. Here is an example of the change.
    blk; n/a       ; Arabic_Presentation_Forms-A
    =>
    blk; n/a       ; Arabic_Presentation_Forms_A; Arabic_Presentation_Forms-A

    Note: an alternative is to just replace them, since we specify that name matching ignores those characters.

  9. The canonical names in http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt are all titlecase or uppercase, except those for Decomposition_Type (dt). It would be more uniform if we fixed them. Here is an example of the change.
    dt ; can       ; Canonical
    =>
    dt ; Can       ; Canonical   ; can

    Note: an alternative is to just replace them, since we specify that name matching ignores case differences.

  10. The values of the String properties need to be better documented regarding blank values in the source files. Where the value in UnicodeData.txt is blank, the code point maps to itself. Thus Lowercase_Mapping("a") = "a", not the empty string. We document this for the simple lower/title/uppercase mappings (as below, from UCD.html):

    Note: The simple lowercase may be omitted in the data file if the lowercase is the same as the code point itself.

    We need to document this for the other foldings:

    cf        ; Case_Folding
    dm        ; Decomposition_Mapping
    FC_NFKC   ; FC_NFKC_Closure
    lc        ; Lowercase_Mapping
    scc       ; Special_Case_Condition
    sfc       ; Simple_Case_Folding
    tc        ; Titlecase_Mapping
    uc        ; Uppercase_Mapping
  11. The abbreviation sfc for Simple_Case_Folding has two letters reversed. Thus it should be fixed to:
    sfc ; Simple_Case_Folding
    =>
    scf ; Simple_Case_Folding ; sfc
  12. The numeric values given in http://unicode.org/Public/UNIDATA/extracted/DerivedNumericValues.txt are in decimal format (e.g., for U+00BD nv="0.5"), while the format in UCD.html is rational numbers (e.g., "1/2"). We should consider fixing this lack of synchrony, probably by changing the DerivedNumericValues.txt format.
  13. The scc / Special_Case_Condition property is not well defined in terms of its values. The overall recommendation from the editorial committee is that it be retracted as a property, and that the following information be characterized in UCD.html and in the XML version as "conditional casing data" instead of a formal property:

    CaseFolding.txt
    0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
    0049; T; 0131; # LATIN CAPITAL LETTER I

    SpecialCasing.txt
    03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
    ...
    0069; 0069; 0130; 0130; az; # LATIN SMALL LETTER I
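
The @missing convention quoted in item 3 can be consumed mechanically. The following is a minimal Python sketch, assuming the one-range, one-value line format shown in the DerivedAge.txt excerpt. The names parse_missing and lookup are hypothetical, and the explicit entry for U+0041 is only there to illustrate the lookup.

```python
import re

# Matches lines such as "# @missing: 0000..10FFFF; unassigned".
# Assumption: one range and one default value per line, as in the
# DerivedAge.txt excerpt quoted in item 3.
MISSING_RE = re.compile(
    r"#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(.*?)\s*$"
)


def parse_missing(lines):
    """Return (start, end, default) tuples for every @missing line."""
    defaults = []
    for line in lines:
        m = MISSING_RE.match(line.strip())
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return defaults


def lookup(cp, listed, defaults):
    """Value for code point cp: the explicit value if listed, else the default."""
    if cp in listed:
        return listed[cp]
    for start, end, value in defaults:
        if start <= cp <= end:
            return value
    raise KeyError(f"no value or default for U+{cp:04X}")


sample = [
    "#  All code points not explicitly listed for Age",
    "#  have the value unassigned.",
    "",
    "# @missing: 0000..10FFFF; unassigned",
]
defaults = parse_missing(sample)
listed = {0x0041: "1.1"}  # illustrative explicit entry
```

With string properties documented the same way, a consumer would never have to guess the value for an unlisted code point.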

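The distinction in item 6 between pair-wise and full Hangul decomposition can be sketched with the standard Hangul syllable arithmetic. This is an illustrative Python sketch, not the form in which the data would be published; the function names are hypothetical.

```python
# Constants for Hangul syllable arithmetic, as given in the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588
S_COUNT = L_COUNT * N_COUNT  # 11172


def pairwise_decomposition(cp):
    """One decomposition step: LVT -> <LV, T>, LV -> <L, V>."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]  # not a precomposed Hangul syllable
    t_index = s_index % T_COUNT
    if t_index:
        return [cp - t_index, T_BASE + t_index]
    return [L_BASE + s_index // N_COUNT, V_BASE + (s_index % N_COUNT) // T_COUNT]


def full_decomposition(cp):
    """Full decomposition into conjoining jamo: <L, V> or <L, V, T>."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return [cp]
    parts = [L_BASE + s_index // N_COUNT, V_BASE + (s_index % N_COUNT) // T_COUNT]
    t_index = s_index % T_COUNT
    if t_index:
        parts.append(T_BASE + t_index)
    return parts
```

For cp=CE31, pairwise_decomposition yields <CE20 11B8> and full_decomposition yields <110E 1173 11B8>, matching the example in item 6.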
     
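The blank-value rule in item 10 can be illustrated with the simple case mappings, for which it is already documented. This Python sketch assumes the usual semicolon-separated UnicodeData.txt layout, with the simple uppercase, lowercase, and titlecase mappings in fields 12, 13, and 14; the function name is hypothetical, and the keys reuse the abbreviations from Table 2.

```python
def simple_case_mappings(unicode_data_line):
    """Return the simple upper/lower/title mappings for one UnicodeData.txt line."""
    fields = unicode_data_line.split(";")
    cp = int(fields[0], 16)

    def mapped(field):
        # A blank field means "maps to itself", not "maps to nothing".
        return int(field, 16) if field else cp

    # Fields 12, 13, 14: simple uppercase, lowercase, titlecase mappings.
    return {
        "suc": mapped(fields[12]),
        "slc": mapped(fields[13]),
        "stc": mapped(fields[14]),
    }


line = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"
mappings = simple_case_mappings(line)
```

Here the blank lowercase field resolves to U+0061 itself, which is the interpretation item 10 asks to have documented for the other mappings as well.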

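For item 12, the two notations can be reconciled by parsing both into exact rationals. This is a small Python sketch using the standard fractions module; the function names are hypothetical.

```python
from fractions import Fraction


def parse_numeric_value(text):
    """Parse either notation ("0.5" or "1/2") into an exact rational."""
    return Fraction(text)


def to_rational_notation(text):
    """Render a value in the rational style, e.g. "0.5" -> "1/2"."""
    return str(parse_numeric_value(text))
```

One reason to prefer the rational form is that a truncated decimal cannot be mapped back exactly: parse_numeric_value("0.3333") is not equal to parse_numeric_value("1/3").
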
Table 1. Proposed Modified Regex for Unihan
Name               Rec. Regex for Allowable Values
kCheungBauerIndex  /[0-9]{3}\.[0-9]{2}/
kFennIndex         /[1-9][0-9]{0,2}\.[01][0-9]/
kGSR               /[0-9]{4}[a-vx-z]'?/
kHDZRadBreak       /[\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]/
kIRGDaeJaweon      /([0-9]{4}\.[0-9]{2}[01])|(0000\.555)/
kPhonetic          /[1-9][0-9]{0,3}[A-D]?\*?/
kTang              /\*?[A-Za-z\(\)\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+/

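The Table 1 patterns can be tested outside Perl with a small conversion step. The sketch below is in Python, whose re module does not accept \x{...} but does accept \UXXXXXXXX escapes. The helper names are hypothetical, and the sample values are illustrative strings chosen to exercise the patterns, not values taken from the Unihan data.

```python
import re

# A few patterns from Table 1, with the surrounding slashes removed.
TABLE_1 = {
    "kCheungBauerIndex": r"[0-9]{3}\.[0-9]{2}",
    "kGSR": r"[0-9]{4}[a-vx-z]'?",
    "kHDZRadBreak": r"[\x{2F00}-\x{2FD5}]\[U\+2?[0-9A-F]{4}\]:[1-8][0-9]{4}\.[0-9]{2}[012]",
    "kTang": r"\*?[A-Za-z\(\)\x{E6}\x{251}\x{259}\x{25B}\x{300}\x{30C}]+",
}


def perl_to_python(pattern):
    # Rewrite each Perl-style \x{H...} literal as an eight-digit \U escape.
    return re.sub(
        r"\\x\{([0-9A-Fa-f]+)\}",
        lambda m: "\\U%08X" % int(m.group(1), 16),
        pattern,
    )


def is_allowable(prop, value):
    """True if the whole value matches the Table 1 regex for prop."""
    return re.fullmatch(perl_to_python(TABLE_1[prop]), value) is not None
```

For example, is_allowable("kGSR", "0001a") holds, while is_allowable("kGSR", "0001w") does not, because the character class [a-vx-z] excludes w.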
 

Table 2. Proposed Regex and Unlisted values
Abbr     Name                      Rec. Regex for Allowable Values                                   Rec. Value for Unlisted
age      Age                       /([0-9]+\.[0-9]|unassigned)/                                      unassigned (already defined)
nv       Numeric_Value             /-?[0-9]+\.[0-9]+/                                                NaN
blk      Block                     /[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*/                                No_Block
sc       Script                                                                                      #
dm       Decomposition_Mapping     /[\x{0}-\x{10FFFF}]+/
FC_NFKC  FC_NFKC_Closure
cf       Case_Folding              /[\x{0}-\x{10FFFF}]+/
lc       Lowercase_Mapping
tc       Titlecase_Mapping
uc       Uppercase_Mapping
sfc      Simple_Case_Folding       /[\x{0}-\x{10FFFF}]/
slc      Simple_Lowercase_Mapping
stc      Simple_Titlecase_Mapping
suc      Simple_Uppercase_Mapping
bmg      Bidi_Mirroring_Glyph      /[\x{0}-\x{10FFFF}]?/                                             ""
isc      ISO_Comment               /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\)?/
na1      Unicode_1_Name            /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*(\ \((CR|FF|LF|NEL)\))?)?/  <reserved>
na       Name                      /([A-Z0-9]+(([-\ ]|\ -|-\ )[A-Z0-9]+)*|\)?/
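
The draft regexes for age, nv, and blk in Table 2 can be exercised directly. The Python sketch below copies those three patterns (slashes removed) and checks a few values that appear in this document; the helper name is hypothetical.

```python
import re

# Patterns for three rows of Table 2, with the surrounding slashes removed.
TABLE_2 = {
    "age": r"([0-9]+\.[0-9]|unassigned)",
    "nv": r"-?[0-9]+\.[0-9]+",
    "blk": r"[a-zA-Z0-9]+([_\ ][a-zA-Z0-9]+)*",
}


def is_allowable(prop, value):
    """True if the whole value matches the draft Table 2 regex for prop."""
    return re.fullmatch(TABLE_2[prop], value) is not None
```

Under these drafts, "Arabic_Presentation_Forms_A" is an allowable blk value but "Arabic_Presentation_Forms-A" is not, which is consistent with item 8. Likewise nv accepts "0.5" but not "1/2", so the nv regex would need revisiting if the rational format discussed in item 12 is adopted.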