Re: Computing default UCA collation tables

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 20 2003 - 14:40:44 EDT

    After reading the proposed update (version 10) of TR10 (which describes the UCA algorithm), I note that it still contains the same error in section 7.3 (Compatibility decomposition), where it describes how L3 weights are assigned to the decomposed sequence.

    It says (for the example using the NFKD decomposition of U+2475 _(2)_):

    [quote]

    3. Set the first two L3 values to be lookup(L3), where the lookup uses the table in §7.3.1 Tertiary Weight Table. Set the remaining L3 values to MAX (which in the default table is 001F):
    0028 [*023D.0020.0004] % LEFT PARENTHESIS
    0032 [.06C8.0020.001F] % DIGIT TWO
    0029 [*023E.0020.001F] % RIGHT PARENTHESIS

    4. Concatenate the result to produce the sequence of collation elements that the character maps to.
    2475 [*023D.0020.0004] [.06C8.0020.0004] [*023E.0020.0004]

    [/quote]

    The descriptive sentences do not reflect what this "algorithm" actually produces for L3 weights.

    [quote]

    3. Set the first two L3 values to be lookup(L3), where the lookup uses the table in §7.3.1 Tertiary Weight Table. Set the remaining L3 values to MAX (which in the default table is 001F):
    0028 [*023D.0020.0004] % LEFT PARENTHESIS
    0032 [.06C8.0020.0004] % DIGIT TWO
    0029 [*023E.0020.001F] % RIGHT PARENTHESIS

    4. Concatenate the result to produce the sequence of collation elements that the character maps to.
    2475 [*023D.0020.0004] [.06C8.0020.0004] [*023E.0020.001F]

    [/quote]

    It is not clear how many leading collation elements must be given the lookup(L3) weight, or whether a MAX weight must be applied at all...
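
    A minimal Python sketch of the rule as the quoted step 3 words it: the first two collation elements get lookup(L3) and the remaining ones get MAX. The names `assign_tertiary`, `format_elements` and `lead_count` are hypothetical, the starting L3 value 0002 is only a placeholder, and the lookup value 0004 is taken from the quoted example rather than computed from the §7.3.1 table.

```python
from typing import List, Tuple

# (is_variable, L1, L2, L3); a hypothetical in-memory form of one collation element
CollationElement = Tuple[bool, int, int, int]

MAX_L3 = 0x001F  # MAX tertiary weight in the default table, per the quoted text


def assign_tertiary(
    elements: List[CollationElement],
    lookup_l3: int,
    lead_count: int = 2,  # "the first two L3 values", as the quoted step 3 words it
    max_l3: int = MAX_L3,
) -> List[CollationElement]:
    """Give the leading elements lookup(L3) and all remaining ones MAX."""
    result = []
    for index, (variable, l1, l2, _old_l3) in enumerate(elements):
        new_l3 = lookup_l3 if index < lead_count else max_l3
        result.append((variable, l1, l2, new_l3))
    return result


def format_elements(elements: List[CollationElement]) -> str:
    """Render elements in the [*XXXX.XXXX.XXXX] notation used in TR10."""
    return " ".join(
        "[{}{:04X}.{:04X}.{:04X}]".format("*" if variable else ".", l1, l2, l3)
        for variable, l1, l2, l3 in elements
    )


# Elements for 0028, 0032, 0029 with the L1/L2 values from the quoted example.
# The initial L3 of 0x0002 is a placeholder; it is overwritten below.
base = [
    (True, 0x023D, 0x0020, 0x0002),
    (False, 0x06C8, 0x0020, 0x0002),
    (True, 0x023E, 0x0020, 0x0002),
]

print(format_elements(assign_tertiary(base, lookup_l3=0x0004)))
```

    With `lead_count=2` this reproduces the L3 pattern 0004, 0004, 001F. Whether "two" is a fixed count or depends on the length of the decomposition is exactly what the text leaves open, which is why it is a parameter here.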

    In allkeys.txt (3.1.1), it says:
    2475 ; [*027A.0020.0004.2475][.0A0D.0020.0004.2475][*027B.0020.001F.2475] # PARENTHESIZED DIGIT TWO; QQKN
    (This is correct; the L1 keys differ only because the repertoire of characters and scripts has been extended since TR10 v9 was published.)
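
    To check entries like this one mechanically, here is a small Python sketch that parses a single allkeys.txt line in the four-level format shown above and extracts its L3 weights. The function name `parse_allkeys_line` and the regular expression are my own assumptions about the layout, based only on the lines quoted in this message, not on a formal grammar for the file.

```python
import re

# One collation element: [.L1.L2.L3.L4] or [*L1.L2.L3.L4] ('*' marks a variable element).
ELEMENT_RE = re.compile(
    r"\[([.*])([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4,6})\]"
)


def parse_allkeys_line(line):
    """Split an allkeys.txt line into code points, collation elements and comment."""
    # Cut the comment off first, because the comment itself may contain ';'.
    data, _, comment = line.partition("#")
    code_field, _, elements_field = data.partition(";")
    code_points = [int(cp, 16) for cp in code_field.split()]
    elements = [
        (marker == "*", int(l1, 16), int(l2, 16), int(l3, 16), int(l4, 16))
        for marker, l1, l2, l3, l4 in ELEMENT_RE.findall(elements_field)
    ]
    return code_points, elements, comment.strip()


line = (
    "2475 ; [*027A.0020.0004.2475][.0A0D.0020.0004.2475]"
    "[*027B.0020.001F.2475] # PARENTHESIZED DIGIT TWO; QQKN"
)
code_points, elements, comment = parse_allkeys_line(line)
tertiary = [element[3] for element in elements]
print(["%04X" % weight for weight in tertiary])
```

    For the line quoted above, the extracted L3 weights are 0004, 0004, 001F: only the last element carries MAX.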

    Note: I looked in the "FractionalUCA.txt" file provided in ICU (its weight values use different bases but collate the same way):
    2475; [0845, 05, 09][1A94, 05, 09][0847, 05, 3D]
    It correctly maps the lookup(L3) weight for the first two characters, and remaps the L3 key only for the last decomposed collation element.

    So even the proposed update contains this error.

    I also note that the allkeys.txt file contains a comment field starting with "QQ", whose meaning is not completely clear. It appears when there is a canonical or compatibility decomposition (QQC or QQK), but there is often a fourth character added whose meaning is not clear either (here QQKN).

    Was it added for debugging purposes, to show which type of "compatible" decomposition is performed, such as the extra decomposition used for the German small sharp s (which is a compatible ligature of a long-form lowercase s and a "standard" lowercase s)?

    When I look in "allkeys.txt" I can find two similar ligatures for "st" (in fact a ligature between the standard-form s or long-form s and t) which are:
    FB05 ; [.0BA7.0020.0004.FB05][.0000.0154.0004.FB05][.0BBF.0020.001F.FB05] # LATIN SMALL LIGATURE LONG S T; QQKN
    FB06 ; [.0BA7.0020.0004.FB06][.0BBF.0020.0004.FB06] # LATIN SMALL LIGATURE ST; QQKN

    Both have a compatibility (NFKD) decomposition into long-s+t or s+t that can be found in the UCD:
    FB05;LATIN SMALL LIGATURE LONG S T;Ll;0;L;<compat> 017F 0074;;;;N;;;;;
    FB06;LATIN SMALL LIGATURE ST;Ll;0;L;<compat> 0073 0074;;;;N;;;;;

    Here the entry for FB06 does not use any L3=MAX weight (the entry for FB05 does, on its last element), even though this is a compatibility decomposition and not strictly a canonical one.

    Now look at the German sharp-s:

    00DF ; [.0BA7.0020.0004.00DF][.0000.0153.0004.00DF][.0BA7.0020.001F.00DF] # LATIN SMALL LETTER SHARP S; QQKN

    The first two collation elements correspond to the compatible (but not NFKD) decomposition of sharp-s into long-s+s, where sharp-s is given a (not NFKD) decomposition too:
    017F ; [.0BA7.0020.0004.017F][.0000.0154.0004.017F] # LATIN SMALL LETTER LONG S; QQKN
    0053 ; [.0BA7.0020.0008.0053] # LATIN CAPITAL LETTER S

    But it seems that the secondary weight was modified too: L2=0154 appears to mean simply "second variant form" and L2=0153 "first variant form", with various usages across distinct scripts (for example, Sindhi variants in Arabic). Clearly, there is some manually added tailoring rule here, whose role is not clear, together with a dynamic allocation of letter variants in L2 from a base that is reset for each L1 weight.

    In the main UCD file, the sharp-s ligature is not decomposed, not even with NFKD:
    00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
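
    The decomposition fields quoted from the UCD can be cross-checked with Python's standard `unicodedata` module, as in the sketch below. Note the assumption: the module reflects whatever Unicode version ships with the interpreter, not necessarily the 3.1.1 data discussed here, although these particular mappings should not differ. The helper name `ucd_decomposition` is hypothetical.

```python
import unicodedata


def ucd_decomposition(code_point):
    """Return the raw UCD decomposition field for a code point ('' if none)."""
    return unicodedata.decomposition(chr(code_point))


# The characters discussed in this message.
for cp in (0x2475, 0xFB05, 0xFB06, 0x00DF):
    name = unicodedata.name(chr(cp))
    mapping = ucd_decomposition(cp) or "(no decomposition)"
    print("%04X %s: %s" % (cp, name, mapping))
```

    The two ligatures and U+2475 report a `<compat>` mapping, while U+00DF reports none, so the three collation elements given to sharp-s in allkeys.txt cannot come from the UCD decomposition data alone.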

    It's a shame that UCD did not specify a <compat> decomposition for the German sharp-s...

    Where can we find those extra decompositions that are clearly used in UCA? Should there be documentation for this new type of UCA decomposition of Unicode strings into (non-portable) Unicode strings that may use supplementary private characters (whose "private" semantics would still be normative and tied exclusively to their use in UCA)?


