Re: Computing default UCA collation tables

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 20 2003 - 18:51:22 EDT


    Philippe Verdy stated:

    > After reading the proposed update (version 10) of TR10
    > (which describes the UCA algorithm), I note that it still
    > contains the same error in section 7.3 (Compatibility decomposition)
    > when describing the way L3 weights are assigned to the decomposed sequence.

    Noted for correction.

    > I also note that the allkeys.txt file contains a comment field
    > starting with "QQ", whose meaning is not completely clear.
    > It appears when there are canonical or
    > compatibility decompositions (QQC or QQK), but there is often
    > a fourth character added whose meaning is not clear (here QQKN).

    The QQKN entries are compatibility decompositions involving expansions.
    The QQKM entries are compatibility decompositions with combining
    sequences involved (just a few Arabic positional forms with harakat).
       
    >
    > Was it for debugging purposes, to see which type of "compatible"
    > decomposition is performed,

    Yes, although it is for diagnostics, rather than debugging, per se.
    The "QQ" is an otherwise unused letter sequence that makes it easy
    to grep through allkeys.txt to collect together the weighted keys
    of various types to look for consistency issues.
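
    As an illustration of that kind of diagnostic pass, here is a minimal
    Python sketch that groups entries by their QQ tag. It assumes each
    entry sits on one line in the form "code ; elements # NAME; QQxx"
    (the lines quoted in this thread are wrapped by the mailer). The
    function name group_by_qq_tag is hypothetical and is not part of any
    UCA tooling.

```python
import re
from collections import defaultdict

# Matches a diagnostic tag such as QQC, QQK, QQKN or QQKM in the comment field.
TAG_RE = re.compile(r"\bQQ[A-Z]*\b")


def group_by_qq_tag(lines):
    """Map each QQ tag to the code points whose comment carries it."""
    groups = defaultdict(list)
    for line in lines:
        body, sep, comment = line.partition("#")
        if not sep:
            continue  # no comment field on this line
        match = TAG_RE.search(comment)
        if match is None:
            continue  # entry without a QQ tag
        code_point = body.split(";", 1)[0].strip()
        groups[match.group(0)].append(code_point)
    return dict(groups)


# Sample entries copied from this thread, joined onto single lines.
sample = [
    "FB05 ; [.0BA7.0020.0004.FB05][.0000.0154.0004.FB05][.0BBF.0020.001F.FB05]"
    " # LATIN SMALL LIGATURE LONG S T; QQKN",
    "FB06 ; [.0BA7.0020.0004.FB06][.0BBF.0020.0004.FB06]"
    " # LATIN SMALL LIGATURE ST; QQKN",
    "0053 ; [.0BA7.0020.0008.0053] # LATIN CAPITAL LETTER S",
]

if __name__ == "__main__":
    for tag, code_points in group_by_qq_tag(sample).items():
        print(tag, code_points)
```

    A plain grep for "QQKN" does the same job for a single tag; the
    grouping only helps when comparing several tags at once.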
     
    >
    > When I look in "allkeys.txt" I can find two similar ligatures
    > for "st" (in fact a ligature between the standard-form s or
    > long-form s and t) which are:
    > FB05 ; [.0BA7.0020.0004.FB05][.0000.0154.0004.FB05][.0BBF.0020.001F.FB05]
    > # LATIN SMALL LIGATURE LONG S T; QQKN
    > FB06 ; [.0BA7.0020.0004.FB06][.0BBF.0020.0004.FB06]
    > # LATIN SMALL LIGATURE ST; QQKN
    >
    > Both give a compatible (NFKD) decomposition into long-s+t or s+t
    > that can be found in UCD:
    > FB05;LATIN SMALL LIGATURE LONG S T;Ll;0;L;<compat> 017F 0074;;;;N;;;;;
    > FB06;LATIN SMALL LIGATURE ST;Ll;0;L;<compat> 0073 0074;;;;N;;;;;
    >
    > Here the decomposition does not use any L3=MAX weight, even though
    > it is a compatibility decomposition, and not strictly a
    > canonical decomposition.

    Input source of relevance:

    017F;LATIN SMALL LETTER LONG S;Ll;<sort> 0073 F8F1;;;0053;;0053

    >
    > Now look at the German sharp-s:
    >
    > 00DF ; [.0BA7.0020.0004.00DF][.0000.0153.0004.00DF][.0BA7.0020.001F.00DF]
    > # LATIN SMALL LETTER SHARP S; QQKN
    >
    > The first two collation elements correspond to the compatible
    > (but not NFKD) decomposition of sharp-s into long-s+s, where long-s
    > is given a (not NFKD) decomposition too:
    > 017F ; [.0BA7.0020.0004.017F][.0000.0154.0004.017F]
    > # LATIN SMALL LETTER LONG S; QQKN
    > 0053 ; [.0BA7.0020.0008.0053] # LATIN CAPITAL LETTER S

    Input source of relevance:

    00DF;LATIN SMALL LETTER SHARP S;Ll;<sort> 0073 F8F0 0073;;German;;;

    >
    > But it seems that the secondary weight was modified too, and L2=0154
    > just means "second variant form", and L2=0153 means "first variant
    > form", with various usages between distinct scripts (for example
    > Sindhi variants in Arabic). Clearly, there's some manually added
    > tailoring rule here, whose role is not clear,

    See above input to the sifter, which explains all.

    > and dynamic allocation
    > of letter variants in L2 from a base which is reset for each L1 weight.

    Not dynamic. Just level two weights assigned to U+F8F0 and U+F8F1
    to create the results specified by the committees.
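
    To make the mechanism concrete, here is a rough Python sketch (not
    the actual sifter) that expands the decompositions cited above into
    level 1 and level 2 weight pairs. The weights in BASE_WEIGHTS are
    read off the allkeys.txt lines quoted in this thread; the table
    names and the expand function are hypothetical. Level 3 weights are
    left out, since the rule for assigning them is not spelled out here.

```python
# Level 1 and level 2 weights, taken from the entries quoted in this thread.
BASE_WEIGHTS = {
    "0073": ("0BA7", "0020"),  # LATIN SMALL LETTER S
    "0074": ("0BBF", "0020"),  # LATIN SMALL LETTER T
    "F8F0": ("0000", "0153"),  # private-use marker, level 2 weight only
    "F8F1": ("0000", "0154"),  # private-use marker, level 2 weight only
}

# <sort> decompositions from the sifter input, plus the two <compat>
# decompositions from the UCD that were quoted for the ligatures.
DECOMPOSITIONS = {
    "017F": ["0073", "F8F1"],          # long s
    "00DF": ["0073", "F8F0", "0073"],  # sharp s
    "FB05": ["017F", "0074"],          # ligature long s t
    "FB06": ["0073", "0074"],          # ligature st
}


def expand(code_point):
    """Return the (level 1, level 2) pairs for a code point, recursing
    through decompositions until only directly weighted items remain."""
    if code_point in BASE_WEIGHTS:
        return [BASE_WEIGHTS[code_point]]
    pairs = []
    for part in DECOMPOSITIONS[code_point]:
        pairs.extend(expand(part))
    return pairs


if __name__ == "__main__":
    for cp in ("017F", "00DF", "FB05", "FB06"):
        print(cp, expand(cp))
```

    The pairs produced for 017F, 00DF, FB05 and FB06 agree with the
    first two fields of the collation elements quoted earlier in this
    message.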

    >
    > In the main UCD file, the sharp-s ligature is not decomposed, not
    > even with NFKD:
    > 00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
    >
    > It's a shame that UCD did not specify a <compat> decomposition
    > for the German sharp-s...

    Well, perhaps. But that is neither here nor there for the
    collation table.

    >
    > Where can we find those extra decompositions that are clearly
    > used in UCA?

    In the input source file I have been citing.

    > Should there be documentation for the new type of
    > UCA decomposition of Unicode strings into (non-portable) Unicode
    > strings that may use supplementary private characters (whose
    > "private" semantics would still be normative and related to the
    > exclusive use in UCA)?

    Why?

    --Ken


