Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 13:04:55 EDT


    From: "Mark Davis" <mark.davis@jtcsv.com>
    To: "Jungshik Shin" <jshin@mailaps.org>; "Unicode Mailing List" <unicode@unicode.org>
    Sent: Saturday, May 17, 2003 4:36 AM
    Subject: Re: Proposed Update of UTS #10: Unicode Collation Algorithm

    >
    > > To take the same example as I took in my previous email, I don't see
    > > how S1,S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 <
    > S2)
    > > without contracting the sequence of 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
    > > U+1163 (ㅑ:HANGUL JUNGSEONG YA)'?
    > >
    > > S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG
    > O)
    > > U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    > > S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG
    > WA)
    > > U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    > > S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG
    > O)
    > > U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG
    > KIYEOK)
    > >
    > > where the primary weights of each Jamo are given as following,
    > >
    > > U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
    > > U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
    > > U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
    > > U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
    > > U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
    > > U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101
    >
    > Remember, the weights have to be changed so that: T < V < L, so I'll
    > add 3000 to Ls, 2000 to Vs and 1000 to Ts
    >
    > S1 => 3301; 2251; 1101; TERM
    > S2 => 3301; 2255; 1101; TERM
    > S3 => 3301; 2251; 1231; 1101; TERM

    Correction on the third character:
    S3 => 3301; 2251; 2231; 1101; TERM
    Note that even with this correction, a weight-by-weight comparison still yields S1 < S3 < S2, since S2 and S3 first differ at their second weight (2255 vs. 2251); a contraction is still needed to obtain S1 < S2 < S3.
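    The adjusted keys quoted above can be reproduced with a small table lookup. A minimal sketch (my own illustration, using only the weight values and offsets quoted in this thread; the TERM marker is modeled as a trailing 0 so that a prefix sorts before any of its extensions):

```python
# Primary weights from Jungshik's example, keyed by code point.
PRIMARY = {
    0x1100: 301,  # HANGUL CHOSEONG KIYEOK (L)
    0x1163: 231,  # HANGUL JUNGSEONG YA (V)
    0x1169: 251,  # HANGUL JUNGSEONG O (V)
    0x116A: 255,  # HANGUL JUNGSEONG WA (V)
    0x11A8: 101,  # HANGUL JONGSEONG KIYEOK (T)
}

# Mark's adjustment: add 3000 to L, 2000 to V, 1000 to T, so that T < V < L.
OFFSET = {'L': 3000, 'V': 2000, 'T': 1000}

def jamo_type(cp):
    # Conjoining-jamo ranges: L = U+1100..115F, V = U+1160..11A7, T = U+11A8..11FF.
    if 0x1100 <= cp <= 0x115F:
        return 'L'
    if 0x1160 <= cp <= 0x11A7:
        return 'V'
    return 'T'

def sort_key(cps):
    return tuple(PRIMARY[cp] + OFFSET[jamo_type(cp)] for cp in cps) + (0,)  # (0,) = TERM

S1 = [0x1100, 0x1169, 0x11A8]
S2 = [0x1100, 0x116A, 0x11A8]
S3 = [0x1100, 0x1169, 0x1163, 0x11A8]

assert sort_key(S1) == (3301, 2251, 1101, 0)
assert sort_key(S2) == (3301, 2255, 1101, 0)
assert sort_key(S3) == (3301, 2251, 2231, 1101, 0)
assert sort_key(S1) < sort_key(S2) and sort_key(S1) < sort_key(S3)
```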

    This discussion goes too far. One could reliably consider that all L characters are base characters, ignoring the fact that they can combine with each other.

    Then all V characters collate exactly like combining diacritics on their base character (exactly like an ACUTE accent on a Latin letter); decomposable characters (like LV syllables) can collate and decompose exactly like an <E-ACUTE>, which collates as the decomposed <E, ACUTE> with a tertiary weight for the accent.

    Then all T characters collate at the quaternary level (as if they were minor diacritics applying to a base primary L character or a base secondary V character).

    In that case, the whole UCA table can be built by decomposing, in one step, all code points into base characters and combining diacritics, each one given a single weight in ranges chosen so that primary weights are always higher than secondary ones, which are themselves always higher than tertiary ones, and likewise for quaternary weights.

    The way these exclusive ranges of weights are allocated to each collation level can be highly optimized: for example, primary weights can be given all the positive values, and the negative space can be subdivided by collation level so that simple bit operations on the high bits yield the level. (The positive space can also be divided by letter-case property.)
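    One such layout can be sketched as follows (my own illustration, not a scheme defined by UTS #10): primary weights occupy the positive integers, and lower-level weights are stored as negative numbers whose high bits encode the level, so a shift recovers the level without any table:

```python
LEVEL_SHIFT = 28  # high bits of the negative space carry the level

def pack(level, value):
    # level: 1 = primary .. 4 = quaternary.
    # Primary weights stay positive; lower levels are stored negated,
    # with the level in the top bits and the weight in the low bits.
    if level == 1:
        return value
    return -((level << LEVEL_SHIFT) | value)

def level_of(w):
    # Recover the collation level from the packed weight alone.
    if w > 0:
        return 1
    return (-w) >> LEVEL_SHIFT

assert level_of(pack(1, 0x301)) == 1
assert level_of(pack(2, 0x251)) == 2
assert level_of(pack(4, 0x101)) == 4
```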

    So the implementation mostly resembles an NFKD decomposition table from code points to small "collation points", from which a simple lookup provides their weight. The "collation points" can consist mostly of standard Unicode code points, plus added non-character code points allocated for the collation algorithm in the private-use space, or better in the "internal process" space for non-characters. (Internally they can also be allocated in reserved, currently unallocated planes, as long as input strings are checked so that any future occurrence of characters outside the BMP, SMP, SIP, SIP2, and private planes is given a standard default collation without a lookup in the extended decompositions.)

    So this collation decomposition table can benefit from the existing NFD and NFKD decompositions, just providing additional decompositions from which collation weights can be looked up, with collation levels deduced from the weights. I've tried this strategy: the very large "allkeys.txt" file provided in the UCA reference compresses a lot, the decomposition engine written for NFD and NFKD is reused, and the NFD, NFKD, and collation decompositions can be merged into a single decomposition table.
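    The idea can be illustrated with a hedged sketch in Python, using the standard `unicodedata` module for NFKD and a hypothetical `EXTRA_DECOMP` table for the additional, collation-only decompositions (the Æ entry below is an invented tailoring, not an NFKD mapping):

```python
import unicodedata

# Hypothetical extra "collation decompositions" layered on top of NFKD,
# mapping code points the normalizer leaves alone to sequences of
# "collation points" (plain code points stand in for them here).
EXTRA_DECOMP = {
    0x00C6: [0x0041, 0x0045],  # Æ -> A, E (illustrative tailoring only)
}

def collation_decompose(s):
    # First apply the standard NFKD decomposition, then the extra table.
    out = []
    for ch in unicodedata.normalize('NFKD', s):
        out.extend(EXTRA_DECOMP.get(ord(ch), [ord(ch)]))
    return out

# 'é' decomposes via NFKD to <e, combining acute>; 'Æ' via the extra table.
assert collation_decompose('\u00E9') == [0x0065, 0x0301]
assert collation_decompose('\u00C6') == [0x0041, 0x0045]
```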

    With this strategy we of course get "larger" collation keys, but this is artificial, for one reason:
    my approach generates, from table lookups, collation keys that only take the following equivalent multilevel formats:
    [0.0.0.U] or [0.0.T.U] or [0.S.0.U] or [P.0.0.U], where for all (T,S,P) we have 0 < T < S < P, and U is the implicit initial Unicode code point of basic characters that are equal to their NFC and NFD forms (so it does not need to be stored in tables).

    A sequence of fully decomposed collation keys like [0.0.0.U1][0.0.T.U2][0.S.0.U3][P.0.0.U4] is equivalent to the single multilevel collation key [P.S.T.U1]; this is the form of most "composed" collation keys that the "allkeys.txt" file in the UCA reference contains.
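    The composition rule above can be sketched as follows (a toy illustration; it assumes, as stated, that each fully decomposed key carries at most one nonzero level, and that the U values other than U1 are discarded in the composed key):

```python
def compose(keys):
    # keys: list of (P, S, T, U) tuples, each with at most one nonzero level.
    # The composed key takes the nonzero P, S, T contributed by the run
    # (zero when absent) and the code point U1 of the first, base key.
    p = max(k[0] for k in keys)
    s = max(k[1] for k in keys)
    t = max(k[2] for k in keys)
    return (p, s, t, keys[0][3])

# [0.0.0.U1][0.0.T.U2][0.S.0.U3][P.0.0.U4] -> [P.S.T.U1]
run = [(0, 0, 0, 0x65), (0, 0, 7, 0x301), (0, 5, 0, 0x302), (9, 0, 0, 0x41)]
assert compose(run) == (9, 5, 7, 0x65)
```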

    -- Philippe.



    This archive was generated by hypermail 2.1.5 : Sat May 17 2003 - 13:51:01 EDT