Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon May 19 2003 - 18:52:24 EDT

    My apologies; I jotted off the note quickly, and didn't read your
    response carefully enough (and then I was out of town and couldn't
    address this right away).

    > To take the same example as I took in my previous email, I don't see
    > how S1, S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
    > without contracting the sequence of 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
    > U+1163 (ㅑ:HANGUL JUNGSEONG YA)'?
    >
    > S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
    > U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    > S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
    > U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    > S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
    > U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)

    Let me recap. As I said, we have strategy (a):

    >> a. decompose them.
    >>
    >> U+111A(ᄚ HANGUL CHOSEONG RIEUL-HIEUH) => X Y
    >>
    ...
    >> Yes, (a) my preference as well, as I stated. It is more flexible,
    >> since it works for any repertoire. It may or may not yield longer sort
    >> keys, depending on whether the sort keys are compressed or not (as in
    >> ICU). The issue is that a small set of characters will compress
    >> better, even if the starting weight sequences are longer.

    So let's look at your example, where strategy (a) is applied. There is
    no need for subcluster terminators. The characters have the following
    weights (this is where I blundered, since you had already preweighted
    them):

      U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
      U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
      U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
      U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
      U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
      U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101

    The goal is: S1 < S2 < S3

    Applying strategy (a), if the weight for WA is expanded (as per the
    old compat mappings in UnicodeData-2.0.txt, where 116A => 1169 1161),
    then we get:

      U+116A (ㅘ:HANGUL JUNGSEONG WA) : 251, 201*

    [*Now, one may want the character to expand to a sequence that differs
    from O + A at the primary, secondary, or tertiary level, but for now
    I'll just assume that identical weights are ok.]

    You then get the following sort keys, which compare in the desired
    order (the third weight decides: 101 < 201 < 231):

    S1: 301, 251, 101, TERM
    S2: 301, 251, 201, 101, TERM
    S3: 301, 251, 231, 101, TERM
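
    The comparison above can be checked mechanically. The sketch below is
    only an illustration: it uses a single weight level, takes the weights
    from the table in this message, and assumes TERM is a value lower than
    any real weight (the message does not give one). The names WEIGHTS,
    EXPANDED, sort_key and order are made up for the example.

```python
# Single-level weights taken from the table in the message.
WEIGHTS = {
    0x1100: [301],  # HANGUL CHOSEONG KIYEOK
    0x1161: [201],  # HANGUL JUNGSEONG A
    0x1163: [231],  # HANGUL JUNGSEONG YA
    0x1169: [251],  # HANGUL JUNGSEONG O
    0x116A: [255],  # HANGUL JUNGSEONG WA, weighted as a unit
    0x11A8: [101],  # HANGUL JONGSEONG KIYEOK
}

# Strategy (a): WA expands to the weights of O followed by A.
EXPANDED = {**WEIGHTS, 0x116A: [251, 201]}

TERM = 0  # assumption: the terminator sorts below every real weight

S1 = [0x1100, 0x1169, 0x11A8]
S2 = [0x1100, 0x116A, 0x11A8]
S3 = [0x1100, 0x1169, 0x1163, 0x11A8]


def sort_key(codepoints, table):
    """Concatenate the weights of each code point, then append TERM."""
    key = []
    for cp in codepoints:
        key.extend(table[cp])
    key.append(TERM)
    return key


def order(table):
    """Return the names S1..S3 sorted by their sort keys under `table`."""
    named = {"S1": S1, "S2": S2, "S3": S3}
    return sorted(named, key=lambda name: sort_key(named[name], table))


print(order(WEIGHTS))   # WA weighted as a single unit
print(order(EXPANDED))  # WA expanded to O + A
```

    With WA weighted as a unit, S3 sorts before S2, which is the problem
    raised in the quoted message; with the expansion, the order is
    S1 < S2 < S3.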

    In many circumstances one has the option of expanding one character
    (in collation weights) or contracting other characters. We have to
    look at the combinatorics to see which is better.

    What I think did not come through in my previous messages is that the
    only difference between (a) and (b) is in their treatment of L
    sequences: both expand weights for V's and T's.

    The downside of (b) is that one has to have a known repertoire of L
    sequences, those that figure into contractions.

    Mark

    P.S. My email at this address is not working well, so I can't read
    some of the recent messages and may not get responses to this right
    away.


