Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri May 16 2003 - 21:24:19 EDT

    On Fri, 16 May 2003, Mark Davis wrote:

    > 1. For the "precomposed" jamos, there are two solutions.
    >
    > Suppose we have:
    >
    > U+1105 (ᄅ: HANGUL CHOSEONG RIEUL) => X
    > U+1112 (ᄒ: HANGUL CHOSEONG HIEUH) => Y
    >
    > a. decompose them.
    >
    > U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X Y

    > b. interleave them and treat their constituent sequences as
    > contractions.
    >
    > U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X'
    > U+1105 (ᄅ: HANGUL CHOSEONG RIEUL), U+1112 (ᄒ: HANGUL CHOSEONG HIEUH)
    > => X'

    > > What I don't like is the inflexibility of having to collect all the
    > > known occurrences of cluster Jamos and giving each of them the
    > > primary weight in such a way (interleaving) that they can get
    > > collated the way expected by (South) Koreans

    > Yes, (a) is my preference as well, as I stated. It is more flexible,
    > since it works for any repertoire. It may or may not yield longer sort

      I'm pleased to know that we agree on this point.
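
    The two options quoted above can be sketched as two weight tables.
    This is only an illustration: the numeric values chosen for X, Y and
    X' are hypothetical placeholders, and primary_key() is a made-up
    helper, not part of UCA or any library.

```python
# Minimal sketch of options (a) and (b). All weights are hypothetical.
X, Y = 0x310, 0x320      # stand-ins for the primary weights X and Y
X_PRIME = 0x315          # stand-in for X' (assumed value, option b only)

RIEUL, HIEUH, RIEUL_HIEUH = "\u1105", "\u1112", "\u111A"

# (a) decompose: the cluster jamo expands to the weights of its parts.
EXPANSIONS = {RIEUL: [X], HIEUH: [Y], RIEUL_HIEUH: [X, Y]}

# (b) interleave: the cluster jamo gets its own weight, and the
# equivalent two-jamo sequence is listed as a contraction.
CONTRACTIONS = {
    RIEUL: [X],
    HIEUH: [Y],
    RIEUL_HIEUH: [X_PRIME],
    RIEUL + HIEUH: [X_PRIME],
}


def primary_key(text, table):
    """Build a primary-weight key using longest-match table lookup."""
    max_len = max(len(entry) for entry in table)
    key, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return key


# Under either table, the cluster jamo and the two-jamo sequence agree.
print(primary_key(RIEUL_HIEUH, EXPANSIONS) == primary_key(RIEUL + HIEUH, EXPANSIONS))
print(primary_key(RIEUL_HIEUH, CONTRACTIONS) == primary_key(RIEUL + HIEUH, CONTRACTIONS))
```

    With (a) only the cluster jamos need table entries of their own, while
    (b) also needs one contraction entry per known sequence, which is the
    inflexibility discussed above.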

    > > With T < V < L, why would we need to terminate L+, V+ and T+
    > > separately
    > > instead of just 'L+V+T*' as a whole? Or am I missing something
    > > obvious?
    >
    > 2. No, that is the purpose of the ordering of those weights. For the

      OK. Thank you for the clarification. What you wrote in your previous
    message ('Terminating each of the subclusters would') made me think you
    thought otherwise.

    > 3. You don't need to terminate the whole syllable, IF you are willing
    > to have all sequences of L's contract (as in 1b). What you do there is
    > contract all Ls, and then weight L <... TRAILING_WEIGHT < T < V. In
    ....

       As I mentioned, I don't like the inflexibility of having to contract
    the known sequences (1b). Therefore, my strong preference is 1a even if
    that requires us to terminate syllables.

    > You don't have to contract sequences of T's or V's, since those are
    > taken care of by the T < V weighting. E.g.
    >
    > L V1 X
    > L V1 T
    > L V1 V

     If we take 1a, this is not an issue. This is only relevant if we take 1b.
    Let me try one final time because it seems like I once again failed
    to convey what I meant. I'm afraid your answer didn't address the
    following issue. My concern is not 'L V1 X' vs 'L V1 T' vs 'L V1 V2',
    because that case is already well taken care of, as you wrote.

    > > mentioned only about sequences of L's. For instance, suppose that
    > > we have three sequences LV1, LV2, and LV1V4 where V2 is a cluster of
    > > V1 and V3 and that the desired collation among them is LV1 < LV2 (=
    > > LV1V3) < LV1V4. Without contracting V1V4 and giving it an
    > > independent primary weight larger than that of V2, they'd be sorted
    > > LV1 < LV1V4 < LV2, instead.

    To reuse the example from my previous email, I don't see
    how S1, S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
    without contracting the sequence 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
    U+1163 (ㅑ:HANGUL JUNGSEONG YA)'.

      S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
          U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
      S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
          U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
      S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
          U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)

    where the primary weights of each Jamo are given as follows:

      U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
      U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
      U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
      U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
      U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
      U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101
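
    The effect can be checked with a small sketch that uses the weights
    listed above. The helper functions and the contraction weight 256 are
    assumptions made for illustration only; any weight larger than 255
    (WA) and still within the vowel range would do.

```python
# Sketch: sort S1, S2, S3 by primary weight, with and without
# contracting O + YA. Helper names and the weight 256 are hypothetical.
KIYEOK_L, A, YA, O, WA, KIYEOK_T = (
    "\u1100", "\u1161", "\u1163", "\u1169", "\u116A", "\u11A8")

WEIGHTS = {KIYEOK_L: [301], A: [201], YA: [231],
           O: [251], WA: [255], KIYEOK_T: [101]}

STRINGS = {
    "S1": KIYEOK_L + O + KIYEOK_T,
    "S2": KIYEOK_L + WA + KIYEOK_T,
    "S3": KIYEOK_L + O + YA + KIYEOK_T,
}


def primary_key(text, table):
    """Build a primary-weight key using longest-match table lookup."""
    max_len = max(len(entry) for entry in table)
    key, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return key


def order(table):
    """Return the names S1..S3 sorted by their primary keys."""
    return sorted(STRINGS, key=lambda name: primary_key(STRINGS[name], table))


# No contraction: S3 = [301, 251, 231, 101] sorts before S2 = [301, 255, 101].
print(order(WEIGHTS))            # ['S1', 'S3', 'S2']

# Contract O + YA with a weight above WA: S3 = [301, 256, 101].
with_contraction = dict(WEIGHTS)
with_contraction[O + YA] = [256]
print(order(with_contraction))   # ['S1', 'S2', 'S3']
```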

    > > enumerating all equivalent sequences but just giving primary weights
    > > to only 'basic' Jamos and requiring a preprocessing in which cluster
    > > jamos are decomposed into sequences of basic Jamos.
    >
    > Preprocessing (on a string basis) is *deadly* for performance. It is
    > also not necessary. The weight tables already allow characters to
    > expand, that is what would be done in this case: it is just 1a above.

       I see your point. I didn't pay attention to expansion.
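
    A minimal sketch of that expansion approach, applied to the S1/S2/S3
    example: the table entry for U+116A (WA) expands to the weights of O
    and A, so no separate pass over the string is needed. Treating WA as
    the cluster O + A, the reuse of the weights from this message, and the
    helper names are assumptions for illustration; syllable termination is
    not modelled.

```python
# Sketch of option 1a: cluster jamos expand inside the weight table.
# Weights are the ones used in this message; helper names are hypothetical.
KIYEOK_L, A, YA, O, WA, KIYEOK_T = (
    "\u1100", "\u1161", "\u1163", "\u1169", "\u116A", "\u11A8")

EXPANDING_WEIGHTS = {
    KIYEOK_L: [301],
    A: [201],
    YA: [231],
    O: [251],
    WA: [251, 201],   # assumed expansion: WA -> O + A
    KIYEOK_T: [101],
}

STRINGS = {
    "S1": KIYEOK_L + O + KIYEOK_T,
    "S2": KIYEOK_L + WA + KIYEOK_T,
    "S3": KIYEOK_L + O + YA + KIYEOK_T,
}


def primary_key(text):
    """One table lookup per character; expansion happens in the table."""
    key = []
    for ch in text:
        key.extend(EXPANDING_WEIGHTS[ch])
    return key


for name in sorted(STRINGS, key=lambda n: primary_key(STRINGS[n])):
    print(name, primary_key(STRINGS[name]))
# S1 [301, 251, 101]
# S2 [301, 251, 201, 101]
# S3 [301, 251, 231, 101]
```

    The desired order S1 < S2 < S3 falls out here because the trailing
    consonant weight (101) is below every vowel weight, and A (201) is
    below YA (231); no contraction entry is required.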

       Jungshik


