Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri May 16 2003 - 19:20:55 EDT

    The following message goes in lots of different directions, so let me
    try to summarize here.

    1. For the "precomposed" jamos, there are two solutions.

    Suppose we have:

       U+1105 (ᄅ: HANGUL CHOSEONG RIEUL) => X
       U+1112 (ᄒ: HANGUL CHOSEONG HIEUH) => Y

    a. decompose them.

       U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X Y

    b. interleave them and treat their constituent sequences as
    contractions.

       U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X'
       U+1105 (ᄅ: HANGUL CHOSEONG RIEUL), U+1112 (ᄒ: HANGUL CHOSEONG HIEUH) => X'

    For b, you need to do both steps, so that you get the same results
    either way.
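
    A minimal sketch of the two options, added for illustration only. The
    numeric weights and the helper name primary_key are assumptions, not
    values from UTS #10 or from any real weight table. It shows that under
    either table the precomposed jamo and its two-jamo spelling get the
    same primary key.

```python
# Hypothetical primary weights for leading consonants; illustrative values only.
RIEUL, HIEUH, RIEUL_HIEUH = "\u1105", "\u1112", "\u111A"

# (a) Decomposition: the cluster expands to the weights of its parts.
TABLE_A = {
    RIEUL: [310],
    HIEUH: [350],
    RIEUL_HIEUH: [310, 350],
}

# (b) Interleaving: the cluster gets its own weight, and the equivalent
# two-jamo sequence is entered as a contraction with that same weight.
TABLE_B = {
    RIEUL: [310],
    HIEUH: [350],
    RIEUL_HIEUH: [315],
    RIEUL + HIEUH: [315],
}


def primary_key(text, table):
    """Build a primary key, trying a two-character contraction before a single character."""
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in table:
            key += table[pair]
            i += 2
        else:
            key += table[text[i]]
            i += 1
    return key


for table in (TABLE_A, TABLE_B):
    print(primary_key(RIEUL_HIEUH, table), primary_key(RIEUL + HIEUH, table))
```

    In table (b), dropping the contraction entry would make the two
    spellings sort differently, which is why both steps are needed.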

    > What I don't like is the inflexibility of having to collect all the
    > known occurrences of cluster Jamos and giving each of them the primary
    > weight in such a way (interleaving) that they can get collated the way
    > expected by (South) Koreans

    Yes, (a) is my preference as well, as I stated. It is more flexible,
    since it works for any repertoire. It may or may not yield longer sort
    keys, depending on whether the sort keys are compressed or not (as in
    ICU). The issue is that a small set of characters will compress
    better, even if the starting weight sequences are longer.

    > With T < V < L, why would we need to terminate L+, V+ and T+ separately
    > instead of just 'L+V+T*' as a whole? Or am I missing something obvious?

    2. No, that is the purpose of the ordering of those weights. For the
    syllable weighting, if you order T < V < L, you don't have to
    terminate a sequence of T's or a sequence of V's or a sequence of L's,
    just the whole syllable.
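
    As an illustration (not part of the original message), here is a small
    sketch of why one terminator per syllable is enough when T < V < L. The
    weights, the terminator value and the jamo labels are all hypothetical;
    the only properties that matter are terminator < T < V < L.

```python
# Hypothetical weights: terminator < T (1xx) < V (2xx) < L (3xx).
TERM = 50
W = {"L_KIYEOK": 301, "L_NIEUN": 302, "V_A": 201, "V_O": 251, "T_KIYEOK": 101}


def primary_key(syllables, terminate=True):
    """Concatenate jamo weights, optionally appending one terminator per syllable."""
    key = []
    for syllable in syllables:
        key += [W[jamo] for jamo in syllable]
        if terminate:
            key.append(TERM)
    return key


GO_NA = (("L_KIYEOK", "V_O"), ("L_NIEUN", "V_A"))  # L V | L V
GOK = (("L_KIYEOK", "V_O", "T_KIYEOK"),)           # L V T
GWA = (("L_KIYEOK", "V_O", "V_A"),)                # L V V (WA written as O + A)
WORDS = [GWA, GOK, GO_NA]

with_term = sorted(WORDS, key=primary_key)
without_term = sorted(WORDS, key=lambda w: primary_key(w, terminate=False))
print(with_term == [GO_NA, GOK, GWA])     # syllable-by-syllable order
print(without_term == [GOK, GWA, GO_NA])  # next syllable's L outranks T and V
```

    With the terminator, the shorter syllable wins regardless of what
    follows it; no separate terminator after the L's, V's or T's is needed.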

    3. You don't need to terminate the whole syllable, IF you are willing
    to have all sequences of L's contract (as in 1b). What you do there is
    contract all Ls, and then weight L <... TRAILING_WEIGHT < T < V. In
    that case, all syllables are of the form L V+ T*. For any sequences of
    V and T, the ordering works. Any L or non-Jamo will sort less than T
    and V, so these sort in the order shown:

    L V1 X
    L V1 V

    for any L, V1, V, and X = non-Jamo or L. The same holds with T in
    place of the final V. However, the
    disadvantage of this approach is that you do have to contract all of
    the possible sequences of L's that you care about.

    You don't have to contract sequences of T's or V's, since those are
    taken care of by the T < V weighting. E.g.

    L V1 X
    L V1 T
    L V1 V
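
    The ordering above can be checked with a small sketch. It is an
    illustration under assumed weights (none of the numbers or labels come
    from the message or from UTS #10), and it assumes every sequence of L's
    has already been contracted to a single weight, so no terminator is
    used. The only property relied on is non-Jamo, L < T < V.

```python
# Hypothetical weights with non-Jamo and L below T, and T below V.
W = {
    "X_NONJAMO": 50,
    "L_KIYEOK": 101, "L_NIEUN": 102,
    "T_KIYEOK": 801,
    "V_A": 901, "V_O": 951,
}


def primary_key(units):
    """No terminator: each unit contributes its single primary weight."""
    return [W[u] for u in units]


LV_X = ("L_KIYEOK", "V_O", "X_NONJAMO")        # L V1 X, X = non-Jamo
LV_L = ("L_KIYEOK", "V_O", "L_NIEUN", "V_A")   # L V1 X, X = next syllable's L
LV_T = ("L_KIYEOK", "V_O", "T_KIYEOK")         # L V1 T
LV_V = ("L_KIYEOK", "V_O", "V_A")              # L V1 V

ordered = sorted([LV_V, LV_T, LV_L, LV_X], key=primary_key)
print(ordered == [LV_X, LV_L, LV_T, LV_V])
```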

    > enumerating all equivalent sequences but just giving primary weights to
    > only 'basic' Jamos and requiring a preprocessing in which cluster jamos
    > are decomposed into sequences of basic Jamos.

    Preprocessing (on a string basis) is *deadly* for performance. It is
    also not necessary. The weight tables already allow characters to
    expand, and that is what would be done in this case: it is just 1a above.

    Märk Dāvĭs
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Jungshik Shin" <jshin@mailaps.org>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Sent: Friday, May 16, 2003 15:33
    Subject: Re: Proposed Update of UTS #10: Unicode Collation Algorithm

    > On Mon, 12 May 2003, Mark Davis wrote:
    >
    > Thank you for your detailed reply.
    >
    > > > are not listed in 7.1.4). However, it seems to me that some of these
    > > > customizations/tailoring in 7.1.4 are not necessary if an additional
    > > > step of preprocessing (in which cluster jamos are decomposed into
    > > > sequences of basic jamos) is taken as was proposed by Kent in his
    > > > paper in 2001-2002.
    > >
    > > It is also a question of cost. Rearranging the weights so that
    > > T < V < L doesn't cost anything in implementations of the algorithm.
    >
    > Yes, I'm also thinking in terms of cost and flexibility. I have
    > no objection to rearranging the weights so that T < V < L and didn't
    > express any in my previous message. That's a very good idea.
    >
    > What I don't like is the inflexibility of having to collect all the
    > known occurrences of cluster Jamos and giving each of them the primary
    > weight in such a way (interleaving) that they can get collated the way
    > expected by (South) Koreans. When a new cluster jamo is added to the
    > repertoire, it's likely that tailoring has to be made again. It wouldn't
    > cost anything at run time, but it costs something to retailor
    > them. Because it's rare that we have to add new clusters, this may not
    > be a realistic concern. Still I find it rather inelegant and not in line
    > with the basic principles of Korean script that its inventors had in
    > mind.
    >
    >
    > > Terminating each of the subclusters would.
    >
    > With T < V < L, why would we need to terminate L+, V+ and T+ separately
    > instead of just 'L+V+T*' as a whole? Or am I missing something obvious?
    >
    > > > As for condition B.2 in 7.1.4, an alternative to that is just adding
    > > > a terminator primary weight to only Hangul syllables without optional
    > > > T('s). This terminator primary weight should be less than the primary
    > > > weight for any Ts (and that of any V's and Ls by condition A.)
    >
    > I was wrong. Any syllable, with or without optional T('s), has
    > to be terminated.
    >
    > > > As for condition B.1.a, I'm wondering why only L's are mentioned.
    > > > The same (contraction) should be applied to multiple V's and T's as
    > > > well.
    > > > In addition, in the paragraph that begins with
    > >
    > > 2. The same goes for:
    > >
    > > L V T
    > > L V V
    > >
    > > With all V's greater than all T's, then any sequences that are equal
    > > up to the T/V comparison will take the right ordering.
    >
    > Well, I might not have been very clear that I wasn't so much
    > concerned with the handling of inter-syllable (or Hangul syllable
    > followed by non-Hangul) issues as with intra-syllable (or more precisely,
    > 'inter-vowel', 'inter-leading consonants', and 'inter-trailing
    > consonants') issues, because the former are already well taken care of
    > by one prescription or another suggested in the draft.
    >
    > What I was questioning was why the *contraction* (that should be
    > applied to sequences of V's and T's as well as sequences of L's) is
    > mentioned only about sequences of L's. For instance, suppose that
    > we have three sequences LV1, LV2, and LV1V4 where V2 is a cluster of
    > V1 and V3 and that the desired collation among them is LV1 < LV2 (=
    > LV1V3) < LV1V4. Without contracting V1V4 and giving it an independent
    > primary weight larger than that of V2, they'd be sorted LV1 < LV1V4 <
    > LV2, instead. As a real example, consider the following three sequences.
    >
    > S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
    >     U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    > S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
    >     U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    > S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
    >     U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    >
    > With the primary weights of each Jamo given as following,
    >
    > U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
    > U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
    > U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
    > U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
    > U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
    > U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101
    >
    > their primary weight sequences will be [301,251,101], [301,255,101] and
    > [301,251,231,101], respectively and they'll be sorted S1 < S3 < S2
    > instead of the correct S1 < S2 < S3 if there's no contraction applied to
    > the 'U+1169 (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL JUNGSEONG YA)'
    > sequence. By contracting 'U+1169 (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL
    > JUNGSEONG YA)' and giving it an independent primary weight larger than
    > that of U+116A (ㅘ:HANGUL JUNGSEONG WA) 255 (say, 257), they will be
    > sorted S1 < S2 < S3.
    >
    > However, we can avoid this entirely if we just decompose the cluster
    > vowel 'U+116A (ㅘ:HANGUL JUNGSEONG WA)' to 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
    > U+1161 (ㅏ:HANGUL JUNGSEONG A)' and do not give a primary weight to
    > it. Then we have <301, 251, 101>, <301, 251, 201, 101> and <301, 251,
    > 231, 101>, which leads them to collate S1 < S2 < S3 as desired.
    >
    >
    > < a good explanation about *inter-syllable* issues snipped >
    >
    > > > For condition B.1.a, this means that if L1 has a primary......
    > > >
    > > > I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and
    > > > LiLk where w(Li) < w(Lj). With that change, it's clear that B.1.a.
    > > > can be applied to cases like the one involving U+1105 (ᄅ : HANGUL
    > > > CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
    > > > and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL
    > > > CHOSEONG RIEUL-HIEUH).
    > >
    > > L1, L2 are simply variables standing for particular L's; the only
    > > reason for that is to stress where they are equal in two different
    > > cases. So it is just a terminology difference from Li, Lj.
    >
    > 'L1L1' would be interpreted as two identical Ls in a row (doublet of
    > L1). My point is that they can be different as well (see the example
    > given above). Using 'LiLk' (or L1L3 if you prefer) instead of 'L1L1'
    > makes it clear, doesn't it?
    >
    > > > Another missing part in my eyes is as to how to deal with U+111A(ᄚ :
    > > > HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
    > > > CHOSEONG RIEUL) and U+1112(ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
    > > > should be treated identically, but UTS 10(draft) is rather silent on
    > > > that perhaps deferring to tailorings.
    > >
    > > I agree that longer sequences should expand in weights to be
    > > equivalent, and that this should be done in the UCA. As I said, it is
    > > just taking a while working with WG20*, and in the meantime people
    > > need to tailor it.
    >
    > Thanks again for your effort to put things into order in cooperation
    > with WG20 and I hope WG20 will be able to work together with the UTC
    > about this issue soon.
    >
    > IMHO, the most elegant (not necessarily the most efficient and
    > sound from the engineering point of view [1]) way to do it is not
    > enumerating all equivalent sequences but just giving primary weights to
    > only 'basic' Jamos and requiring a preprocessing in which cluster jamos
    > are decomposed into sequences of basic Jamos. As mentioned above, in
    > addition to this, primary weights are assigned to satisfy the condition
    > that L > V > T > [syl_terminator], which is already listed in the draft.
    >
    > In a sense, this preprocessing (which is not a part of any Unicode
    > normalization) is similar to Thai/Lao reordering. Anyway, I'm hoping that
    > the normalization tailoring currently under review will be approved so
    > that we'll be able to represent/deal with Korean script in Unicode in
    > a way that is more in line with what inventors of the script envisioned
    > in the 15th century than we can now.
    >
    > Jungshik
    >
    >
    > [1] UTS #10 can mention that if the repertoire of Hangul cluster jamos
    > is known a priori, the preprocessing can be avoided by a tailoring in
    > which all cluster jamos in the repertoire are contracted and assigned
    > independent primary weights that interleave with basic Jamos. This is
    > rather similar to what it mentions about a possible shortcut that can be
    > taken about Hangul precomposed syllables when no Hangul Jamo is present
    > in the repertoire.
    >
    >
    >
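
    The S1/S2/S3 example in the quoted message can be reproduced with a
    short sketch. The primary weights (301, 201, 231, 251, 255, 101) and
    the contraction weight 257 are the ones given in the quoted message;
    the helper names and the lookup strategy are assumptions made for
    illustration, not how UTS #10 or any particular implementation does it.

```python
KIYEOK_L, A, YA, O, WA, KIYEOK_T = (
    "\u1100", "\u1161", "\u1163", "\u1169", "\u116A", "\u11A8")

SEQUENCES = {
    "S1": KIYEOK_L + O + KIYEOK_T,
    "S2": KIYEOK_L + WA + KIYEOK_T,
    "S3": KIYEOK_L + O + YA + KIYEOK_T,
}

# Weights as listed in the quoted message.
BASE = {KIYEOK_L: [301], A: [201], YA: [231], O: [251], WA: [255], KIYEOK_T: [101]}
# Fix 1: contract O + YA and give it a weight above WA.
CONTRACTED = {**BASE, O + YA: [257]}
# Fix 2: let WA expand to the weights of O + A instead of having its own weight.
DECOMPOSED = {**BASE, WA: [251, 201]}


def primary_key(text, table):
    """Try a two-character contraction first, then fall back to one character."""
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in table:
            key += table[pair]
            i += 2
        else:
            key += table[text[i]]
            i += 1
    return key


def order(table):
    """Return the sequence names sorted by primary key under the given table."""
    return sorted(SEQUENCES, key=lambda name: primary_key(SEQUENCES[name], table))


print(order(BASE))        # ['S1', 'S3', 'S2']
print(order(CONTRACTED))  # ['S1', 'S2', 'S3']
print(order(DECOMPOSED))  # ['S1', 'S2', 'S3']
```

    The decomposed table is an expansion inside the weight table (option 1a
    in the reply above), so it needs no separate pass over the string.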


