Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (
Date: Mon May 12 2003 - 16:52:03 EDT


    > are not listed in 7.1.4). However, it seems to me that some of the
    > customizations/tailorings in 7.1.4 are not necessary if an additional
    > step of preprocessing (in which cluster jamos are decomposed into
    > sequences of basic jamos) is taken, as was proposed by Kent in his
    > paper in

    It is also a question of cost. Rearranging the weights so that T < V <
    L doesn't cost anything in implementations of the algorithm.
    Terminating each of the subclusters would.

    > As for condition B.2 in 7.1.4, an alternative to that is just adding
    > a terminator primary weight to only Hangul syllables without
    > T('s). This terminator primary weight should be less than the
    > weight for any Ts (and that of any V's and Ls by condition A.)
    > As for condition B.1.a, I'm wondering why only L's are mentioned.
    > The same (contraction) should be applied to multiple V's and T's as
    > In addition, in the paragraph that begins with

    1. If you reorder all T < V < L, then when you get a sequence:

    L V
    L L

    and the L's are equal, then the second is always greater.

    2. The same goes for:

    L V T
    L V V

    With all V's greater than all T's, any sequences that are equal
    up to the T/V comparison will take the right ordering.
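
    Cases 1 and 2 can be checked with a minimal sketch. The weight bands
    (100, 200, 300), the (kind, index) token form and the key() helper are
    hypothetical; only the relative order T < V < L comes from the text.

```python
# Hypothetical primary-weight bands; only the order T < V < L matters.
BASE = {"T": 100, "V": 200, "L": 300}

def key(jamos):
    """Map a list of (kind, index) jamo tokens to a tuple of primary weights."""
    return tuple(BASE[kind] + index for kind, index in jamos)

L1, V1, V2, T1 = ("L", 1), ("V", 1), ("V", 2), ("T", 1)

# Case 1: L V versus L L with equal leading L's. Every V is below every L,
# so the L L sequence always sorts higher.
assert key([L1, V2]) < key([L1, L1])

# Case 2: L V T versus L V V. Every T is below every V,
# so the L V V sequence always sorts higher.
assert key([L1, V1, T1]) < key([L1, V1, V1])
```

    Tuples compare element by element, which stands in here for a primary
    weight comparison; no extra weights are added to the keys.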

    3. The problem is then only with sequences like:

    L V X
    L V T

    If X is not a Jamo, or starts a new syllable, then you have to make
    sure that X is always less than T. There are two ways to do this:

    3a. terminate every syllable.
    3b. make V & T higher than all X (including L).

    #3b involves reversing #1: instead of ordering L higher, you have to
    make it lower. In that case, all multi-L sequences have to contract
    in order to get the right ordering.
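
    A minimal sketch of option 3b, assuming hypothetical weight bands with
    L lowest. The contracted weight for an L L pair (placed just above the
    weight of its first L) is an assumption of the sketch, not something
    taken from UTS #10.

```python
# Hypothetical primary-weight bands for option 3b: L lowest, then T, then V.
BASE = {"L": 100, "T": 200, "V": 300}

def key_plain(jamos):
    """One primary weight per jamo, no contractions."""
    return tuple(BASE[kind] + 10 * index for kind, index in jamos)

def key_contracting(jamos):
    """Like key_plain, but a pair of adjacent L's contracts into one weight.

    The contracted weight is an assumption of this sketch: it sits just
    above the weight of the first L, so that L1 < L1L1 < L2.
    """
    weights, i = [], 0
    while i < len(jamos):
        kind, index = jamos[i]
        if kind == "L" and i + 1 < len(jamos) and jamos[i + 1][0] == "L":
            weights.append(BASE["L"] + 10 * index + jamos[i + 1][1])
            i += 2
        else:
            weights.append(BASE[kind] + 10 * index)
            i += 1
    return tuple(weights)

L1, L2, V1, T1 = ("L", 1), ("L", 2), ("V", 1), ("T", 1)

# Case 3 works: an L starting a new syllable weighs less than any T.
assert key_plain([L1, V1, L2, V1]) < key_plain([L1, V1, T1])
# Case 1 breaks without contraction: L L V sorts before L V.
assert key_plain([L1, L1, V1]) < key_plain([L1, V1])
# Contracting the L L pair restores it: L V sorts before L L V.
assert key_contracting([L1, V1]) < key_contracting([L1, L1, V1])
```

    The sketch contracts only pairs of L's; longer runs would need the same
    treatment.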

    I originally favored #3b, but after considering the different factors,
    I now believe that #3a is better overall.
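
    Option 3a can be sketched in the same way. The terminator weight TERM,
    the weight bands and the key() helper are hypothetical; the only
    requirement taken from the discussion is that the terminator sorts
    below every T while T < V < L is kept.

```python
# Hypothetical primary-weight bands: terminator < T < V < L.
TERM = 50
BASE = {"T": 100, "V": 200, "L": 300}

def key(syllables, terminate):
    """Flatten syllables (lists of (kind, index) jamos) into primary weights,
    optionally appending the terminator weight after each syllable."""
    weights = []
    for syllable in syllables:
        weights.extend(BASE[kind] + index for kind, index in syllable)
        if terminate:
            weights.append(TERM)
    return tuple(weights)

L1, V1, T1 = ("L", 1), ("V", 1), ("T", 1)
open_then_next = [[L1, V1], [L1, V1]]  # L V followed by a new syllable (X = L)
closed = [[L1, V1, T1]]                # L V T

# Without termination, X = L outweighs T, so L V X sorts after L V T.
assert key(open_then_next, terminate=False) > key(closed, terminate=False)
# With termination, the terminator is compared against T and is lower,
# so L V X sorts before L V T whatever the weight of X is.
assert key(open_then_next, terminate=True) < key(closed, terminate=True)
```

    Each terminated syllable carries one extra weight in its key, which is
    the cost of this option.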

    > For condition B.1.a, this means that if L1 has a primary......
    > I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and
    > where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
    > be applied to cases like the one involving U+1105 (ᄅ : HANGUL
    > CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
    > and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL

    L1, L2 are simply variables standing for particular L's; the only
    reason for that is to stress where they are equal in two different
    cases. So it is just a terminology difference from Li, Lj.

    > Another missing part in my eyes is as to how to deal with U+111A(ᄚ :
    > HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
    > should be treated identically, but UTS 10(draft) is rather silent on
    > that perhaps deferring to tailorings.

    I agree that longer sequences should expand in weights to be
    equivalent, and that this should be done in the UCA. As I said, it is
    just taking a while working with WG20*, and in the meantime people
    need to tailor it.

    > > Thanks for bringing this interleaving issue up; we should add a
    > > description to section 7.1.4.
    > That will be nice.
    > [1] I'm not making up these sequences. MS Office XP and Uniscribe
    > this sequence (see
    > PARK Won-kyu with my help also has developed a GPL'd opentype font
    > that supports this sequence along with many others (and will release
    > a few more). There's a Mozilla patch to support them across
    > and Pango patch was/is being made.

    We know that: see (*) above.
