Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon May 12 2003 - 04:08:27 EDT

    On Sun, 11 May 2003, Mark Davis wrote:

    > Here is your question, reformatted to always include real characters
    > and names.*

      Thank you for reformatting. I have no problem adding real characters
    (naturally, it's a lot easier for me to type in real characters than code
    points), but some people have trouble with real characters in UTF-8 even
    on this list, so I just followed the safest way :-) (In particular,
    I hate to receive their responses mislabelling UTF-8 as ISO-8859-1 and
    other MIME charsets.) Well, this cannot be an excuse for not including
    the character names. (Perhaps I should write a simple Perl script to
    convert any Unicode character in a given range (the default would be any
    character above U+007F) to 'U+xxxx (real character) Unicode Name'.)
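
    A minimal sketch of such a converter, written in Python rather than
    Perl purely for illustration. The function name annotate and its
    threshold parameter are assumptions made for this example; character
    names come from Python's standard unicodedata module.

```python
import unicodedata


def annotate(text, threshold=0x7F):
    """Rewrite each character above `threshold` as 'U+XXXX (char) NAME'."""
    parts = []
    for ch in text:
        if ord(ch) > threshold:
            # Fall back to a placeholder for code points without a name.
            name = unicodedata.name(ch, "<unnamed>")
            parts.append(f"U+{ord(ch):04X} ({ch}) {name}")
        else:
            parts.append(ch)
    return "".join(parts)


# "\u1102" is HANGUL CHOSEONG NIEUN; plain ASCII passes through untouched.
example = annotate("see \u1102")
# example == "see U+1102 (\u1102) HANGUL CHOSEONG NIEUN"
```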

    > > Specifically, U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, U+1103 (ᄃ) HANGUL
    > > CHOSEONG TIKEUT and U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK are given
    > > the primary weight of 1832, 1833 and 1844, respectively. With these,
    > > U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK will be sorted after U+1103
    > > (ᄃ) HANGUL CHOSEONG TIKEUT, right? Or am I missing something (I
    > > haven't read UTS #10 through, yet)?

    > > The order is different from the way (South) Koreans (at least, most
    > > Korean dictionary editors) expect them to be sorted. We expect U+1113
    > > (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK (and other cluster consonants whose
    > > first component is U+1102 (ᄂ) HANGUL CHOSEONG NIEUN. They're U+1114
    > > (ᄔ) HANGUL CHOSEONG SSANGNIEUN, U+1115 (ᄕ) HANGUL CHOSEONG
    > > NIEUN-TIKEUT, U+1116 (ᄖ) HANGUL CHOSEONG NIEUN-PIEUP) to be put after
    > > U+1102 (ᄂ) HANGUL CHOSEONG NIEUN but before U+1103 (ᄃ) HANGUL CHOSEONG
    > > TIKEUT. The same is true of any cluster Jamos.

    > > Is it UTC's intention to leave the task of making Hangul Jamos
    > > collate in accordance with (South) Koreans' expectation to (South)
    > > Korean specific tailoring?

    > We have been trying to work with the WG20 committee to resolve them,
    > due to a desire to maintain synchrony with ISO 14651 in weights.

       Thank you for your effort in this regard.

    > In the meantime, the work-around is to tailor the Jamo characters to
    > interleave the characters properly,

      Another way is to decompose all cluster Jamos into a sequence of
    basic Jamos and assign weights to _only_ basic Jamos, which you don't
    seem to be very fond of apparently because their decomposition is not
    included even in the compatibility decomposition in Unicode 3.0 and up
    (although it was in Unicode 2.0). The difference between the two
    approaches is:

      In the first approach, the treatment of cluster Jamos depends on
    whether they're assigned separate code points or not. For instance,
    U+1113(ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) is treated in a different
    way from a cluster Jamo (HANGUL CHOSEONG NIEUN-SIOS) of which the only
    possible representation is the sequence of U+1102(ᄂ : HANGUL CHOSEONG
    NIEUN) and U+1109(ᄉ : HANGUL CHOSEONG SIOS) [1]. Moreover, depending on
    implementations, U+1113(ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) and the
    sequence of U+1102(ᄂ : HANGUL CHOSEONG NIEUN) and U+1100 (ᄀ : HANGUL
    CHOSEONG KIYEOK) can be treated differently. This is in contrast
    to the treatment of Latin/Greek/Cyrillic letters with diacritic marks.
    For them, whether precomposed letters (base + diacritic marks) are
    separately encoded or not and whether they're represented by precomposed
    characters or base + diacritics don't affect their collation.

      If we have the full/exhaustive list of all possible combinations
    of Jamo sequences (or we deal with a limited repertoire, as seems to be
    assumed), it's possible to assign weights in such a way that the two
    kinds of difference mentioned above vanish. Even if we don't
    (as is allowed in Unicode), you may have a clever method or two (that
    are not listed in 7.1.4). However, it seems to me that some of the
    customizations/tailorings in 7.1.4 are not necessary if an additional
    preprocessing step (in which cluster Jamos are decomposed into sequences
    of basic Jamos) is taken, as was proposed by Kent in his paper in 2001-2002.
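
    The preprocessing idea can be sketched roughly as follows. This is an
    illustration only: CLUSTER_TO_BASIC is a hypothetical, partial table
    (Unicode 3.0 and later define no such decomposition), and the code
    point of each basic Jamo merely stands in for its primary weight. The
    sketch covers isolated leading consonants and does not address how L,
    V and T weights interact inside a syllable.

```python
# Hypothetical, partial table: cluster choseong -> sequence of basic choseong.
CLUSTER_TO_BASIC = {
    "\u1113": "\u1102\u1100",  # NIEUN-KIYEOK -> NIEUN + KIYEOK
    "\u1114": "\u1102\u1102",  # SSANGNIEUN   -> NIEUN + NIEUN
    "\u1115": "\u1102\u1103",  # NIEUN-TIKEUT -> NIEUN + TIKEUT
    "\u1116": "\u1102\u1107",  # NIEUN-PIEUP  -> NIEUN + PIEUP
    "\u111A": "\u1105\u1112",  # RIEUL-HIEUH  -> RIEUL + HIEUH
}


def decompose(text):
    """Replace every cluster Jamo in the table by its basic Jamo sequence."""
    return "".join(CLUSTER_TO_BASIC.get(ch, ch) for ch in text)


def sort_key(text):
    # Assumption: the code point of a basic Jamo stands in for its weight.
    return tuple(ord(ch) for ch in decompose(text))


items = [
    "\u1103",        # TIKEUT
    "\u1113",        # NIEUN-KIYEOK (single code point)
    "\u1102",        # NIEUN
    "\u1102\u1109",  # NIEUN + SIOS (representable only as a sequence)
    "\u1116",        # NIEUN-PIEUP
]
ordered = sorted(items, key=sort_key)
```

    With these assumptions every NIEUN cluster lands between NIEUN and
    TIKEUT, and U+1113 gets the same key as the sequence U+1102 U+1100,
    whether or not the cluster has a code point of its own.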

    > and follow one of the approaches
    > in UCA 7.1.4 at
    > http://www.unicode.org/reports/tr10/tr10-10.html#Trailing_Weights.

      Actually, I read that part before writing my message, but I didn't
    mention it (deciding to write about details of that part later) partly
    because I don't see how that part _alone_ solves the issue I raised as
    you recognized.

    As for condition B.2 in 7.1.4, an alternative is to add a terminator
    primary weight only to Hangul syllables without the optional T's.
    This terminator primary weight should be less than the primary
    weight of any T (and that of any V or L, by condition A).
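
    A rough sketch of the terminator idea, for precomposed Hangul syllables
    only. All weights here are hypothetical and chosen so that
    TERMINATOR < T < V < L, which is what the parenthetical about
    condition A appears to assume; the function names are likewise made up
    for the example. The L/V/T indices come from the standard Hangul
    syllable arithmetic (base U+AC00, 21 vowels, 28 trailing positions).

```python
# Hypothetical primary weights with TERMINATOR < T < V < L.
TERMINATOR = 1
T_BASE, V_BASE, L_BASE = 100, 200, 300

S_BASE, V_COUNT, T_COUNT, S_COUNT = 0xAC00, 21, 28, 11172


def syllable_weights(ch, use_terminator):
    """Primary weights for one precomposed Hangul syllable."""
    index = ord(ch) - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l, rest = divmod(index, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    weights = [L_BASE + l, V_BASE + v]
    if t:
        weights.append(T_BASE + t)   # syllable has a trailing consonant
    elif use_terminator:
        weights.append(TERMINATOR)   # only syllables without T get it
    return weights


def key(text, use_terminator=True):
    out = []
    for ch in text:
        out.extend(syllable_weights(ch, use_terminator))
    return out


ga_ga = "\uAC00\uAC00"  # GA GA (two syllables, neither has a T)
gak = "\uAC01"          # GAK   (one syllable with trailing KIYEOK)
with_terminator = key(ga_ga) < key(gak)
without_terminator = key(ga_ga, False) < key(gak, False)
```

    Under these assumed weights, GA GA sorts before GAK only when the
    terminator is present; without it the second syllable's L weight is
    compared against the T weight of GAK and the order flips.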

    As for condition B.1.a, I'm wondering why only L's are mentioned.
    The same (contraction) should be applied to multiple V's and T's as well.
    In addition, in the paragraph that begins with

      For condition B.1.a, this means that if L1 has a primary......

    I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and LiLk
    where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
    be applied to cases like the one involving U+1105 (ᄅ : HANGUL
    CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
    and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL
    CHOSEONG RIEUL-HIEUH).

    Another missing part, in my eyes, is how to deal with U+111A(ᄚ :
    HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
    CHOSEONG RIEUL) and U+1112(ᄒ : HANGUL CHOSEONG HIEUH). IMO, they
    should be treated identically, but UTS #10 (draft) is rather silent on
    that, perhaps deferring to tailorings.

    > Thanks for bringing this interleaving issue up; we should add a
    > description to section 7.1.4.

      That will be nice.

    [1] I'm not making up these sequences. MS Office XP and Uniscribe support
    this sequence (see
    http://www.microsoft.com/typography/otfntdev/hangulot/appen.htm).
    PARK Won-kyu, with my help, has also developed a GPL'd OpenType font
    that supports this sequence along with many others (and will release
    a few more). There's a Mozilla patch to support them across platforms,
    and a Pango patch was/is being made.


