UTS #10 : comment on Hangul Jamo(Letter) collation

From: Jungshik Shin (jshin@mailaps.org)
Date: Mon Aug 25 2003 - 12:38:17 EDT

    Hello,

    I've just submitted via the web feedback form at Unicode.org the following
    comment on Hangul letter (Jamo) collation in UTS #10. I believe most, if
    not all, issues were resolved at least between Mark and me back in May,
    but nonetheless I guess it has to be formally submitted to be considered
    by the UTC. I'm also sending it to the Unicode list and WG20 list because
    I'm afraid in the web form, lines were wrapped rather badly, which makes
    it a bit hard to read my submission.

    Jungshik

    P.S. My email forwarding service provider is having trouble keeping its
    machine up under the flood of emails (infected with W32/Sobig.F) I've been
    getting (at the peak, 50 per minute). I was taken off the unicode
    list last weekend and had to resubscribe. Please use
    jshin aet i18nl10 daht com if you want to reply to me off-line.

    P.P.S. My comment is geared toward the collation as widely used in South
    Korea. North Korea uses a different sorting order, which requires
    a separate tailoring as outlined in Kent's work.

    Enc. my comment on UTS #10.

    Re: Public issue #14 Unicode Collation Algorithm 4.0.0 Beta

    Sorting Hangul letters (Jamos) according to the current version
    of allkeys.txt is rather like sorting Latin letters according to
    the Unicode 4.0 code points. Because this is well known, UTS #10
    goes to some length to explain how to properly sort Hangul letters (Jamos).
    However, as it stands, there are a few issues to be clarified.

    In mid-May this year, after a proposed update of UTS #10 had been posted,
    there was a thread of discussion about the treatment of Hangul letters
    (Jamos) in UCA. In the thread, I raised two issues: the interleaving
    issue, and the different treatment of cluster Jamos depending on whether
    they're given separate code points of their own in the U+1100 block or
    have to be represented as sequences of encoded Jamos.

    After an exchange of emails, Mark Davis and I found that we are more
    or less on the same page as to how Hangul letters should be collated.
    In summary,

      1. Weights for T, V, and L should be assigned in such a way that
         T < V < L for all T, V, and L's

      2. Expand precomposed (cluster) Jamos into sequences of component
         basic Jamos

      3. Terminate every syllable with 'TERM' that has a lower weight than
         all T's (there's an alternative to this, but both of us favor this
         over the alternative)
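
    The three rules can be sketched in a few lines of Python. This is only
    an illustration, not the UCA implementation: the weight bands, the TERM
    value, the placeholder weight for non-Jamo characters and the small
    EXPANSIONS table are all assumptions made for the example, and the
    input is assumed to consist of conjoining Jamos already (precomposed
    syllables would have to be decomposed first).

```python
# Hypothetical weight bands; only the relative order matters:
# TERM < every T < every V < every L.
TERM = 0x001
T_BASE, V_BASE, L_BASE = 0x100, 0x200, 0x300

# A few cluster Jamos and their basic components (illustrative subset only).
EXPANSIONS = {
    "\u1113": "\u1102\u1100",  # NIEUN-KIYEOK -> NIEUN + KIYEOK
    "\u111A": "\u1105\u1112",  # RIEUL-HIEUH  -> RIEUL + HIEUH
    "\u116A": "\u1169\u1161",  # WA           -> O + A
}


def kind(ch):
    """Classify a character as leading (L), vowel (V), trailing (T) or None."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    return None


def weight(ch):
    """Rule 1: place each Jamo in its band so that T < V < L."""
    cp, k = ord(ch), kind(ch)
    if k == "L":
        return L_BASE + cp - 0x1100
    if k == "V":
        return V_BASE + cp - 0x1160
    if k == "T":
        return T_BASE + cp - 0x11A8
    return 0x1000 + cp  # non-Jamo: arbitrary placeholder weight


def sort_key(text):
    # Rule 2: expand cluster Jamos into their basic components.
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text)
    key, prev = [], None
    for ch in expanded:
        cur = kind(ch)
        # Rule 3: a syllable ends where a non-Jamo follows it,
        # or where an L follows a V or T.
        if prev is not None and (cur is None or (cur == "L" and prev != "L")):
            key.append(TERM)
        key.append(weight(ch))
        prev = cur
    if prev is not None:
        key.append(TERM)  # terminate the last syllable
    return key


GA = "\u1100\u1161"               # KIYEOK + A
GAG = "\u1100\u1161\u11A8"        # KIYEOK + A + final KIYEOK
GA_NA = "\u1100\u1161\u1102\u1161"  # KIYEOK + A, NIEUN + A
ordered = sorted([GAG, GA_NA, GA], key=sort_key)
```

    With these assumed weights, `ordered` comes out as GA, GA_NA, GAG,
    and U+1113 produces the same key as the sequence U+1102 U+1100.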

    While the Hangul collation issue is being worked out with ISO/IEC
    JTC1/SC22/WG20, I'd like the above tailoring (which is rather straightforward
    in my opinion) to be laid out clearly in UTS #10, along with alternatives
    (if the authors wish). I'm also wondering if an allkeys.txt with the
    above tailoring can be released.

    Thank you for your consideration.

    P.S. The following is a recap of emails exchanged about the issues.

    JS> Specifically, U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, U+1103 (ᄃ) HANGUL
    JS> CHOSEONG TIKEUT and U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK are given
    JS> the primary weight of 1832, 1833 and 1844, respectively. With these,
    JS> U+1113 (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK will be sorted after U+1103
    JS> (ᄃ) HANGUL CHOSEONG TIKEUT, right? Or am I missing something (I
    JS> haven't read UTS #10 through, yet)?
    JS>
    JS> The order is different from the way (South) Koreans (at least, most
    JS> Korean dictionary editors) expect them to be sorted. We expect U+1113
    JS> (ᄓ) HANGUL CHOSEONG NIEUN-KIYEOK (and the other cluster consonants
    JS> whose first component is U+1102 (ᄂ) HANGUL CHOSEONG NIEUN, namely U+1114
    JS> (ᄔ) HANGUL CHOSEONG SSANGNIEUN, U+1115 (ᄕ) HANGUL CHOSEONG
    JS> NIEUN-TIKEUT and U+1116 (ᄖ) HANGUL CHOSEONG NIEUN-PIEUP) to be put after
    JS> U+1102 (ᄂ) HANGUL CHOSEONG NIEUN but before U+1103 (ᄃ) HANGUL CHOSEONG
    JS> TIKEUT. The same is true of any cluster Jamos.
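
    The ordering problem can be reproduced with the three primary weights
    quoted above. In this sketch only their relative order matters; the
    weight given to U+1100 KIYEOK is a hypothetical value, since it is not
    quoted in the message, and the expansion table is an illustration of
    the expected behaviour rather than anything taken from allkeys.txt.

```python
NIEUN, TIKEUT, NIEUN_KIYEOK, KIYEOK = "\u1102", "\u1103", "\u1113", "\u1100"

PRIMARY = {
    NIEUN: 1832,         # quoted above
    TIKEUT: 1833,        # quoted above
    NIEUN_KIYEOK: 1844,  # quoted above
    KIYEOK: 1830,        # hypothetical: not given in the message
}

# Illustrative expansion of the cluster Jamo into its basic components.
EXPANSIONS = {NIEUN_KIYEOK: NIEUN + KIYEOK}


def key_as_is(s):
    """One primary weight per code point, as in the current table."""
    return [PRIMARY[ch] for ch in s]


def key_expanded(s):
    """Expand cluster Jamos first, then look up the basic Jamos."""
    return [PRIMARY[c] for ch in s for c in EXPANSIONS.get(ch, ch)]


letters = [NIEUN_KIYEOK, TIKEUT, NIEUN]
by_weight = sorted(letters, key=key_as_is)
by_expansion = sorted(letters, key=key_expanded)

print([f"U+{ord(ch):04X}" for ch in by_weight])     # U+1102, U+1103, U+1113
print([f"U+{ord(ch):04X}" for ch in by_expansion])  # U+1102, U+1113, U+1103
```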

    JS> In the first approach, the treatment of cluster Jamos depends on
    JS> whether they're assigned separate code points or not. For instance,
    JS> U+1113(ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) is treated in a different
    JS> way from a cluster Jamo (HANGUL CHOSEONG NIEUN-SIOS) of which the only
    JS> possible representation is the sequence of U+1102(ᄂ : HANGUL CHOSEONG
    JS> NIEUN) and U+1109(ᄉ : HANGUL CHOSEONG SIOS) [1]. Moreover, depending on
    JS> implementations, U+1113(ᄓ : HANGUL CHOSEONG NIEUN-KIYEOK) and the
    JS> sequence of U+1102(ᄂ : HANGUL CHOSEONG NIEUN) and U+1100 (ᄀ : HANGUL
    JS> CHOSEONG KIYEOK) can be treated differently. This is in contrast
    JS> to the treatment of Latin/Greek/Cyrillic letters with diacritic marks.
    JS> For them, neither whether precomposed letters (base + diacritic marks)
    JS> are separately encoded nor whether they're represented by precomposed
    JS> characters or base + diacritics affects their collation.
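
    The contrast can be checked with Python's standard unicodedata module.
    A precomposed Latin letter is canonically equivalent to its base +
    diacritic sequence, so normalization already makes the two identical
    before any weights are looked up. U+1113 has no canonical decomposition,
    so normalization does not relate it to the sequence U+1102 U+1100, and
    the two can only be brought together by collation weights. The helper
    name below is made up for this sketch.

```python
import unicodedata


def canonically_equivalent(a, b):
    """True if the two strings have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


E_ACUTE = "\u00e9"                   # LATIN SMALL LETTER E WITH ACUTE
E_PLUS_ACUTE = "e\u0301"             # e + COMBINING ACUTE ACCENT
NIEUN_KIYEOK = "\u1113"              # HANGUL CHOSEONG NIEUN-KIYEOK
NIEUN_THEN_KIYEOK = "\u1102\u1100"   # NIEUN followed by KIYEOK

print(canonically_equivalent(E_ACUTE, E_PLUS_ACUTE))            # True
print(canonically_equivalent(NIEUN_KIYEOK, NIEUN_THEN_KIYEOK))  # False
```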

    Mark Davis responded to that as follows:

    MD> 1. If you reorder all T < V < L, then when you get a sequence:
    MD>
    MD> L V
    MD> L L
    MD>
    MD> and the L's are equal, then the second is always greater.
    MD>
    MD> 2. The same goes for:
    MD>
    MD> L V T
    MD> L V V
    MD>
    MD> With all V's greater than all T's, then any sequences that are equal
    MD> up to the T/V comparison will take the right ordering.
    MD>
    MD> 3. The problem is then only with sequences like:
    MD>
    MD> L V X
    MD> L V T
    MD>
    MD> If X is not a Jamo, or starts a new syllable, then you have to make
    MD> sure that X is always less than T. There are two ways to do this:
    MD>
    MD> 3a. terminate every syllable.
    MD> 3b. make V & T higher than all X (including L).
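
    Mark's three cases can be checked with abstract weights. The sketch
    below is an illustration only: the band values and the TERM value are
    assumptions, and syllables are written directly as lists of
    (kind, rank) pairs rather than as code points. It covers option 3a
    (terminate every syllable), not 3b.

```python
# Hypothetical bands: every T < every V < every L, and TERM below all T.
TERM = 0
BAND = {"T": 100, "V": 200, "L": 300}


def key(syllables, terminate):
    """Build a primary key from a list of syllables.

    Each syllable is a list of (kind, rank) pairs; rank is the position
    of the Jamo within its own band.
    """
    out = []
    for syllable in syllables:
        out.extend(BAND[kind] + rank for kind, rank in syllable)
        if terminate:
            out.append(TERM)  # option 3a: terminate every syllable
    return out


L0, L2, V0, V1, T0 = ("L", 0), ("L", 2), ("V", 0), ("V", 1), ("T", 0)

# Case 1: L V sorts before L L when the first L's are equal.
case1 = key([[L0, V0]], True) < key([[L0, L2]], True)

# Case 2: L V T sorts before L V V.
case2 = key([[L0, V0, T0]], True) < key([[L0, V0, V1]], True)

# Case 3: L V followed by a new syllable (X = L) versus L V T.
lv_then_lv = [[L0, V0], [L2, V0]]
lvt = [[L0, V0, T0]]
case3_without_term = key(lv_then_lv, False) < key(lvt, False)
case3_with_term = key(lv_then_lv, True) < key(lvt, True)

print(case1, case2, case3_without_term, case3_with_term)
# True True False True
```

    Without the terminator, the L that starts the next syllable is compared
    against T and loses the "X less than T" property; with it, TERM is
    compared against T instead.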

    JS> Another missing part, in my eyes, is how to deal with U+111A(ᄚ :
    JS> HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
    JS> CHOSEONG RIEUL) and U+1112(ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
    JS> should be treated identically, but UTS #10 (draft) is rather silent on
    JS> that, perhaps deferring to tailorings.

    Further along, in response to my question (shown right above), he also
    wrote [1]:

    MD> 1. For the "precomposed" jamos, there are two solutions.
    MD>
    MD> Suppose we have:
    MD>
    MD> U+1105 (ᄅ: HANGUL CHOSEONG RIEUL) => X
    MD> U+1112 (ᄒ: HANGUL CHOSEONG HIEUH) => Y
    MD>
    MD> a. decompose them.
    MD>
    MD> U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X Y

    MD> b. interleave them and treat their constituent sequences as
    MD> contractions.
    MD>
    MD> U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X'
    MD> U+1105 (ᄅ: HANGUL CHOSEONG RIEUL), U+1112 (ᄒ: HANGUL CHOSEONG HIEUH)
    MD> => X'
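
    The two solutions can be compared side by side. In this sketch the
    numeric weights (including the interleaved weight standing in for X')
    are hypothetical, U+1106 MIEUM is included only to show where the
    cluster lands, and the longest-match loop is a simplification of how
    contractions are matched.

```python
RIEUL, MIEUM, HIEUH, RIEUL_HIEUH = "\u1105", "\u1106", "\u1112", "\u111A"

# Hypothetical primary weights for the basic Jamos (X = 50, Y = 180).
BASIC = {RIEUL: 50, MIEUM: 60, HIEUH: 180}

# (a) Decompose: the cluster Jamo expands to the weights X Y.
EXPANSIONS = {RIEUL_HIEUH: RIEUL + HIEUH}

# (b) Interleave: the cluster Jamo and its constituent sequence both map
# to one weight X' placed between RIEUL and MIEUM.
X_PRIME = 58
CONTRACTIONS = {RIEUL_HIEUH: X_PRIME, RIEUL + HIEUH: X_PRIME}


def key_decompose(text):
    expanded = "".join(EXPANSIONS.get(ch, ch) for ch in text)
    return [BASIC[ch] for ch in expanded]


def key_contract(text):
    table = {**BASIC, **CONTRACTIONS}
    key, i = [], 0
    while i < len(text):
        for n in (2, 1):  # try the longest match first
            chunk = text[i:i + n]
            if chunk in table:
                key.append(table[chunk])
                i += len(chunk)
                break
        else:
            raise KeyError(text[i])
    return key


for make_key in (key_decompose, key_contract):
    # Precomposed and sequence forms get the same key ...
    assert make_key(RIEUL_HIEUH) == make_key(RIEUL + HIEUH)
    # ... and the cluster sorts between RIEUL and MIEUM.
    assert make_key(RIEUL) < make_key(RIEUL_HIEUH) < make_key(MIEUM)
```

    Both give the same ordering here. The practical difference is that (b)
    needs an explicit table entry, with its own interleaved weight, for
    every cluster, while (a) only needs to know the components.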

    In addition, he wrote that he's more in favor of (a) than (b). I also wrote
    that I prefer (a) to (b) because of the following problem with (b).

    JS> What I don't like is the inflexibility of having to collect all the
    JS> known occurrences of cluster Jamos and give each of them a
    JS> primary weight in such a way (interleaving) that they get
    JS> collated the way expected by (South) Koreans.

    Mark also wrote the following, which I missed at first. As a result,
    I wrote some more articles [2] until it was finally clarified in
    the last article in the thread [3].

    MD> I agree that longer sequences should expand in weights to be
    MD> equivalent, and that this should be done in the UCA. As I said, it is
    MD> just taking a while working with WG20*, and in the meantime people
    MD> need to tailor it.

    [1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0362.html
    [2] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0364.html

    JS> To take the same example as I took in my previous email, I don't see
    JS> how S1,S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
    JS> without contracting the sequence of 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
    JS> U+1163 (ㅑ:HANGUL JUNGSEONG YA)'?
    JS>
    JS> S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
    JS> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    JS> S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
    JS> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    JS> S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
    JS> U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
    JS>
    JS> where the primary weights of each Jamo are given as follows,
    JS>
    JS> U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
    JS> U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
    JS> U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
    JS> U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
    JS> U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
    JS> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101
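
    The example can be worked through with the weights listed above. The
    first key function uses them as given; the second additionally assumes
    that U+116A WA is expanded into U+1169 O + U+1161 A before lookup, which
    is an illustration of the "expand cluster Jamos" approach and not
    something stated in the quoted message.

```python
# Primary weights exactly as listed in the message.
W = {
    "\u1100": 301,  # HANGUL CHOSEONG KIYEOK
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u1163": 231,  # HANGUL JUNGSEONG YA
    "\u1169": 251,  # HANGUL JUNGSEONG O
    "\u116A": 255,  # HANGUL JUNGSEONG WA
    "\u11A8": 101,  # HANGUL JONGSEONG KIYEOK
}

S1 = "\u1100\u1169\u11A8"
S2 = "\u1100\u116A\u11A8"
S3 = "\u1100\u1169\u1163\u11A8"

# Assumption for this sketch: WA is expanded into O + A.
EXPANSIONS = {"\u116A": "\u1169\u1161"}


def key_plain(s):
    """One weight per code point, no expansion."""
    return [W[ch] for ch in s]


def key_expanded(s):
    """Expand cluster vowels first, then look up the weights."""
    return [W[c] for ch in s for c in EXPANSIONS.get(ch, ch)]


names = {S1: "S1", S2: "S2", S3: "S3"}
plain = [names[s] for s in sorted(names, key=key_plain)]
expanded = [names[s] for s in sorted(names, key=key_expanded)]

print(plain)     # ['S1', 'S3', 'S2']
print(expanded)  # ['S1', 'S2', 'S3']
```

    With plain lookup, S3 sorts before S2 because 251 < 255 at the second
    position. With WA expanded, S2 and S3 share the weight 251 there and
    are separated by 201 (A) versus 231 (YA), so no contraction of
    O + YA is needed to get S1 < S2 < S3.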

    [3] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0426.html


