L2/03-291

Subject: Public issue #14 Unicode Collation Algorithm 4.0.0 Beta
Date/Time: Mon Aug 25 04:27:01 EDT 2003
Contact: jshin@jtan.com

(I'm also sending this to the Unicode mailing list because what's sent to the list may have a better format than what you end up getting after the form processing.)

Sorting Hangul letters (Jamos) according to the current version of allkeys.txt is rather like sorting Latin letters according to their Unicode 4.0 code points. Because this is well known, UTS #10 goes to some length to explain how to properly sort Hangul letters (Jamos). However, as it stands, there are a few issues to be clarified.

In mid-May this year, after a proposed update of UTS #10 had been posted, there was a thread of discussion about the treatment of Hangul letters (Jamos) in UCA. In the thread, I raised two issues: the interleaving issue, and the different treatment of cluster Jamos depending on whether they're given separate code points of their own in the U+1100 block or have to be represented as sequences of encoded Jamos. After a number of emails were exchanged, Mark Davis and I found that both of us are more or less on the same page as to how Hangul letters should be collated. In summary:

1. Weights for T, V, and L (trailing consonants, vowels, and leading consonants) should be assigned in such a way that T < V < L for all T's, V's, and L's.
2. Expand precomposed (cluster) Jamos into sequences of component basic Jamos.
3. Terminate every syllable with 'TERM', which has a lower weight than all T's. (There's an alternative to this, but both of us favor this over the alternative.)

While the Hangul collation issue is being worked out with ISO/IEC JTC1/SC22/WG20, I'd like the above tailoring (which is rather straightforward in my opinion) to be laid out clearly in UTS #10, along with alternatives (if the authors wish). I'm also wondering if an allkeys.txt with the above tailoring can be released.

Thank you for your consideration.

P.S. The following is a recap of emails exchanged about the issues.

JS> Specifically, U+1102 HANGUL CHOSEONG NIEUN, U+1103 HANGUL
JS> CHOSEONG TIKEUT and U+1113 HANGUL CHOSEONG NIEUN-KIYEOK are given
JS> the primary weights of 1832, 1833 and 1844, respectively. With these,
JS> U+1113 HANGUL CHOSEONG NIEUN-KIYEOK will be sorted after U+1103
JS> HANGUL CHOSEONG TIKEUT, right? Or am I missing something (I
JS> haven't read UTS #10 through, yet)?
JS>
JS> The order is different from the way (South) Koreans (at least, most
JS> Korean dictionary editors) expect them to be sorted. We expect U+1113
JS> HANGUL CHOSEONG NIEUN-KIYEOK (and other cluster consonants whose
JS> first component is U+1102 HANGUL CHOSEONG NIEUN. They're U+1114
JS> HANGUL CHOSEONG SSANGNIEUN, U+1115 HANGUL CHOSEONG
JS> NIEUN-TIKEUT, U+1116 HANGUL CHOSEONG NIEUN-PIEUP) to be put after
JS> U+1102 HANGUL CHOSEONG NIEUN but before U+1103 HANGUL CHOSEONG
JS> TIKEUT. The same is true of any cluster Jamos.

JS> In the first approach, the treatment of cluster Jamos depends on
JS> whether they're assigned separate code points or not. For instance,
JS> U+1113 (HANGUL CHOSEONG NIEUN-KIYEOK) is treated in a different
JS> way from a cluster Jamo (HANGUL CHOSEONG NIEUN-SIOS) of which the only
JS> possible representation is the sequence of U+1102 (HANGUL CHOSEONG
JS> NIEUN) and U+1109 (HANGUL CHOSEONG SIOS) [1]. Moreover, depending on
JS> implementations, U+1113 (HANGUL CHOSEONG NIEUN-KIYEOK) and the
JS> sequence of U+1102 (HANGUL CHOSEONG NIEUN) and U+1100 (HANGUL
JS> CHOSEONG KIYEOK) can be treated differently. This is in contrast
JS> to the treatment of Latin/Greek/Cyrillic letters with diacritic marks.
JS> For them, whether precomposed letters (base + diacritic marks) are
JS> separately encoded or not and whether they're represented by precomposed
JS> characters or base + diacritics don't affect their collation.

Mark Davis responded to that as follows:

MD> 1.
If you reorder all T < V < L, then when you get a sequence:
MD>
MD>     L V
MD>     L L
MD>
MD> and the L's are equal, then the second is always greater.
MD>
MD> 2. The same goes for:
MD>
MD>     L V T
MD>     L V V
MD>
MD> With all V's greater than all T's, then any sequences that are equal
MD> up to the T/V comparison will take the right ordering.
MD>
MD> 3. The problem is then only with sequences like:
MD>
MD>     L V X
MD>     L V T
MD>
MD> If X is not a Jamo, or starts a new syllable, then you have to make
MD> sure that X is always less than T. There are two ways to do this:
MD>
MD> 3a. terminate every syllable.
MD> 3b. make V & T higher than all X (including L).

JS> Another missing part in my eyes is as to how to deal with U+111A
JS> (HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105 (HANGUL
JS> CHOSEONG RIEUL) and U+1112 (HANGUL CHOSEONG HIEUH). IMO, they
JS> should be treated identically, but UTS 10 (draft) is rather silent on
JS> that perhaps deferring to tailorings.

Further along, he also wrote, in response to my question (as shown right above), that [1]

MD> 1. For the "precomposed" jamos, there are two solutions.
MD>
MD> Suppose we have:
MD>
MD>     U+1105 (HANGUL CHOSEONG RIEUL) => X
MD>     U+1112 (HANGUL CHOSEONG HIEUH) => Y
MD>
MD> a. decompose them.
MD>
MD>     U+111A (HANGUL CHOSEONG RIEUL-HIEUH) => X Y
MD>
MD> b. interleave them and treat their constituent sequences as
MD> contractions.
MD>
MD>     U+111A (HANGUL CHOSEONG RIEUL-HIEUH) => X'
MD>     U+1105 (HANGUL CHOSEONG RIEUL), U+1112 (HANGUL CHOSEONG HIEUH)
MD>         => X'

In addition, he wrote that he's more in favor of (a) than (b). I also wrote that I prefer (a) to (b) because of the following problem with (b).

JS> What I don't like is the inflexibility of having to collect all the
JS> known occurrences of cluster Jamos and giving each of them the
JS> primary weight in such a way (interleaving) that they can get
JS> collated the way expected by (South) Koreans

Mark also wrote the following, which I missed at first.
As a result, I wrote some more articles [2] until it was finally clarified in the last article in the thread [3].

MD> I agree that longer sequences should expand in weights to be
MD> equivalent, and that this should be done in the UCA. As I said, it is
MD> just taking a while working with WG20*, and in the meantime people
MD> need to tailor it.

[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0362.html

[2] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0364.html

JS> To take the same example as I took in my previous email, I don't see
JS> how S1, S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
JS> without contracting the sequence of 'U+1169 (HANGUL JUNGSEONG O)
JS> U+1163 (HANGUL JUNGSEONG YA)'?
JS>
JS> S1: U+1100 (HANGUL CHOSEONG KIYEOK) U+1169 (HANGUL JUNGSEONG O)
JS>     U+11A8 (HANGUL JONGSEONG KIYEOK)
JS> S2: U+1100 (HANGUL CHOSEONG KIYEOK) U+116A (HANGUL JUNGSEONG WA)
JS>     U+11A8 (HANGUL JONGSEONG KIYEOK)
JS> S3: U+1100 (HANGUL CHOSEONG KIYEOK) U+1169 (HANGUL JUNGSEONG O)
JS>     U+1163 (HANGUL JUNGSEONG YA) U+11A8 (HANGUL JONGSEONG KIYEOK)
JS>
JS> where the primary weights of each Jamo are given as following,
JS>
JS> U+1100 (HANGUL CHOSEONG KIYEOK)  : 301
JS> U+1161 (HANGUL JUNGSEONG A)      : 201
JS> U+1163 (HANGUL JUNGSEONG YA)     : 231
JS> U+1169 (HANGUL JUNGSEONG O)      : 251
JS> U+116A (HANGUL JUNGSEONG WA)     : 255
JS> U+11A8 (HANGUL JONGSEONG KIYEOK) : 101

[3] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0426.html
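
As a rough illustration of the three-point tailoring summarized at the top (T < V < L, expansion of cluster Jamos, and a syllable terminator), the following Python sketch applies it to S1, S2 and S3 using the primary weights quoted above. The TERM value of 1, the EXPANSIONS table, and the function names are made up for illustration; none of them comes from UTS #10 or allkeys.txt. The syllable-boundary test is a simplification that only looks for an L following a V or T.

```python
# Primary weights copied from the example above; they satisfy T < V < L.
WEIGHTS = {
    0x1100: 301,  # HANGUL CHOSEONG KIYEOK (L)
    0x1161: 201,  # HANGUL JUNGSEONG A (V)
    0x1163: 231,  # HANGUL JUNGSEONG YA (V)
    0x1169: 251,  # HANGUL JUNGSEONG O (V)
    0x116A: 255,  # HANGUL JUNGSEONG WA (V), used only when not expanded
    0x11A8: 101,  # HANGUL JONGSEONG KIYEOK (T)
}

TERM = 1  # assumed terminator weight, lower than every T weight

# Illustrative expansion table: WA is the cluster O + A.
EXPANSIONS = {0x116A: (0x1169, 0x1161)}


def is_leading(cp):
    """True for code points in the leading-consonant range of the U+1100 block."""
    return 0x1100 <= cp <= 0x115F


def primary_key(code_points, expand=True):
    """Return a tuple of primary weights for a sequence of Jamo code points."""
    jamos = []
    for cp in code_points:
        # Point 2: replace a cluster Jamo by its component basic Jamos.
        jamos.extend(EXPANSIONS.get(cp, (cp,)) if expand else (cp,))

    key = []
    for i, cp in enumerate(jamos):
        # Point 3: an L that follows a V or T starts a new syllable,
        # so the previous syllable is closed with TERM first.
        if i > 0 and is_leading(cp) and not is_leading(jamos[i - 1]):
            key.append(TERM)
        key.append(WEIGHTS[cp])
    if jamos:
        key.append(TERM)  # terminate the final syllable
    return tuple(key)


S1 = (0x1100, 0x1169, 0x11A8)          # KIYEOK, O, KIYEOK
S2 = (0x1100, 0x116A, 0x11A8)          # KIYEOK, WA, KIYEOK
S3 = (0x1100, 0x1169, 0x1163, 0x11A8)  # KIYEOK, O, YA, KIYEOK

with_expansion = sorted([S3, S2, S1], key=primary_key)
without_expansion = sorted([S3, S2, S1],
                           key=lambda s: primary_key(s, expand=False))
# with_expansion    == [S1, S2, S3]
# without_expansion == [S1, S3, S2]
```

With WA expanded to O + A, the three strings come out as S1 < S2 < S3 without any contraction for the O + YA sequence; with WA kept at its own weight of 255, the result is S1 < S3 < S2, which is the order the quoted mail objects to.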