L2/03-291

Subject: Public issue #14 Unicode Collation Algorithm 4.0.0 Beta
Date/Time:   Mon Aug 25 04:27:01 EDT 2003
Contact:    jshin@jtan.com


(I'm also sending this to the Unicode mailing list because what's
sent to the list may be better formatted than what you end up getting
after the form processing.)

Sorting Hangul letters (Jamos) according to the current version
of allkeys.txt is rather like sorting Latin letters according to
their Unicode 4.0 code points. Because this is well known, UTS #10
goes to some length to explain how to sort Hangul letters (Jamos)
properly. However, as it stands, a few issues remain to be clarified.

In mid-May this year, after a proposed update of UTS #10 had been posted,
there was a thread of discussion about the treatment of Hangul letters
(Jamos) in UCA. In the thread, I raised two issues: interleaving, and the
different treatment of cluster Jamos depending on whether they are given
separate code points of their own in the U+1100 block or have to be
represented as sequences of encoded Jamos.

After an exchange of emails, Mark Davis and I found that we are more
or less on the same page as to how Hangul letters should be collated.
In summary,

 1. Weights for T, V, and L should be assigned in such a way that
   T < V < L for all T, V, and L's

 2. Expand precomposed (cluster) Jamos into sequences of component
   basic Jamos

 3. Terminate every syllable with 'TERM' that has a lower weight than
   all T's (there's an alternative to this, but both of us favor this
   over the alternative)
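
The three points above can be sketched as a small key builder. This is
only an illustration: the weight values, the partial expansion table, the
L range check (taken as U+1100..U+115F) and the names TERM, WEIGHTS,
EXPANSIONS, is_leading and primary_key are all assumptions made for this
example, not values from allkeys.txt or UTS #10.

```python
# Hypothetical primary weights, chosen only so that TERM < T < V < L.
TERM = 0x001
WEIGHTS = {
    0x11A8: 0x100,  # T: JONGSEONG KIYEOK
    0x1161: 0x200,  # V: JUNGSEONG A
    0x1100: 0x300,  # L: CHOSEONG KIYEOK
    0x1102: 0x301,  # L: CHOSEONG NIEUN
    0x1103: 0x302,  # L: CHOSEONG TIKEUT
}

# Rule 2: cluster jamos expand into their basic components (partial table).
EXPANSIONS = {
    0x1113: (0x1102, 0x1100),  # CHOSEONG NIEUN-KIYEOK -> NIEUN + KIYEOK
}


def is_leading(cp):
    """True for code points in the assumed leading-consonant (L) range."""
    return 0x1100 <= cp <= 0x115F


def primary_key(text):
    """Primary weights for a string made only of jamos in the tables above."""
    cps = []
    for ch in text:
        cps.extend(EXPANSIONS.get(ord(ch), (ord(ch),)))
    key = []
    for i, cp in enumerate(cps):
        key.append(WEIGHTS[cp])
        at_end = i + 1 == len(cps)
        # Rule 3: a syllable ends at the end of input, or where a V or T
        # is followed by an L.
        if at_end or (not is_leading(cp) and is_leading(cps[i + 1])):
            key.append(TERM)
    return key


# TIKEUT+A, NIEUN-KIYEOK+A, NIEUN+A, deliberately out of order.
words = ["\u1103\u1161", "\u1113\u1161", "\u1102\u1161"]
print(sorted(words, key=primary_key) ==
      ["\u1102\u1161", "\u1113\u1161", "\u1103\u1161"])
```

With these assumed weights, the syllable starting with NIEUN-KIYEOK falls
between the ones starting with NIEUN and TIKEUT, because its key begins
with the weight of NIEUN.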

While the Hangul collation issue is being worked out with ISO/IEC
JTC1/SC22/WG20, I'd like the above tailoring (which is rather straightforward
in my opinion) to be laid out clearly in UTS #10, along with alternatives
if the authors wish. I'm also wondering whether an allkeys.txt with the
above tailoring could be released.

Thank you for your consideration.

P.S. The following is a recap of emails exchanged about the issues.

JS> Specifically, U+1102 HANGUL CHOSEONG NIEUN, U+1103 HANGUL CHOSEONG
JS> TIKEUT and U+1113 HANGUL CHOSEONG NIEUN-KIYEOK are given the primary
JS> weights 1832, 1833 and 1844, respectively. With these, U+1113 HANGUL
JS> CHOSEONG NIEUN-KIYEOK will be sorted after U+1103 HANGUL CHOSEONG
JS> TIKEUT, right? Or am I missing something (I haven't read UTS #10
JS> through, yet)?
JS>
JS> The order is different from the way (South) Koreans (at least, most
JS> Korean dictionary editors) expect them to be sorted. We expect U+1113
JS> HANGUL CHOSEONG NIEUN-KIYEOK and the other cluster consonants whose
JS> first component is U+1102 HANGUL CHOSEONG NIEUN (U+1114 HANGUL
JS> CHOSEONG SSANGNIEUN, U+1115 HANGUL CHOSEONG NIEUN-TIKEUT and U+1116
JS> HANGUL CHOSEONG NIEUN-PIEUP) to be put after U+1102 HANGUL CHOSEONG
JS> NIEUN but before U+1103 HANGUL CHOSEONG TIKEUT. The same is true of
JS> any cluster Jamos.
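
As a small illustration of the point quoted above, sorting the three
letters by the quoted primary weights puts NIEUN-KIYEOK last rather than
between NIEUN and TIKEUT. Reading the weights as hexadecimal is an
assumption (the resulting order is the same if they are read as decimal),
and the labels and variable names are made up for this sketch.

```python
# Primary weights quoted above (read here as hexadecimal, an assumption).
PRIMARY = {
    "NIEUN (U+1102)": 0x1832,
    "TIKEUT (U+1103)": 0x1833,
    "NIEUN-KIYEOK (U+1113)": 0x1844,
}

# Order produced by comparing the single primary weights.
by_weight = sorted(PRIMARY, key=PRIMARY.get)

# Order described in the message as the one Korean dictionary editors expect.
expected = ["NIEUN (U+1102)", "NIEUN-KIYEOK (U+1113)", "TIKEUT (U+1103)"]

print(by_weight)
print(by_weight == expected)  # False
```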

JS> In the first approach, the treatment of cluster Jamos depends on
JS> whether they're assigned separate code points or not. For instance,
JS> U+1113 (HANGUL CHOSEONG NIEUN-KIYEOK) is treated differently from
JS> a cluster Jamo (HANGUL CHOSEONG NIEUN-SIOS) whose only possible
JS> representation is the sequence of U+1102 (HANGUL CHOSEONG NIEUN)
JS> and U+1109 (HANGUL CHOSEONG SIOS) [1]. Moreover, depending on the
JS> implementation, U+1113 (HANGUL CHOSEONG NIEUN-KIYEOK) and the
JS> sequence of U+1102 (HANGUL CHOSEONG NIEUN) and U+1100 (HANGUL
JS> CHOSEONG KIYEOK) can be treated differently. This is in contrast
JS> to the treatment of Latin/Greek/Cyrillic letters with diacritic marks.
JS> For them, neither whether precomposed letters (base + diacritic marks)
JS> are separately encoded nor whether they're represented by precomposed
JS> characters or base + diacritics affects their collation.


Mark Davis responded to that as follows:

MD> 1. If you reorder all T < V < L, then when you get a sequence:
MD>
MD> L V
MD> L L
MD>
MD> and the L's are equal, then the second is always greater.
MD>
MD> 2. The same goes for:
MD>
MD> L V T
MD> L V V
MD>
MD> With all V's greater than all T's, then any sequences that are equal
MD> up to the T/V comparison will take the right ordering.
MD>
MD> 3. The problem is then only with sequences like:
MD>
MD> L V X
MD> L V T
MD>
MD> If X is not a Jamo, or starts a new syllable, then you have to make
MD> sure that X is always less than T. There are two ways to do this:
MD>
MD> 3a. terminate every syllable.
MD> 3b. make V & T higher than all X (including L).
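
Point 3 and option 3a above can be checked with a few lines. This is a
sketch under assumed weights: TERM, T, V, L1, L2 and the helper key are
hypothetical and only need to satisfy TERM < T < V < L.

```python
# Hypothetical primary weights satisfying TERM < T < V < L.
TERM, T, V, L1, L2 = 1, 10, 20, 30, 31


def key(syllables, terminate):
    """Flatten per-syllable weight lists, optionally appending TERM to each."""
    out = []
    for syllable in syllables:
        out.extend(syllable)
        if terminate:
            out.append(TERM)
    return out


two_syllables = [[L1, V], [L2, V]]   # L V followed by X = a new syllable
closed_syllable = [[L1, V, T]]       # L V T

# Without a terminator the new syllable's L outweighs T, so L V X sorts last.
print(key(two_syllables, False) > key(closed_syllable, False))  # True
# With a terminator, TERM < T puts L V X first, as point 3 requires.
print(key(two_syllables, True) < key(closed_syllable, True))    # True
```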

JS> Another missing part in my eyes is how to deal with U+111A (HANGUL
JS> CHOSEONG RIEUL-HIEUH) and the sequence of U+1105 (HANGUL CHOSEONG
JS> RIEUL) and U+1112 (HANGUL CHOSEONG HIEUH). IMO, they should be
JS> treated identically, but UTS 10 (draft) is rather silent on that,
JS> perhaps deferring to tailorings.

Further along, he also wrote, in response to my question (as shown right
above), that [1]

MD> 1. For the "precomposed" jamos, there are two solutions.
MD>
MD> Suppose we have:
MD>
MD> U+1105 (HANGUL CHOSEONG RIEUL) => X
MD> U+1112 (HANGUL CHOSEONG HIEUH) => Y
MD>
MD> a. decompose them.
MD>
MD> U+111A (HANGUL CHOSEONG RIEUL-HIEUH) => X Y
MD>
MD> b. interleave them and treat their constituent sequences as
MD> contractions.
MD>
MD> U+111A (HANGUL CHOSEONG RIEUL-HIEUH) => X'
MD> U+1105 (HANGUL CHOSEONG RIEUL), U+1112 (HANGUL CHOSEONG HIEUH)
MD> => X'
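
The two solutions quoted above can be compared side by side. The
following is a minimal sketch: the numeric values of X, Y and X_PRIME, the
two tables and the longest-match helper key are assumptions for
illustration, not entries from allkeys.txt.

```python
X, Y = 0x310, 0x320   # hypothetical primaries for RIEUL and HIEUH
X_PRIME = 0x315       # hypothetical interleaved primary for RIEUL-HIEUH

# (a) decomposition: the cluster maps to the weights of its components.
DECOMPOSE = {"\u1105": [X], "\u1112": [Y], "\u111A": [X, Y]}

# (b) interleaving: the cluster gets its own weight, and the equivalent
# sequence is listed as a contraction with the same weight.
CONTRACT = {
    "\u1105": [X],
    "\u1112": [Y],
    "\u111A": [X_PRIME],
    "\u1105\u1112": [X_PRIME],
}


def key(text, table):
    """Build a primary key by longest-match lookup in the given table."""
    out, i = [], 0
    longest = max(map(len, table))
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(text[i])
    return out


for table in (DECOMPOSE, CONTRACT):
    print(key("\u111A", table) == key("\u1105\u1112", table))  # True, True
```

Either table makes the precomposed letter and the sequence compare as
equal; the difference is that (b) needs one explicit entry per known
cluster and per equivalent sequence.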

In addition, he wrote that he's more in favor of (a) than (b). I also wrote
that I prefer (a) to (b) because of the following problem with (b).

JS> What I don't like is the inflexibility of having to collect all the
JS> known occurrence of cluster Jamos and giving each of them the
JS> primary weight in such a way (interleaving) that they can get
JS> collated the way expected by (South) Koreans 


Mark also wrote the following, which I missed at first. As a result,
I wrote some more articles [2] until it was finally clarified in
the last article in the thread [3].

MD> I agree that longer sequences should expand in weights to be
MD> equivalent, and that this should be done in the UCA. As I said, it is
MD> just taking a while working with WG20*, and in the meantime people
MD> need to tailor it.




[1] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0362.html
[2] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0364.html

JS> To take the same example as I took in my previous email, I don't see
JS> how S1, S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
JS> without contracting the sequence of 'U+1169 (HANGUL JUNGSEONG O)
JS> U+1163 (HANGUL JUNGSEONG YA)'?
JS>
JS>  S1: U+1100 (HANGUL CHOSEONG KIYEOK) U+1169 (HANGUL JUNGSEONG O)
JS>      U+11A8 (HANGUL JONGSEONG KIYEOK)
JS>  S2: U+1100 (HANGUL CHOSEONG KIYEOK) U+116A (HANGUL JUNGSEONG WA)
JS>      U+11A8 (HANGUL JONGSEONG KIYEOK)
JS>  S3: U+1100 (HANGUL CHOSEONG KIYEOK) U+1169 (HANGUL JUNGSEONG O)
JS>      U+1163 (HANGUL JUNGSEONG YA) U+11A8 (HANGUL JONGSEONG KIYEOK)
JS>
JS> where the primary weights of each Jamo are given as follows,
JS>
JS>  U+1100 (HANGUL CHOSEONG KIYEOK)  : 301
JS>  U+1161 (HANGUL JUNGSEONG A)      : 201
JS>  U+1163 (HANGUL JUNGSEONG YA)     : 231
JS>  U+1169 (HANGUL JUNGSEONG O)      : 251
JS>  U+116A (HANGUL JUNGSEONG WA)     : 255
JS>  U+11A8 (HANGUL JONGSEONG KIYEOK) : 101
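
The S1/S2/S3 example can be worked through with the weights quoted above.
The sketch below assumes that U+116A (WA) expands into U+1169 (O) followed
by U+1161 (A); the names W, EXPAND, S, key and order are made up for this
illustration.

```python
# Primary weights as quoted in the message above.
W = {
    0x1100: 301,  # CHOSEONG KIYEOK
    0x1161: 201,  # JUNGSEONG A
    0x1163: 231,  # JUNGSEONG YA
    0x1169: 251,  # JUNGSEONG O
    0x116A: 255,  # JUNGSEONG WA
    0x11A8: 101,  # JONGSEONG KIYEOK
}

# Assumed expansion: WA is composed of O followed by A.
EXPAND = {0x116A: (0x1169, 0x1161)}

S = {
    "S1": (0x1100, 0x1169, 0x11A8),
    "S2": (0x1100, 0x116A, 0x11A8),
    "S3": (0x1100, 0x1169, 0x1163, 0x11A8),
}


def key(cps, expand):
    """Primary key for a code point sequence, optionally expanding clusters."""
    out = []
    for cp in cps:
        parts = EXPAND.get(cp, (cp,)) if expand else (cp,)
        out.extend(W[p] for p in parts)
    return out


def order(expand):
    """Names of S1..S3 sorted by their primary keys."""
    return sorted(S, key=lambda name: key(S[name], expand))


print(order(expand=False))  # ['S1', 'S3', 'S2']
print(order(expand=True))   # ['S1', 'S2', 'S3']
```

Under that assumption, expanding WA is enough to get S1 < S2 < S3 with
T < V < L weights; in this sketch no contraction of O + YA is involved.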

[3] http://www.unicode.org/mail-arch/unicode-ml/y2003-m05/0426.html