Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri May 16 2003 - 18:33:31 EDT

Next message: Jungshik Shin: "Re: Decimal separator with more than one character?"

Previous message: Ben Dougall: "Re: character groupings in various languages"
In reply to: Mark Davis: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Next in thread: Mark Davis: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Reply: Mark Davis: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"

On Mon, 12 May 2003, Mark Davis wrote:

Thank you for your detailed reply.

> > are not listed in 7.1.4). However, it seems to me that some of these
> > customizations/tailoring in 7.1.4 are not necessary if an additional step
> > of preprocessing (in which cluster jamos are decomposed into sequences of
> > basic jamos) is taken as was proposed by Kent in his paper in 2001-2002.
>
> It is also a question of cost. Rearranging the weights so that T < V <
> L doesn't cost anything in implementations of the algorithm.

Yes, I'm also thinking in terms of cost and flexibility. I have
no objection to rearranging the weights so that T < V < L and didn't
express any in my previous message. That's a very good idea.

What I don't like is the inflexibility of having to collect all the
known occurrences of cluster Jamos and give each of them a primary
weight in such a way (interleaving) that they get collated the way
(South) Koreans expect. When a new cluster jamo is added to the
repertoire, the tailoring will likely have to be redone. It wouldn't
cost anything at run time, but it does cost something to retailor
them. Because it's rare that we have to add new clusters, this may not
be a realistic concern. Still, I find it rather inelegant and not in line
with the basic principles of the Korean script that its inventors had in
mind.

> Terminating each of the subclusters would.

With T < V < L, why would we need to terminate L+, V+ and T+ separately
instead of just 'L+V+T*' as a whole? Or am I missing something obvious?

> > As for condition B.2 in 7.1.4, an alternative to that is just adding
> > a terminator primary weight to only Hangul syllables without optional
> > T('s). This terminator primary weight should be less than the primary
> > weight for any Ts (and that of any V's and Ls by condition A.)

I was wrong. Any syllable, with or without optional T('s), has
to be terminated.

> > As for condition B.1.a, I'm wondering why only L's are mentioned.
> > The same (contraction) should be applied to multiple V's and T's as well.
> > In addition, in the paragraph that begins with
>
> 2. The same goes for:
>
> L V T
> L V V
>
> With all V's greater than all T's, then any sequences that are equal
> up to the T/V comparison will take the right ordering.

Well, I might not have been very clear that I wasn't so much
concerned with the handling of inter-syllable issues (or a Hangul syllable
followed by non-Hangul) as with intra-syllable (or more precisely,
'inter-vowel', 'inter-leading consonant', and 'inter-trailing
consonant') issues, because the former are already well taken care of
by one prescription or another suggested in the draft.

What I was questioning was why the *contraction* (which should be
applied to sequences of V's and T's as well as sequences of L's) is
mentioned only for sequences of L's. For instance, suppose that
we have three sequences LV1, LV2, and LV1V4, where V2 is a cluster of
V1 and V3, and that the desired collation among them is LV1 < LV2 (=
LV1V3) < LV1V4. Without contracting V1V4 and giving it an independent
primary weight larger than that of V2, they'd be sorted LV1 < LV1V4 <
LV2 instead. As a real example, consider the following three sequences.

  S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
      U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
  S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
      U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
  S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
      U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)

With the primary weights of each Jamo given as follows,

  U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
  U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
  U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
  U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
  U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
  U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101

their primary weight sequences will be [301,251,101], [301,255,101] and
[301,251,231,101], respectively and they'll be sorted S1 < S3 < S2 instead
of the correct S1 < S2 < S3 if there's no contraction applied to 'U+1169
(ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL JUNGSEONG YA)' sequence.
By contracting 'U+1169 (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL
JUNGSEONG YA)' and giving it an independent primary weight larger than
that of U+116A (ㅘ:HANGUL JUNGSEONG WA) 255 (say, 257), they will be
sorted S1 < S2 < S3.
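
The comparison above can be checked mechanically. What follows is only a rough sketch in Python: it uses the illustrative weights from this message (not actual UCA table values), compares primary weights only, and the names WEIGHTS, CONTRACTIONS and primary_key are made up for the example.

```python
# Illustrative primary weights from this message (not real UCA table values).
WEIGHTS = {
    "\u1100": 301,  # HANGUL CHOSEONG KIYEOK
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u1163": 231,  # HANGUL JUNGSEONG YA
    "\u1169": 251,  # HANGUL JUNGSEONG O
    "\u116A": 255,  # HANGUL JUNGSEONG WA
    "\u11A8": 101,  # HANGUL JONGSEONG KIYEOK
}

# Hypothetical contraction: O + YA gets its own weight above WA (255).
CONTRACTIONS = {"\u1169\u1163": 257}


def primary_key(text, contractions=None):
    """Return the list of primary weights, trying two-jamo contractions first."""
    contractions = contractions or {}
    key, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in contractions:
            key.append(contractions[pair])
            i += 2
        else:
            key.append(WEIGHTS[text[i]])
            i += 1
    return key


S1 = "\u1100\u1169\u11A8"
S2 = "\u1100\u116A\u11A8"
S3 = "\u1100\u1169\u1163\u11A8"
NAMES = {S1: "S1", S2: "S2", S3: "S3"}

without = [NAMES[s] for s in sorted([S1, S2, S3], key=primary_key)]
with_contraction = [
    NAMES[s]
    for s in sorted([S1, S2, S3], key=lambda s: primary_key(s, CONTRACTIONS))
]
print(without)           # ['S1', 'S3', 'S2']
print(with_contraction)  # ['S1', 'S2', 'S3']
```

Python compares lists element by element, which is enough to stand in for a primary-level comparison here.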

However, we can avoid this entirely if we just decompose the cluster vowel
'U+116A (ㅘ:HANGUL JUNGSEONG WA)' to 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
U+1161 (ㅏ:HANGUL JUNGSEONG A)' and do not give it a primary weight of
its own. Then we have [301,251,101], [301,251,201,101] and
[301,251,231,101], which leads them to collate S1 < S2 < S3 as desired.
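
A sketch of the decomposition approach in Python, again with the illustrative weights given earlier. The DECOMPOSE table and the function names are assumptions made for the example, and only WA is listed; a real table would have to cover every cluster jamo.

```python
# Illustrative primary weights for basic jamos only (from this message).
BASIC_WEIGHTS = {
    "\u1100": 301,  # HANGUL CHOSEONG KIYEOK
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u1163": 231,  # HANGUL JUNGSEONG YA
    "\u1169": 251,  # HANGUL JUNGSEONG O
    "\u11A8": 101,  # HANGUL JONGSEONG KIYEOK
}

# Hypothetical preprocessing table: cluster jamo -> sequence of basic jamos.
# This is not a Unicode normalization form; only WA is listed here.
DECOMPOSE = {"\u116A": "\u1169\u1161"}  # WA -> O + A


def decompose_clusters(text):
    """Replace each cluster jamo with its sequence of basic jamos."""
    return "".join(DECOMPOSE.get(ch, ch) for ch in text)


def primary_key(text):
    """Primary weights after decomposition; cluster jamos have no own weight."""
    return [BASIC_WEIGHTS[ch] for ch in decompose_clusters(text)]


S1 = "\u1100\u1169\u11A8"
S2 = "\u1100\u116A\u11A8"
S3 = "\u1100\u1169\u1163\u11A8"
NAMES = {S1: "S1", S2: "S2", S3: "S3"}

print(primary_key(S2))  # [301, 251, 201, 101]
print([NAMES[s] for s in sorted([S3, S2, S1], key=primary_key)])
# ['S1', 'S2', 'S3']
```

No contraction entry is needed for O + YA in this variant, because WA is compared through its components.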

< a good explanation about *inter-syllable* issues snipped >

> > For condition B.1.a, this means that if L1 has a primary......
> >
> > I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and LiLk
> > where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
> > be applied to cases like the one involving U+1105 (ᄅ : HANGUL
> > CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
> > and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL
> > CHOSEONG RIEUL-HIEUH).
>
> L1, L2 are simply variables standing for particular L's; the only
> reason for that is to stress where they are equal in two different
> cases. So it is just a terminology difference from Li, Lj.

'L1L1' would be interpreted as two identical Ls in a row (doublet of
L1). My point is that they can be different as well (see the example
given above). Using 'LiLk'(or L1L3 if you prefer) instead of 'L1L1'
makes it clear, doesn't it?

> > Another missing part in my eyes is as to how to deal with U+111A(ᄚ :
> > HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
> > CHOSEONG RIEUL) and U+1112(ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
> > should be treated identically, but UTS 10(draft) is rather silent on
> > that perhaps deferring to tailorings.
>
> I agree that longer sequences should expand in weights to be
> equivalent, and that this should be done in the UCA. As I said, it is
> just taking a while working with WG20*, and in the meantime people
> need to tailor it.

Thanks again for your effort to put things into order in cooperation
with WG20, and I hope WG20 will be able to work together with the UTC
on this issue soon.

IMHO, the most elegant (not necessarily the most efficient or the
soundest from the engineering point of view [1]) way to do it is not
to enumerate all equivalent sequences but just to give primary weights
only to 'basic' Jamos and to require a preprocessing step in which cluster
jamos are decomposed into sequences of basic Jamos. As mentioned above, in
addition to this, primary weights are assigned to satisfy the condition
that L > V > T > [syl_terminator], which is already listed in the draft.
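
To illustrate why every syllable needs the terminator, here is a minimal Python sketch. It assumes cluster jamos have already been decomposed and that the input consists only of well-formed L+V+T* syllables. The weights and the TERMINATOR value are illustrative, and the segmentation simply uses the three ranges of the Hangul Jamo block.

```python
import re

# Illustrative weights only, chosen so that L > V > T > TERMINATOR.
WEIGHTS = {
    "\u1100": 301,  # HANGUL CHOSEONG KIYEOK
    "\u1161": 201,  # HANGUL JUNGSEONG A
    "\u11A8": 101,  # HANGUL JONGSEONG KIYEOK
}
TERMINATOR = 1  # hypothetical value, lower than every T weight

# One syllable = L+ V+ T*, using the Hangul Jamo block ranges.
SYLLABLE = re.compile("[\u1100-\u115F]+[\u1160-\u11A7]+[\u11A8-\u11FF]*")


def primary_key(text, terminate=True):
    """Primary weights for a string made only of well-formed jamo syllables."""
    key = []
    for match in SYLLABLE.finditer(text):
        key.extend(WEIGHTS[ch] for ch in match.group())
        if terminate:
            key.append(TERMINATOR)
    return key


GA_GA = "\u1100\u1161\u1100\u1161"  # two syllables: GA, GA
GAG = "\u1100\u1161\u11A8"          # one syllable: GAG

# Without a terminator, the second syllable's L (301) is compared with T (101).
print(primary_key(GA_GA, terminate=False) < primary_key(GAG, terminate=False))
# False
# With a terminator after every syllable, GA GA sorts before GAG.
print(primary_key(GA_GA) < primary_key(GAG))
# True
```

The second comparison gives the syllable-by-syllable order, in which GA sorts before GAG regardless of what follows it.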

In a sense, this preprocessing (which is not a part of any Unicode
normalization) is similar to Thai/Lao reordering. Anyway, I'm hoping that
the normalization tailoring currently under review will be approved so
that we'll be able to represent/deal with Korean script in Unicode in
a way that is more in line with what the inventors of the script envisioned
in the 15th century than we can now.

Jungshik

[1] UTS #10 can mention that if the repertoire of Hangul cluster jamos
is known a priori, the preprocessing can be avoided by a tailoring in
which all cluster jamos in the repertoire are contracted and assigned
independent primary weights that interleave with those of basic Jamos.
This is rather similar to what it mentions about a possible shortcut that
can be taken for Hangul precomposed syllables when no Hangul Jamo is
present in the repertoire.
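
One way such interleaved weights could be derived for a known repertoire is to rank every vowel unit by its decomposed key and hand out one weight per unit. This is only a sketch in Python: the repertoire, the unit names and the numbering scheme are assumptions for illustration, and the resulting values would still have to stay between the T and L weight ranges.

```python
# Illustrative weights for basic vowels (from this message).
BASIC = {"\u1161": 201, "\u1163": 231, "\u1169": 251}  # A, YA, O

# Hypothetical known repertoire of vowel units, each spelled in basic jamos.
REPERTOIRE = {
    "A": "\u1161",
    "YA": "\u1163",
    "O": "\u1169",
    "WA (O+A)": "\u1169\u1161",
    "O+YA": "\u1169\u1163",
}


def decomposed_key(basic_jamos):
    """Primary weights of a unit spelled out in basic jamos."""
    return [BASIC[ch] for ch in basic_jamos]


# Rank every unit by its decomposed key, then assign one weight per unit.
ranked = sorted(REPERTOIRE, key=lambda name: decomposed_key(REPERTOIRE[name]))
TAILORED = {name: 200 + 10 * i for i, name in enumerate(ranked)}

print(ranked)    # ['A', 'YA', 'O', 'WA (O+A)', 'O+YA']
print(TAILORED)  # {'A': 200, 'YA': 210, 'O': 220, 'WA (O+A)': 230, 'O+YA': 240}
```

Adding a new cluster to REPERTOIRE means rerunning this step, which is the retailoring cost discussed in the message.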


This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 19:10:57 EDT