Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (mark.davis@jtcsv.com)
Date: Fri May 16 2003 - 19:20:55 EDT

Next message: Kenneth Whistler: "Re: character groupings in various languages"

Previous message: Ben Dougall: "Re: character groupings in various languages"
In reply to: Jungshik Shin: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Next in thread: Jungshik Shin: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Reply: Jungshik Shin: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"

The following message goes in lots of different directions, so let me
try to summarize here.

1. For the "precomposed" jamos, there are two solutions.

Suppose we have:

U+1105 (ᄅ: HANGUL CHOSEONG RIEUL) => X
U+1112 (ᄒ: HANGUL CHOSEONG HIEUH) => Y

a. decompose them.

U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X Y

b. interleave them and treat their constituent sequences as
contractions.

U+111A (ᄚ: HANGUL CHOSEONG RIEUL-HIEUH) => X'
U+1105 (ᄅ: HANGUL CHOSEONG RIEUL), U+1112 (ᄒ: HANGUL CHOSEONG HIEUH)
=> X'

For b, you need to do both steps, so that you get the same results
either way.
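
As a rough sketch of why (b) needs both steps, the Python below builds
two toy primary-weight tables, one for (a) and one for (b), and checks
that U+111A and the sequence U+1105 U+1112 get the same key under each.
The numeric weights, the table names, and the primary_key helper are
all made up for illustration; none of them come from the UCA tables or
from ICU.

```python
# Hypothetical primary weights (illustration only, not real UCA values).
X, Y = 306, 319      # RIEUL, HIEUH
X_PRIME = 307        # weight interleaved among the basic L weights

# (a) the cluster jamo expands into the weights of its parts.
TABLE_A = {
    "\u1105": [X],
    "\u1112": [Y],
    "\u111A": [X, Y],            # expansion
}

# (b) the cluster jamo gets its own weight, and the equivalent
# sequence is contracted to that same weight.
TABLE_B = {
    "\u1105": [X],
    "\u1112": [Y],
    "\u111A": [X_PRIME],         # step 1: interleaved weight
    "\u1105\u1112": [X_PRIME],   # step 2: contraction
}


def primary_key(text, table):
    """Build a primary key, preferring the longest matching table entry."""
    max_len = max(len(k) for k in table)
    key, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return key


precomposed = "\u111A"
sequence = "\u1105\u1112"

key_a = (primary_key(precomposed, TABLE_A), primary_key(sequence, TABLE_A))
key_b = (primary_key(precomposed, TABLE_B), primary_key(sequence, TABLE_B))

# Doing only step 1 of (b) leaves the two spellings with different keys.
TABLE_B_HALF = {k: v for k, v in TABLE_B.items() if k != "\u1105\u1112"}
key_half = (primary_key(precomposed, TABLE_B_HALF),
            primary_key(sequence, TABLE_B_HALF))
```

With these toy tables, both spellings agree under (a) and under the
full (b), but not when the contraction is left out.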

> What I don't like is the inflexibility of having to collect all the
> known occurrences of cluster Jamos and giving each of them the primary
> weight in such a way (interleaving) that they can get collated the way
> expected by (South) Koreans

Yes, (a) is my preference as well, as I stated. It is more flexible,
since it works for any repertoire. It may or may not yield longer sort
keys, depending on whether the sort keys are compressed or not (as in
ICU). The issue is that a small set of characters will compress
better, even if the starting weight sequences are longer.

> With T < V < L, why would we need to terminate L+, V+ and T+ separately
> instead of just 'L+V+T*' as a whole? Or am I missing something obvious?

2. No, that is the purpose of the ordering of those weights. For the
syllable weighting, if you order T < V < L, you don't have to
terminate a sequence of T's or a sequence of V's or a sequence of L's,
just the whole syllable.
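
A minimal sketch of that arrangement, assuming a single syllable
terminator weight placed below every T: the weights for KIYEOK, A and
final KIYEOK are the ones used in the quoted message further down, while
TERM, the weight for NIEUN, and the helper names are hypothetical. The
class test simply uses the code point ranges of the Hangul Jamo block,
and only jamos are handled.

```python
TERM = 50  # hypothetical syllable terminator, lower than any T weight

# Primary weights ordered T < V < L.
WEIGHTS = {
    "\u1100": 301,  # L: CHOSEONG KIYEOK
    "\u1102": 302,  # L: CHOSEONG NIEUN (hypothetical weight)
    "\u1161": 201,  # V: JUNGSEONG A
    "\u11A8": 101,  # T: JONGSEONG KIYEOK
}


def jamo_class(ch):
    """Return 0 for L, 1 for V, 2 for T, or None for anything else."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return 0
    if 0x1160 <= cp <= 0x11A7:
        return 1
    if 0x11A8 <= cp <= 0x11FF:
        return 2
    return None


def primary_key(text, terminate=True):
    """Primary key with one terminator per syllable, none per subcluster."""
    key = []
    for i, ch in enumerate(text):
        key.append(WEIGHTS[ch])
        cls = jamo_class(ch)
        nxt = jamo_class(text[i + 1]) if i + 1 < len(text) else None
        # A syllable ends at a V or T that is followed by the end of the
        # string, a non-jamo, or a jamo of an earlier class.
        if terminate and cls in (1, 2) and (nxt is None or nxt < cls):
            key.append(TERM)
    return key


ga_na = "\u1100\u1161\u1102\u1161"   # L V L V (two syllables)
gak = "\u1100\u1161\u11A8"           # L V T   (one syllable)
ga_a = "\u1100\u1161\u1161"          # L V V   (one syllable)
```

Here the terminator after the first syllable of ga_na compares against
the T of gak, so the shorter first syllable wins; without the terminator
the following L would compare against the T instead.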

3. You don't need to terminate the whole syllable, IF you are willing
to have all sequences of L's contract (as in 1b). What you do there is
contract all Ls, and then weight L <... TRAILING_WEIGHT < T < V. In
that case, all syllables are of the form L V+ T*. For any sequences of
V and T, the ordering works. Any L or nonJamo will sort less than T
and V, so

L V1 X
L V1 V

sort in that order for any L, V1, V, and X = non-Jamo or L. Same with T. However, the
disadvantage of this approach is that you do have to contract all of
the possible sequences of L's that you care about.

You don't have to contract sequences of T's or V's, since those are
taken care of by the T < V weighting. E.g.

L V1 X
L V1 T
L V1 V
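
The following sketch illustrates point 3 under assumed weights: every L
(and the stand-in non-Jamo "a") sorts below every T, and every T below
every V, with no terminator at all. All numbers, the table names, and
the primary_key helper are hypothetical. It also shows what goes wrong
for a sequence of L's when its contraction is missing.

```python
# Hypothetical primary weights with non-Jamo < L < T < V.
TABLE = {
    "a": [10],                # stand-in for any non-Jamo
    "\u1100": [101],          # L: CHOSEONG KIYEOK
    "\u1105": [106],          # L: CHOSEONG RIEUL
    "\u1105\u1112": [107],    # contraction: RIEUL + HIEUH
    "\u1112": [119],          # L: CHOSEONG HIEUH
    "\u11A8": [301],          # T: JONGSEONG KIYEOK
    "\u1161": [401],          # V: JUNGSEONG A
}

# Same table without the contraction of the L sequence.
TABLE_NO_CONTRACTION = {k: v for k, v in TABLE.items() if len(k) == 1}


def primary_key(text, table):
    """Build a primary key, preferring the longest matching table entry."""
    max_len = max(len(k) for k in table)
    key, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                key.extend(table[chunk])
                i += n
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return key


lv = "\u1100\u1161"
lv_x = lv + "\u1100\u1161"   # L V1 X, with X = L (start of next syllable)
lv_t = lv + "\u11A8"         # L V1 T
lv_v = lv + "\u1161"         # L V1 V

rieul_a = "\u1105\u1161"             # RIEUL + A
cluster_a = "\u1105\u1112\u1161"     # RIEUL + HIEUH + A
```

With the contraction in place the cluster sorts after plain RIEUL;
without it, the second L compares against the V and the cluster sorts
first.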

> enumerating all equivalent sequences but just giving primary weights to
> only 'basic' Jamos and requiring a preprocessing in which cluster jamos
> are decomposed into sequences of basic Jamos.

Preprocessing (on a string basis) is *deadly* for performance. It is
also not necessary. The weight tables already allow characters to
expand; that is what would be done in this case. It is just 1a above.
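
To make the expansion concrete, here is a small sketch that reuses the
S1, S2, S3 example and the primary weights from the quoted message
below. The only change between the two tables is that U+116A expands to
the weights of U+1169 and U+1161 instead of carrying its own weight; the
strings themselves are never rewritten. The table and function names
are hypothetical.

```python
# Primary weights as given in the quoted message.
WEIGHTS = {
    "\u1100": [301],  # CHOSEONG KIYEOK
    "\u1161": [201],  # JUNGSEONG A
    "\u1163": [231],  # JUNGSEONG YA
    "\u1169": [251],  # JUNGSEONG O
    "\u116A": [255],  # JUNGSEONG WA, with its own weight
    "\u11A8": [101],  # JONGSEONG KIYEOK
}

# Same table, except that WA expands to the weights of O + A.
EXPANDING = dict(WEIGHTS)
EXPANDING["\u116A"] = [251, 201]


def primary_key(text, table):
    """Look up each character; an entry may contribute several weights."""
    return [w for ch in text for w in table[ch]]


S1 = "\u1100\u1169\u11A8"         # KIYEOK, O, KIYEOK
S2 = "\u1100\u116A\u11A8"         # KIYEOK, WA, KIYEOK
S3 = "\u1100\u1169\u1163\u11A8"   # KIYEOK, O, YA, KIYEOK

plain_order = sorted([S3, S2, S1], key=lambda s: primary_key(s, WEIGHTS))
expanded_order = sorted([S3, S2, S1], key=lambda s: primary_key(s, EXPANDING))
```

The lookup stays per character in both cases, so the expansion costs a
longer table entry rather than a pass over the input string.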

Märk Dāvĭs
________
mark.davis@jtcsv.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "Jungshik Shin" <jshin@mailaps.org>
To: "Unicode Mailing List" <unicode@unicode.org>
Sent: Friday, May 16, 2003 15:33
Subject: Re: Proposed Update of UTS #10: Unicode Collation Algorithm

> On Mon, 12 May 2003, Mark Davis wrote:
>
> Thank you for your detailed reply.
>
> > > are not listed in 7.1.4). However, it seems to me that some of
> > > these customizations/tailoring in 7.1.4 are not necessary if an
> > > additional step of preprocessing (in which cluster jamos are
> > > decomposed into sequences of basic jamos) is taken as was proposed
> > > by Kent in his paper in 2001-2002.
> >
> > It is also a question of cost. Rearranging the weights so that
> > T < V < L doesn't cost anything in implementations of the algorithm.
>
> Yes, I'm also thinking in terms of cost and flexibility. I have
> no objection to rearranging the weights so that T < V < L and didn't
> express any in my previous message. That's a very good idea.
>
> What I don't like is the inflexibility of having to collect all the
> known occurrences of cluster Jamos and giving each of them the primary
> weight in such a way (interleaving) that they can get collated the way
> expected by (South) Koreans. When a new cluster jamo is added to the
> repertoire, it's likely that tailoring has to be made again. It wouldn't
> cost anything at run-time, but it costs something to retailor
> them. Because it's rare that we have to add new clusters, this may not
> be a realistic concern. Still I find it rather inelegant and not in line
> with the basic principles of Korean script that its inventors had in
> mind.
>
>
> > Terminating each of the subclusters would.
>
> With T < V < L, why would we need to terminate L+, V+ and T+ separately
> instead of just 'L+V+T*' as a whole? Or am I missing something obvious?
>
> > > As for condition B.2 in 7.1.4, an alternative to that is just adding
> > > a terminator primary weight to only Hangul syllables without optional
> > > T('s). This terminator primary weight should be less than the primary
> > > weight for any Ts (and that of any V's and Ls by condition A.)
>
> I was wrong. Any syllable, with or without optional T('s), has
> to be terminated.
>
> > > As for condition B.1.a, I'm wondering why only L's are mentioned.
> > > The same (contraction) should be applied to multiple V's and T's as well.
> > > In addition, in the paragraph that begins with
> >
> > 2. The same goes for:
> >
> > L V T
> > L V V
> >
> > With all V's greater than all T's, then any sequences that are equal
> > up to the T/V comparison will take the right ordering.
>
> Well, I might not have been very clear that I wasn't so much
> concerned with the handling of inter-syllable (or Hangul syllable followed
> by non-Hangul) issues as with intra-syllable (or more precisely,
> 'inter-vowel', 'inter-leading consonants', and 'inter-trailing
> consonants') issues because the former is already well taken care of
> by one prescription or another suggested in the draft.
>
> What I was questioning was why the *contraction* (that should be
> applied to sequences of V's and T's as well as sequences of L's) is
> mentioned only about sequences of L's. For instance, suppose that
> we have three sequences LV1, LV2, and LV1V4 where V2 is a cluster of
> V1 and V3 and that the desired collation among them is LV1 < LV2 (=
> LV1V3) < LV1V4. Without contracting V1V4 and giving it an independent
> primary weight larger than that of V2, they'd be sorted LV1 < LV1V4 <
> LV2, instead. As a real example, consider the following three sequences.
>
> S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
>     U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
> S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
>     U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
> S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
>     U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
>
> With the primary weights of each Jamo given as follows,
>
> U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
> U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
> U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
> U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
> U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
> U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101
>
> their primary weight sequences will be [301,251,101], [301,255,101] and
> [301,251,231,101], respectively and they'll be sorted S1 < S3 < S2 instead
> of the correct S1 < S2 < S3 if there's no contraction applied to the 'U+1169
> (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL JUNGSEONG YA)' sequence.
> By contracting 'U+1169 (ㅗ:HANGUL JUNGSEONG O) U+1163 (ㅑ:HANGUL
> JUNGSEONG YA)' and giving it an independent primary weight larger than
> that of U+116A (ㅘ:HANGUL JUNGSEONG WA) 255 (say, 257), they will be
> sorted S1 < S2 < S3.
>
> However, we can avoid this entirely if we just decompose the cluster vowel
> 'U+116A (ㅘ:HANGUL JUNGSEONG WA)' to 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
> U+1161 (ㅏ:HANGUL JUNGSEONG A)' and do not give a primary weight to
> it. Then we have <301, 251, 101>, <301, 251, 201, 101> and <301, 251,
> 231, 101>, which leads them to collate S1 < S2 < S3 as desired.
>
>
> < a good explanation about *inter-syllable* issues snipped >
>
> > > For condition B.1.a, this means that if L1 has a primary......
> > >
> > > I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and
> > > LiLk where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
> > > be applied to cases like the one involving U+1105 (ᄅ : HANGUL
> > > CHOSEONG RIEUL), the sequence of U+1105 (ᄅ : HANGUL CHOSEONG RIEUL)
> > > and U+1106 (ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A (ᄚ : HANGUL
> > > CHOSEONG RIEUL-HIEUH).
> >
> > L1, L2 are simply variables standing for particular L's; the only
> > reason for that is to stress where they are equal in two different
> > cases. So it is just a terminology difference from Li, Lj.
>
> 'L1L1' would be interpreted as two identical Ls in a row (doublet of
> L1). My point is that they can be different as well (see the example
> given above). Using 'LiLk' (or L1L3 if you prefer) instead of 'L1L1'
> makes it clear, doesn't it?
>
> > > Another missing part in my eyes is as to how to deal with U+111A (ᄚ :
> > > HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105 (ᄅ : HANGUL
> > > CHOSEONG RIEUL) and U+1112 (ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
> > > should be treated identically, but UTS 10 (draft) is rather silent on
> > > that perhaps deferring to tailorings.
> >
> > I agree that longer sequences should expand in weights to be
> > equivalent, and that this should be done in the UCA. As I said, it is
> > just taking a while working with WG20*, and in the meantime people
> > need to tailor it.
>
> Thanks again for your effort to put things into order in cooperation
> with WG20 and I hope WG20 will be able to work together with the UTC
> about this issue soon.
>
> IMHO, the most elegant (not necessarily the most efficient and
> sound from the engineering point of view [1]) way to do it is not
> enumerating all equivalent sequences but just giving primary weights to
> only 'basic' Jamos and requiring a preprocessing in which cluster jamos
> are decomposed into sequences of basic Jamos. As mentioned above, in
> addition to this, primary weights are assigned to satisfy the condition
> that L > V > T > [syl_terminator], which is already listed in the draft.
>
> In a sense, this preprocessing (which is not a part of any Unicode
> normalization) is similar to Thai/Lao reordering. Anyway, I'm hoping that
> the normalization tailoring currently under review will be approved so
> that we'll be able to represent/deal with Korean script in Unicode in
> a way that is more in line with what inventors of the script envisioned
> in the 15th century than we can now.
>
> Jungshik
>
>
> [1] UTS #10 can mention that if the repertoire of Hangul cluster jamos
> is known a priori, the preprocessing can be avoided by a tailoring in
> which all cluster jamos in the repertoire are contracted and assigned
> independent primary weights that interleave with basic Jamos. This is
> rather similar to what it mentions about a possible shortcut that can be
> taken about Hangul precomposed syllables when no Hangul Jamo is present
> in the repertoire.
>
>
>


This archive was generated by hypermail 2.1.5 : Fri May 16 2003 - 20:06:58 EDT