Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat May 17 2003 - 21:48:51 EDT


On Sun, 18 May 2003, Philippe Verdy wrote:

Before going further, I want to make it clear that the following
discussion applies only to scheme 1b in Mark's message sent
2003-05-16 16:20:55 UTC -0700. With scheme 1a, all these complexities
and inflexibilities can be avoided, which is why I prefer 1a to 1b.

> > From: "Jungshik Shin" <jshin@mailaps.org>
> > > > > S1 => 3301; 2251; 1101; TERM
> > > > > S2 => 3301; 2255; 1101; TERM
> > > > > S3 => 3301; 2251; 2231; 1101; TERM
> > >
> > > > So we already have S1 < S2 < S3 appropriately.
> > >
> > > I must be missing something very fundamental and obvious. Could you
> > > kindly explain to me why the above yields S1 < S2 < S3 instead of S1 <
> > > S3 < S2? 3301, 2251, 2255, 1101, 2231 are all primary weights. The first
> > > character has the same primary weight in S1, S2 and S3. However, in the
> > > second character, S2 has a larger primary weight (2255) than S3 (2251)
> > > so that S3 < S2 instead of S2 < S3. Could you tell me where I went wrong
> > > in this reasoning?
>
> Oh sorry, you're right this time (though it's true that there was an
> error in the third primary weight of S3, which was 1231 instead of 2231
> in your initial message, probably just a typo).

Yes, it's just a typo, though it was Mark's rather than mine. Those
weights were not given by me but by him. I already corrected that typo
in my reply to his message yesterday.

> So you're right here: to have S1<S3<S2, the following weights are enough:
>
> > S1 => 3301; 2251; 1101; TERM
> > S2 => 3301; 2255; 1101; TERM
> > S3 => 3301; 2251; 2231; 1101; TERM
>
> To have S1<S2<S3, you need to tailor the group of medial V+ letters so
> that they collate as a single unit with a custom weight for the composite
> sequence (for example the group 2251;2231 in S3 should be collated as
> a single 2299 weight):

That's what I've been saying from the very beginning of this
thread, but somehow I've been unsuccessful in getting it through to Mark.
Not only L+ but also V+ and T+ have to be contracted and assigned
interleaving weights. So, I began to wonder if I had missed something.
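
To make the contraction idea concrete, here is a rough sketch in Python.
It uses Philippe's example of collapsing the pair 2251;2231 into the
single weight 2299. The function name contract, the table CONTRACTIONS,
and the TERM = 0 encoding are hypothetical. A real tailoring would
contract character sequences rather than weights, so this only
illustrates the effect on the ordering.

```python
TERM = 0  # hypothetical terminator, lower than any real weight

# Hypothetical contraction table: the pair of weights maps to one
# weight that interleaves above 2255 (Philippe's 2299 example).
CONTRACTIONS = {(0x2251, 0x2231): 0x2299}

def contract(weights, table=CONTRACTIONS):
    """Replace each listed adjacent pair of weights by its single weight."""
    out = []
    i = 0
    while i < len(weights):
        pair = tuple(weights[i:i + 2])
        if pair in table:
            out.append(table[pair])
            i += 2  # both weights of the pair are consumed
        else:
            out.append(weights[i])
            i += 1
    return out

keys = {
    "S1": [0x3301, 0x2251, 0x1101, TERM],
    "S2": [0x3301, 0x2255, 0x1101, TERM],
    "S3": [0x3301, 0x2251, 0x2231, 0x1101, TERM],
}

order = sorted(keys, key=lambda name: contract(keys[name]))
print(order)  # ['S1', 'S2', 'S3']
```

The catch is visible in the table itself: every V+ (and L+, T+) sequence
that should sort this way needs its own entry and its own weight.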

> But is there a way to compute a weight for all L+ combinations in
> that case, or does it apply only to a known repertoire of L+ sequences
> that need a tailoring rule for special ordering ?

> Or is what you really
> need based on the length of L+ sequences, instead of just the individual
> weights of medial L vowels, so that longer L+ sequences are always sorted
> after shorter ones ?

This inflexibility (of having to know the repertoire a priori and to
assign them weights in such a way that they can collate as desired) is
exactly why I don't like scheme 1b in Mark's message. With scheme
1a in Mark's message (which I also suggested in my messages from the
start), there is no such problem at all. All the assigned
precomposed/cluster L's, V's and T's
will be expanded [1] to sequences of collation elements. For instance,
U+111A (Hangul Choseong RieulPieup) will be expanded to a sequence
of collation elements made of the collation element for U+1105(Hangul
Choseong Rieul) and that for U+1107(Hangul Choseong Pieup). This way,
cluster Jamos, whether they're assigned codepoints or not, will be
treated exactly the same way.
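
A small sketch of that expansion in Python follows. The mapping of
U+111A to U+1105 plus U+1107 is copied from the example above as
written and is not checked here against the Unicode names list. The
names EXPANSIONS, primary_weight and collation_elements are
hypothetical, and using the code point as the primary weight is only a
placeholder, not a real collation table.

```python
# Hypothetical expansion table: cluster jamo -> basic jamo sequence,
# with the single entry taken from the example in this message.
EXPANSIONS = {"\u111A": "\u1105\u1107"}

def primary_weight(ch):
    # Placeholder assumption: the code point stands in for the weight.
    return ord(ch)

def collation_elements(text, expansions=EXPANSIONS):
    """Expand cluster jamos, then map each basic jamo to its weight."""
    out = []
    for ch in text:
        for basic in expansions.get(ch, ch):
            out.append(primary_weight(basic))
    return out

precomposed = collation_elements("\u111A")       # assigned cluster
sequence = collation_elements("\u1105\u1107")    # explicit sequence
print(precomposed == sequence)  # True
```

Because the expansion happens while producing collation elements, no
separate pass over the input string is needed.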

Jungshik

[1] My original suggestion was to decompose them into sequences of basic
Jamos at the preprocessing stage. Mark pointed out that string-level
preprocessing is deadly for performance. (I'm wondering, then, what
difference there is between the normalization in step S1.1 of UTS #10
and what I suggested, aside from the fact that mine is not a part of
Unicode NFD.) Anyway, the result is identical.
