Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon May 12 2003 - 16:52:03 EDT


> are not listed in 7.1.4). However, it seems to me that some of these
> customizations/tailoring in 7.1.4 are not necessary if an
> of preprocessing (in which cluster jamos are decomposed into sequences of
> basic jamos) is taken as was proposed by Kent in his paper in 2001-2002.

It is also a question of cost. Rearranging the weights so that T < V <
L doesn't cost anything in implementations of the algorithm.
Terminating each of the subclusters would.

> As for condition B.2 in 7.1.4, an alternative to that is just adding
> a terminator primary weight to only Hangul syllables without optional
> T('s). This terminator primary weight should be less than the primary
> weight for any Ts (and that of any V's and Ls by condition A.)
>
> As for condition B.1.a, I'm wondering why only L's are mentioned.
> The same (contraction) should be applied to multiple V's and T's as well.
> In addition, in the paragraph that begins with
> In addition, in the paragraph that begins with

1. If you reorder so that all T < V < L, then when you get two sequences:

L V
L L

and the initial L's are equal, the second always sorts greater.

2. The same goes for:

L V T
L V V

With all V's greater than all T's, any sequences that are equal
up to the T/V comparison will take the right ordering.
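
A minimal sketch of points 1 and 2, assuming made-up primary weight bands
(the numbers and the helper names weight() and key() are hypothetical, not
actual UCA values or API). It only shows that a plain left-to-right
comparison of primary weights gives the orderings described once every
T < every V < every L:

```python
# Hypothetical primary weight bands, chosen so every T < every V < every L.
# Indices passed to weight() are assumed to stay below 100 so bands never overlap.
T_BASE, V_BASE, L_BASE = 100, 200, 300
BASE = {"T": T_BASE, "V": V_BASE, "L": L_BASE}

def weight(kind, index=0):
    """Primary weight of the index-th jamo of the given kind ('L', 'V' or 'T')."""
    return BASE[kind] + index

def key(*jamos):
    """Primary sort key: the tuple of weights, compared left to right."""
    return tuple(weight(kind, index) for kind, index in jamos)

# Point 1: with equal initial L's, "L L" sorts after "L V" for any L and V.
lv = key(("L", 0), ("V", 0))
ll = key(("L", 0), ("L", 5))
assert lv < ll

# Point 2: with equal "L V" prefixes, "L V V" sorts after "L V T".
lvt = key(("L", 0), ("V", 0), ("T", 9))
lvv = key(("L", 0), ("V", 0), ("V", 0))
assert lvt < lvv
```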

3. The problem is then only with sequences like:

L V X
L V T

If X is not a Jamo, or starts a new syllable, then you have to make
sure that X is always less than T. There are two ways to do this:

3a. terminate every syllable.
3b. make V & T higher than all X (including L).

#3b involves reversing #1; instead of ordering L higher, you have to
make it lower. In that case, all multi-L sequences have to contract
in order to get the right ordering.

I originally favored #3b, but after considering the different factors,
I now believe that #3a is better overall.
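
To make the problem in #3 and the fix in #3a concrete, here is a small
sketch under the same assumption of invented weight bands (TERM, BASE and
sort_key() are hypothetical names and values). It takes X to be the L that
starts a new syllable, and appends a terminator weight lower than every T
after each syllable:

```python
# Hypothetical weights: TERM < every T < every V < every L.
TERM = 50
BASE = {"T": 100, "V": 200, "L": 300}

def sort_key(syllables, terminate):
    """Primary key for a list of syllables, each a list of (kind, index) jamos.

    With terminate=True, a terminator weight is appended after every syllable (#3a).
    """
    weights = []
    for syllable in syllables:
        weights.extend(BASE[kind] + index for kind, index in syllable)
        if terminate:
            weights.append(TERM)
    return tuple(weights)

lv_then_lv = [[("L", 0), ("V", 0)], [("L", 0), ("V", 0)]]  # "L V" + "L V"
lvt = [[("L", 0), ("V", 0), ("T", 0)]]                      # "L V T"

# Without termination the following L outweighs T, so "L V" + X sorts
# after "L V T", which is the problem described in #3.
assert sort_key(lv_then_lv, terminate=False) > sort_key(lvt, terminate=False)

# With a terminator below every T, the shorter syllable sorts first.
assert sort_key(lv_then_lv, terminate=True) < sort_key(lvt, terminate=True)
```

The cost is one extra weight per syllable in every key, which is the
trade-off against #3b.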

>
> For condition B.1.a, this means that if L1 has a primary......
>
> I think 'L1', 'L2' and 'L1L1' have to be replaced by Li, Lj, and LiLk
> where w(Li) < w(Lj). With that change, it's clear that B.1.a. can
> be applied to cases like the one involving U+1105 (ᄅ : HANGUL
> CHOSEONG RIEUL), the sequence of U+1105(ᄅ : HANGUL CHOSEONG RIEUL)
> and U+1106(ᄆ : HANGUL CHOSEONG MIEUM) [1] and U+111A(ᄚ : HANGUL
> CHOSEONG RIEUL-HIEUH).

L1, L2 are simply variables standing for particular L's; the only
reason for that is to stress where they are equal in two different
cases. So it is just a terminology difference from Li, Lj.

>
> Another missing part in my eyes is as to how to deal with U+111A(ᄚ :
> HANGUL CHOSEONG RIEUL-HIEUH) and the sequence of U+1105(ᄅ : HANGUL
> CHOSEONG RIEUL) and U+1112(ᄒ: HANGUL CHOSEONG HIEUH). IMO, they
> should be treated identically, but UTS 10(draft) is rather silent on
> that perhaps deferring to tailorings.

I agree that longer sequences should expand in weights to be
equivalent, and that this should be done in the UCA. As I said, it is
just taking a while working with WG20*, and in the meantime people
need to tailor it.
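
As a rough illustration of what such a tailoring could look like, the
sketch below expands U+111A into the weights of U+1105 followed by U+1112,
so the two get identical primary keys. The weight values, the tables and
primary_key() are all assumptions made for this example; they are not taken
from the UCA data files:

```python
# Hypothetical primary weights for three leading consonants (not real UCA values).
L_WEIGHT = {
    "\u1105": 0x3000,  # HANGUL CHOSEONG RIEUL
    "\u1106": 0x3010,  # HANGUL CHOSEONG MIEUM
    "\u1112": 0x3020,  # HANGUL CHOSEONG HIEUH
}

# Cluster jamo mapped to the sequence of basic jamos it should be equivalent to.
EXPANSION = {
    "\u111A": "\u1105\u1112",  # RIEUL-HIEUH -> RIEUL + HIEUH
}

def primary_key(text):
    """Primary key in which a cluster jamo expands to its basic jamos' weights."""
    weights = []
    for ch in text:
        for basic in EXPANSION.get(ch, ch):
            weights.append(L_WEIGHT[basic])
    return tuple(weights)

# The precomposed cluster and the two-jamo sequence now compare as equal.
assert primary_key("\u111A") == primary_key("\u1105\u1112")
```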

>
>
> > Thanks for bringing this interleaving issue up; we should add a
> > description to section 7.1.4.
>
> That will be nice.
>
>
> [1] I'm not making up these sequences. MS Office XP and Uniscribe support
> this sequence (see
> http://www.microsoft.com/typography/otfntdev/hangulot/appen.htm).
> PARK Won-kyu with my help also has developed a GPL'd OpenType font
> that supports this sequence along with many others (and will release
> a few more). There's a Mozilla patch to support them across platforms
> and a Pango patch was/is being made.

We know that: see (*) above.
