Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon May 19 2003 - 18:52:24 EDT

Next message: Allen Haaheim: "Re: Decimal separator with more than one character?"

Previous message: Andrew C. West: "RE: Decimal separator with more than one character?"
In reply to: Jungshik Shin: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Next in thread: Philippe Verdy: "Re: Proposed Update of UTS #10: Unicode Collation Algorithm"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

My apologies; I jotted off the note quickly, and didn't read your
response carefully enough (and then I was out of town and couldn't
address this right away).

> To take the same example as I took in my previous email, I don't see
> how S1, S2 and S3 could be sorted S1 < S2 < S3 (instead of S1 < S3 < S2)
> without contracting the sequence of 'U+1169 (ㅗ:HANGUL JUNGSEONG O)
> U+1163 (ㅑ:HANGUL JUNGSEONG YA)'?
>
> S1: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
>     U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
> S2: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+116A (ㅘ:HANGUL JUNGSEONG WA)
>     U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)
> S3: U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) U+1169 (ㅗ:HANGUL JUNGSEONG O)
>     U+1163 (ㅑ:HANGUL JUNGSEONG YA) U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK)

Let me recap. As I said, we have strategy (a):

>> a. decompose them.
>>
>> U+111A(ᄚ HANGUL CHOSEONG RIEUL-HIEUH) => X Y
>>
...
>> Yes, (a) is my preference as well, as I stated. It is more flexible,
>> since it works for any repertoire. It may or may not yield longer sort
>> keys, depending on whether the sort keys are compressed or not (as in
>> ICU). The issue is that a small set of characters will compress
>> better, even if the starting weight sequences are longer.

So let's look at your example, where strategy (a) is applied. There is
no need for subcluster terminators. The characters have the following
weights (this is where I blundered, since you had already preweighted
them):

  U+1100 (ᄀ:HANGUL CHOSEONG KIYEOK) : 301
  U+1161 (ㅏ:HANGUL JUNGSEONG A) : 201
  U+1163 (ㅑ:HANGUL JUNGSEONG YA) : 231
  U+1169 (ㅗ:HANGUL JUNGSEONG O) : 251
  U+116A (ㅘ:HANGUL JUNGSEONG WA) : 255
  U+11A8 (ㄱ:HANGUL JONGSEONG KIYEOK) : 101

The goal is: S1 < S2 < S3

Applying strategy (a), if the weight for WA is expanded (as per the
old compat mappings in UnicodeData-2.0.txt, where 116A => 1169 1161),
then we get:

U+116A (ㅘ:HANGUL JUNGSEONG WA) : 251, 201*

[*Now, one may want the character to expand to a sequence that differs
from this one at the primary, secondary, or tertiary level, but for now
I'll just assume that identity is ok.]

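You then get the following sort keys. They give the desired ordering,
since the first difference is at the third weight, where 101 < 201 < 231: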

S1: 301, 251, 101, TERM
S2: 301, 251, 201, 101, TERM
S3: 301, 251, 231, 101, TERM
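
As a quick check of the keys above, here is a minimal Python sketch that
builds them from the weight table in this message and sorts the three
strings, once with WA left as a single weight (255) and once with WA
expanded to the weights of O followed by A. The names (WEIGHTS, EXPANDED,
sort_key, order) are made up for illustration, and TERM = 0 is an
assumption; the message only says that a terminator is appended. This
models a single weight level only, not the full algorithm.

```python
# Weights copied from the table in this message (one level only).
WEIGHTS = {
    0x1100: [301],  # HANGUL CHOSEONG KIYEOK
    0x1161: [201],  # HANGUL JUNGSEONG A
    0x1163: [231],  # HANGUL JUNGSEONG YA
    0x1169: [251],  # HANGUL JUNGSEONG O
    0x116A: [255],  # HANGUL JUNGSEONG WA, not expanded
    0x11A8: [101],  # HANGUL JONGSEONG KIYEOK
}

# Strategy (a): WA gets the weights of O followed by A (251, 201).
EXPANDED = dict(WEIGHTS)
EXPANDED[0x116A] = WEIGHTS[0x1169] + WEIGHTS[0x1161]

TERM = 0  # assumption: the terminator sorts below every real weight


def sort_key(code_points, table):
    """Concatenate the weights of each code point, then append TERM."""
    key = []
    for cp in code_points:
        key.extend(table[cp])
    key.append(TERM)
    return key


STRINGS = {
    "S1": [0x1100, 0x1169, 0x11A8],
    "S2": [0x1100, 0x116A, 0x11A8],
    "S3": [0x1100, 0x1169, 0x1163, 0x11A8],
}


def order(table):
    """Return the string names sorted by their keys under the given table."""
    return sorted(STRINGS, key=lambda name: sort_key(STRINGS[name], table))


print(order(WEIGHTS))   # ['S1', 'S3', 'S2']
print(order(EXPANDED))  # ['S1', 'S2', 'S3']
```

With WA unexpanded, S3 sorts before S2 because 251 < 255 at the second
weight; with the expansion, all three keys share 301, 251 and the third
weight decides.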

In many circumstances one has the option of expanding one character
(in collation weights) or contracting other characters. We have to
look at the combinatorics to see which is better.

What I think did not come through in my previous messages is that the
only difference between (a) and (b) is in their treatment of L
sequences: both expand weights for V's and T's.

The downside of (b) is that one has to have a known repertoire of L
sequences, those that figure into contractions.

Mark

P.S. My email at this address is not working well, so I can't read
some of the recent messages and may not get responses to this right
away.


This archive was generated by hypermail 2.1.5 : Mon May 19 2003 - 19:32:00 EDT