Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Jungshik Shin (jshin@mailaps.org)
Date: Sat May 17 2003 - 21:48:51 EDT


    On Sun, 18 May 2003, Philippe Verdy wrote:

    Before going further, I want to make it clear that the following
    discussion is only relevant to the scheme 1b in Mark's message sent
    2003-05-16 16:20:55 UTC -0700. With the scheme 1a, all these complexities
    and inflexibilities can be avoided, which is why I prefer 1a to 1b.

    > > From: "Jungshik Shin" <jshin@mailaps.org>
    > > > > > S1 => 3301; 2251; 1101; TERM
    > > > > > S2 => 3301; 2255; 1101; TERM
    > > > > > S3 => 3301; 2251; 2231; 1101; TERM
    > > >
    > > > > So we already have S1 < S2 < S3 appropriately.
    > > >
    > > > I must be missing something very fundamental and obvious. Could you
    > > > kindly explain to me why the above yield S1 < S2 < S3 instead of S1 <
    > > > S3 < S2? 3301, 2251, 2255, 1101, 2231 are all primary weights. The first
    > > > character has the same primary weight in S1, S2 and S3. However, in the
    > > > second character, S2 has a larger primary weight (2255) than S3 (2251)
    > > > so that S3 < S2 instead of S2 < S3. Could you tell me where I went wrong
    > > > in this reasoning?
    >
    > Oh sorry, you're right this time (though it's true that there was an error
    > in the third primary weight of S3, which was 1231 instead of 2231 in
    > your initial message, probably just a typo).

      Yes, it's just a typo, though not mine but Mark's. Those weights were not
    given by me but by him. I already corrected that typo in my reply to
    his message yesterday.
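
    A minimal sketch of the primary-level comparison discussed above, for
    illustration only. It compares the quoted weight sequences left to
    right, as in the reasoning quoted above. TERM is given an arbitrary
    placeholder value here (the thread states none), and nothing beyond the
    primary level is modeled.

```python
# Primary weights copied from the quoted S1..S3 lines.
# TERM = 1 is a placeholder assumption; the thread gives no numeric value.
TERM = 1

KEYS = {
    "S1": [3301, 2251, 1101, TERM],
    "S2": [3301, 2255, 1101, TERM],
    "S3": [3301, 2251, 2231, 1101, TERM],
}


def primary_order(keys):
    """Return the names sorted by left-to-right comparison of primary weights."""
    # Python compares lists lexicographically, element by element.
    return sorted(keys, key=lambda name: keys[name])


order = primary_order(KEYS)
print(" < ".join(order))
```

    S1 and S3 first differ at the third weight (1101 vs. 2231), and S3 and
    S2 at the second (2251 vs. 2255), so the order is S1 < S3 < S2.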

    > So you're right here: to have S1<S3<S2, the following weights are enough:
    >
    > > S1 => 3301; 2251; 1101; TERM
    > > S2 => 3301; 2255; 1101; TERM
    > > S3 => 3301; 2251; 2231; 1101; TERM
    >
    > To have S1<S2<S3, you need to tailor the group of medial V+ letters so
    > that they collate as a single unit with a custom weight for the composite
    > sequence (for example the group 2251;2231 in S3 should be collated as
    > a single 2299 weight):

       That's what I've been saying from the very beginning of this
    thread, but somehow I've been unsuccessful in getting it through to Mark.
    Not only L+ but also V+ and T+ have to be contracted and assigned
    interleaving weights. So, I began to wonder if I had missed something
    very obvious about the collation.
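
    A rough sketch of the contraction Philippe describes, applied directly
    to the weight sequences for simplicity (a real tailoring would contract
    the character sequence, not the weights). The 2299 value is his example;
    TERM is again an arbitrary placeholder and contract() is a hypothetical
    helper, not part of any library.

```python
TERM = 1  # placeholder assumption; the thread gives no numeric value

# Philippe's example: the V+ group 2251;2231 collates as one weight, 2299.
CONTRACTIONS = {(2251, 2231): 2299}

KEYS = {
    "S1": [3301, 2251, 1101, TERM],
    "S2": [3301, 2255, 1101, TERM],
    "S3": [3301, 2251, 2231, 1101, TERM],
}


def contract(weights, table):
    """Replace every run of weights listed in `table` by its single weight."""
    longest = max(len(run) for run in table)
    out, i = [], 0
    while i < len(weights):
        # Try the longest candidate run first, down to runs of length 2.
        for n in range(longest, 1, -1):
            run = tuple(weights[i:i + n])
            if run in table:
                out.append(table[run])
                i += n
                break
        else:
            # No contraction starts here; keep the weight unchanged.
            out.append(weights[i])
            i += 1
    return out


contracted = {name: contract(w, CONTRACTIONS) for name, w in KEYS.items()}
order = sorted(contracted, key=lambda name: contracted[name])
print(" < ".join(order))
```

    With the contracted weight 2299 interleaved above 2255, the order
    becomes S1 < S2 < S3. Every cluster that should sort this way needs its
    own entry in the table.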

    > But is there a way to compute a weight for all L+ combinations in
    > that case, or does it apply only to a known repertoire of L+ sequences
    > that need a tailoring rule for special ordering ? Or is what you really
    > need based on the length of L+ sequences, instead of just the individual
    > weights of medial L vowels, so that longer L+ sequences are always sorted
    > after shorter ones ?

      This inflexibility (of having to know the repertoire a priori and to
    assign them weights in such a way that they can collate as desired) is
    exactly why I don't like the scheme 1b in Mark's message. With the scheme
    1a in Mark's message (which I also suggested in my messages from the
    beginning and which Mark also prefers), we don't have to worry about this
    problem at all. All the assigned precomposed/cluster L's, V's and T's
    will be expanded [1] to sequences of collation elements. For instance,
    U+111A (Hangul Choseong Rieul-Hieuh) will be expanded to a sequence
    of collation elements made of the collation element for U+1105 (Hangul
    Choseong Rieul) and that for U+1112 (Hangul Choseong Hieuh). This way,
    cluster Jamos, whether they're assigned codepoints or not, will be
    treated exactly the same way.
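
    A minimal sketch of the expansion idea in scheme 1a, using a different
    cluster, U+1113 (Hangul Choseong Nieun-Kiyeok), as the example. The
    weight values and both table names are made up for illustration and are
    not taken from any real collation table.

```python
# Hypothetical primary weights for two basic choseong (illustrative only).
BASIC_WEIGHT = {
    "\u1100": 10,  # HANGUL CHOSEONG KIYEOK
    "\u1102": 20,  # HANGUL CHOSEONG NIEUN
}

# Expansion table: an encoded cluster jamo maps to its basic jamos.
EXPANSIONS = {
    "\u1113": "\u1102\u1100",  # NIEUN-KIYEOK -> NIEUN + KIYEOK
}


def collation_elements(text):
    """Map text to primary weights, expanding encoded clusters on the fly."""
    out = []
    for ch in text:
        # A cluster contributes one weight per basic jamo; a basic jamo
        # falls through unchanged.
        for basic in EXPANSIONS.get(ch, ch):
            out.append(BASIC_WEIGHT[basic])
    return out


encoded = collation_elements("\u1113")        # cluster with a codepoint
spelled = collation_elements("\u1102\u1100")  # same cluster, spelled out
unencoded = collation_elements("\u1102\u1102\u1100")  # no table entry needed
print(encoded, spelled, unencoded)
```

    The encoded cluster and the spelled-out sequence yield the same
    collation elements, and a cluster with no table entry is handled
    by the same code path.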

      Jungshik

    [1] My original suggestion was to decompose them into sequences of basic
    Jamos at the preprocessing stage. Mark pointed out that preprocessing
    at the string level is deadly for performance. (I'm wondering, then, what
    difference there is between the normalization of S1.1 in UTS #10 and
    what I suggested, aside from the fact that mine is not a part of Unicode
    NFD.) Anyway, the result is identical.


