Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 20:05:02 EDT


    I must be stupid (or too drunk), as I made other errors in replying...

    > From: "Jungshik Shin" <jshin@mailaps.org>
    > > > > S1 => 3301; 2251; 1101; TERM
    > > > > S2 => 3301; 2255; 1101; TERM
    > > > > S3 => 3301; 2251; 2231; 1101; TERM
    > >
    > > > So we already have S1 < S2 < S3 appropriately.
    > >
    > > I must be missing something very fundamental and obvious. Could you
    > > kindly explain to me why the above yield S1 < S2 < S3 instead of S1 <
    > > S3 < S2? 3301, 2251, 2255, 1101, 2231 are all primary weights. The first
    > > character have the same primary weight in S1,S2 and S3. However, in the
    > > second character, S2 has a larger primary weight (2255) than S3(2251)
    > > so that S3 < S2 instead of S2 < S3. Could you tell me where I went wrong
    > > in this reasoning?

    Oh sorry, you're right this time (though your initial message did have an error in the third primary weight of S3, which was 1231 instead of 2231, probably just a typo).

    So you're right here: to get S1<S3<S2, the following weights are enough:

    > S1 => 3301; 2251; 1101; TERM
    > S2 => 3301; 2255; 1101; TERM
    > S3 => 3301; 2251; 2231; 1101; TERM

    To have S1<S2<S3, you need to tailor the group of medial V+ letters so that they collate as a single unit with a custom weight for the composite sequence (for example the group 2251;2231 in S3 should be collated as a single 2299 weight):

    S1 => 3301; 2251; 1101; TERM
    S2 => 3301; 2255; 1101; TERM
    S3 => 3301; 2299; 1101; TERM
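
    A rough Python sketch of that kind of tailoring follows. The helper name apply_contractions and the TERM value of 0 are hypothetical, and for simplicity it rewrites weight pairs directly, whereas a real tailoring would define the contraction on the character sequence itself.

```python
# Assumption: TERM is modeled as 0, lower than every real primary weight.
TERM = 0

# The single contraction from the example: 2251;2231 collates as 2299.
CONTRACTIONS = {(2251, 2231): 2299}

def apply_contractions(weights, contractions=CONTRACTIONS):
    """Replace each known adjacent weight pair with its single custom weight."""
    out = []
    i = 0
    while i < len(weights):
        pair = tuple(weights[i:i + 2])
        if pair in contractions:
            out.append(contractions[pair])
            i += 2  # both weights of the pair are consumed
        else:
            out.append(weights[i])
            i += 1
    return tuple(out)

raw = {
    "S1": (3301, 2251, 1101, TERM),
    "S2": (3301, 2255, 1101, TERM),
    "S3": (3301, 2251, 2231, 1101, TERM),
}
tailored = {name: apply_contractions(w) for name, w in raw.items()}
order = sorted(tailored, key=tailored.get)
print(tailored["S3"])  # (3301, 2299, 1101, 0)
print(order)           # ['S1', 'S2', 'S3']
```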

    But is there a way to compute a weight for all V+ combinations in that case, or does it apply only to a known repertoire of V+ sequences that need a tailoring rule for special ordering? Or does what you really need depend on the length of the V+ sequence, rather than just the individual weights of the medial vowels, so that longer V+ sequences always sort after shorter ones?
    If so, you need a leading weight before the V+ sequence that specifies this length, and this requires another set of weight offsets, so that Weight(length(V+)) is offset by 2000 too:

    S1 => 3301; 2001; 2251; 1101; TERM
    S2 => 3301; 2001; 2255; 1101; TERM
    S3 => 3301; 2002; 2251; 2231; 1101; TERM

    This still works for non-standard syllables such as L+T+, since the length-specifier weight is inserted anyway; in the example below, non-standard L+T+ syllables sort before L+V+T*:
    S4 => 3301; 2000; 1101; TERM
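
    To make the length-specifier idea concrete, here is a small Python sketch. It assumes the syllable has already been split into its L, V and T weight lists; the function name syllable_key, the LENGTH_OFFSET constant and the TERM value of 0 are hypothetical names and values chosen for this illustration.

```python
# Assumptions: TERM is modeled as 0; the vowel-count weight is 2000 + count.
TERM = 0
LENGTH_OFFSET = 2000

def syllable_key(leading, vowels, trailing):
    """Build L weights, then Weight(length(V+)), then V and T weights, then TERM."""
    length_weight = LENGTH_OFFSET + len(vowels)
    return tuple(leading) + (length_weight,) + tuple(vowels) + tuple(trailing) + (TERM,)

keys = {
    "S1": syllable_key([3301], [2251], [1101]),
    "S2": syllable_key([3301], [2255], [1101]),
    "S3": syllable_key([3301], [2251, 2231], [1101]),
    "S4": syllable_key([3301], [], [1101]),  # non-standard L+T+ syllable
}

order = sorted(keys, key=keys.get)
print(keys["S4"])  # (3301, 2000, 1101, 0)
print(order)       # ['S4', 'S1', 'S2', 'S3']
```

    Because the length weight comes before any vowel weight, the number of medial vowels decides the order first, and the individual vowel weights only break ties between sequences of equal length.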



    This archive was generated by hypermail 2.1.5 : Sat May 17 2003 - 20:45:49 EDT