Re: Proposed Update of UTS #10: Unicode Collation Algorithm

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 17 2003 - 19:03:52 EDT


    From: "Jungshik Shin" <jshin@mailaps.org>
    > > > S1 => 3301; 2251; 1101; TERM
    > > > S2 => 3301; 2255; 1101; TERM
    > > > S3 => 3301; 2251; 2231; 1101; TERM
    >
    > > So we already have S1 < S2 < S3 appropriately.
    >
    > I must be missing something very fundamental and obvious. Could you
    > kindly explain to me why the above yield S1 < S2 < S3 instead of S1 <
    > S3 < S2? 3301, 2251, 2255, 1101, 2231 are all primary weights. The first
    > character have the same primary weight in S1,S2 and S3. However, in the
    > second character, S2 has a larger primary weight (2255) than S3(2251)
    > so that S3 < S2 instead of S2 < S3. Could you tell me where I went wrong
    > in this reasoning?

    Oh sorry, you're right this time (though there was an error in the third primary weight of S3, which was 1231 instead of 2231 in your initial message, probably just a typo).
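
    The ordering in the quoted question can be checked with a minimal sketch. It assumes the sort keys are compared weight by weight, left to right, on primary weights only, and it models TERM as 0 (a hypothetical value chosen only to be lower than every real weight; the thread does not give one):

```python
# Assumption: TERM is modelled as 0, i.e. lower than every real primary weight.
TERM = 0

# Primary-weight sort keys exactly as quoted in the message above.
keys = {
    "S1": [3301, 2251, 1101, TERM],
    "S2": [3301, 2255, 1101, TERM],
    "S3": [3301, 2251, 2231, 1101, TERM],
}


def sort_order(keys):
    """Return the key names ordered by element-wise comparison of their weights."""
    # Python compares lists lexicographically: first differing element decides.
    return sorted(keys, key=lambda name: keys[name])


order = sort_order(keys)
print(" < ".join(order))
```

    With these keys the second weight already separates S2 (2255) from S1 and S3 (2251), and the third weight separates S1 (1101) from S3 (2231), which is the reasoning given in the question.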

    So you're right here: to have S1 < S2 < S3 we must have at least the following weights:

    S1 => 3301; 2251; 1101; TERM
    S2 => 3301; 2255; TERM; 1101; TERM
    S3 => 3301; 2251; 2231; 1101; TERM

    As this exception is nearly impossible to generate optimally, it seems simpler to just add TERM after T+ and after L+.
    But I doubt that a TERM is needed after V+:

    S1 => 3301; TERM; 2251; TERM; 1101
    S2 => 3301; TERM; 2255; TERM; 1101
    S3 => 3301; TERM; 2251; 2231; TERM; 1101

    Also, I doubt you then need to add the other constants +1000, +2000, +3000 in this case, because your initial primary weights already satisfy L1(T) > L1(L) > L1(V) > TERM:

    S1 => 301; TERM; 251; TERM; 101
    S2 => 301; TERM; 255; TERM; 101
    S3 => 301; TERM; 251; 231; TERM; 101


