Re: IJ joint in spaced lettering

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Jan 09 2006 - 20:13:32 CST

    ----- Original Message -----
    From: "Kenneth Whistler" <kenw@sybase.com>
    To: <asmusf@ix.netcom.com>
    Cc: <unicode@unicode.org>; <kenw@sybase.com>
    Sent: Tuesday, January 10, 2006 1:38 AM
    Subject: Re: IJ joint in spaced lettering

    > Asmus noted:
    >
    >> (For example, I assume,
    >> but have not verified, that i+j and ij in fact sort the same in the DUCET).
    >
    > 0049 ; [.103C.0020.0008.0049] # LATIN CAPITAL LETTER I
    > 004A ; [.1054.0020.0008.004A] # LATIN CAPITAL LETTER J
    > 0069 ; [.103C.0020.0002.0069] # LATIN SMALL LETTER I
    > 006A ; [.1054.0020.0002.006A] # LATIN SMALL LETTER J
    > 0132 ; [.103C.0020.000A.0132][.1054.0020.000A.0132] # LATIN CAPITAL LIGATURE IJ; QQKN
    > 0133 ; [.103C.0020.0004.0133][.1054.0020.0004.0133] # LATIN SMALL LIGATURE IJ; QQKN
    >
    >
    > <0069, 006A> --> 103C.1054.0020.0020.0002.0002
    > <0133>       --> 103C.1054.0020.0020.0004.0004
    > <0049, 004A> --> 103C.1054.0020.0020.0008.0008
    > <0132>       --> 103C.1054.0020.0020.000A.000A
    >                  ^^^^^^^^^ ^^^^^^^^^ ^^^^^^^^^
    >                  primary   secondary tertiary

    Shouldn't it instead be the following (leading zeros suppressed only for clarity, to avoid line breaks in email)?

    <0069, 006A>
                --> 103C.1054.0.20.20.0.2.2.0.69.6A
    <0133>
                --> 103C.1054.0.20.20.0.4.4.0.133
    <0049, 004A>
                --> 103C.1054.0.20.20.0.8.8.0.49.4A
    <0132>
                --> 103C.1054.0.20.20.0.A.A.0.132

    (Note the added .0. separators between collation levels, which allow the keys to be compared in binary order, and the added trailing level holding the code points themselves as the default tie-breaker, for collation keys of unlimited length.)
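    As a rough illustration of the layout proposed above, here is a minimal Python sketch that builds such keys from the DUCET entries quoted earlier in this message. The names DUCET_EXCERPT, sort_key and format_key are hypothetical, the table covers only the six characters under discussion, and this is not the key format of any particular UCA implementation.

```python
# Collation elements copied from the DUCET excerpt quoted in this message:
# code point -> list of (primary, secondary, tertiary) weights.
DUCET_EXCERPT = {
    0x0049: [(0x103C, 0x0020, 0x0008)],
    0x004A: [(0x1054, 0x0020, 0x0008)],
    0x0069: [(0x103C, 0x0020, 0x0002)],
    0x006A: [(0x1054, 0x0020, 0x0002)],
    0x0132: [(0x103C, 0x0020, 0x000A), (0x1054, 0x0020, 0x000A)],
    0x0133: [(0x103C, 0x0020, 0x0004), (0x1054, 0x0020, 0x0004)],
}


def sort_key(code_points):
    """Build a key as a list of weights: all primaries, 0, all secondaries,
    0, all tertiaries, 0, then the code points as a final tie-breaker."""
    elements = [ce for cp in code_points for ce in DUCET_EXCERPT[cp]]
    key = []
    for level in range(3):
        # Zero weights are skipped so that 0 can serve as the level separator.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
        key.append(0)
    key.extend(code_points)
    return key


def format_key(key):
    """Render a key in the dotted hex notation used in this message."""
    return ".".join(format(weight, "X") for weight in key)


if __name__ == "__main__":
    for cps in ([0x69, 0x6A], [0x133], [0x49, 0x4A], [0x132]):
        print(format_key(sort_key(cps)))
```

    Because the keys are plain lists of integers, comparing them element by element gives the same order as a binary comparison of fixed-width weights would.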

    I'd like to know whether there is a standard "encoding" defined to represent the collation keys computed from comparable strings. This could have applications in database indexes, to speed up searches or the ordering of a selection. But perhaps the absolute values of the collation element weights are not defined by the standard, nor the magnitude of their relative differences, because of possible tailoring. In that case the representation remains opaque in UCA implementations, which may even encode each collation level with variable bit sizes to compress the keys, notably for the third collation level, and for the last level (based, for example, on UTF-8 or on some UTF-16-like encoding where the surrogates are placed at the end of the encoding space).

    Another related question: why isn't there a standard 16-bit UTF that preserves the binary ordering of code points? (I mean, for example, UTF-16 modified simply by moving all code units for code points in E000..FFFF down to D800..F7FF, and moving the surrogate code units in D800..DFFF up to F800..FFFF.)
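    To make the remapping concrete, here is a small Python sketch of the modified encoding described above. It is only an illustration of the idea, not a standardized UTF; the function names are hypothetical, and the input is assumed to be a Unicode scalar value (not an isolated surrogate code point).

```python
def utf16_units(cp):
    """Encode one Unicode scalar value as standard UTF-16 code units."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def remap_unit(unit):
    """Apply the proposed shuffle to a single 16-bit code unit."""
    if 0xD800 <= unit <= 0xDFFF:
        return unit + 0x2000   # surrogates move up to F800..FFFF
    if 0xE000 <= unit <= 0xFFFF:
        return unit - 0x0800   # E000..FFFF moves down to D800..F7FF
    return unit                # 0000..D7FF is left unchanged


def order_preserving_units(cp):
    """Encode a scalar value so code unit order matches code point order."""
    return [remap_unit(unit) for unit in utf16_units(cp)]


if __name__ == "__main__":
    samples = [0x41, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF]
    # Standard UTF-16 sorts supplementary characters before E000..FFFF.
    print([hex(cp) for cp in sorted(samples, key=utf16_units)])
    # The remapped units sort in code point order.
    print([hex(cp) for cp in sorted(samples, key=order_preserving_units)])
```

    With this shuffle every surrogate code unit compares higher than every BMP code unit, so a plain binary comparison of the code units orders supplementary characters after U+FFFF.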


