Re: Phoenician

From: Philippe Verdy
Date: Fri May 07 2004 - 19:24:58 CDT

    From: "Philippe Verdy" <>
    > To make things simpler, introduce a special collation key value which is lower
    > than all others (0 in the example above), and you get a simpler view of
    > collation elements as a single vector of numeric values, if you use it as a
    > terminator after each level in the resulting collation string:
    > "aa" => (1, 1, 0, 10, 10, 0)
    > "AB" => (1, 2, 0, 11, 11, 0)
    > "Ab" => (1, 2, 0, 11, 10, 0)
    > "Aba" => (1, 2, 1, 0, 11, 10, 10, 0)
    > This simplifies things to get binary comparable vectors of numeric values. The
    > length of the vector depends on the length (in characters or collation
    > elements) of input strings, and on the number of levels considered.
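
    The quoted scheme can be sketched in a few lines of Python. This is only an
    illustration: the weight table is the toy one implied by the examples
    (letters at level 1, case at level 2), and the names PRIMARY, TERMINATOR
    and sort_key are made up for this sketch.

```python
# Toy weights read off the quoted examples: level 1 distinguishes letters,
# level 2 distinguishes case. These are not DUCET values.
PRIMARY = {"a": 1, "b": 2}
LOWER, UPPER = 10, 11
TERMINATOR = 0  # lower than every real weight


def sort_key(text):
    """Return the levels of `text` concatenated, each closed by TERMINATOR."""
    level1 = [PRIMARY[ch.lower()] for ch in text]
    level2 = [UPPER if ch.isupper() else LOWER for ch in text]
    return level1 + [TERMINATOR] + level2 + [TERMINATOR]


# Python compares lists element by element, so the keys are directly comparable.
words = ["Aba", "AB", "aa", "Ab"]
ordered = sorted(words, key=sort_key)
print(ordered)  # ['aa', 'Ab', 'AB', 'Aba']
```

    Because the terminator is lower than any real weight, a string that is a
    primary prefix of another always sorts first, whatever its case weights.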

    Note that DUCET uses another solution: no separator is used, but instead all
    primary weights are made higher than all secondary weights. If you read the
    Unicode collation algorithm, you'll see that the value 0 is used to mean
    "ignorable at that level", so that it can be suppressed for the collation keys
    generated from input strings.
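
    A toy sketch of that separator-free variant, as described in this message.
    The weights below are hypothetical (they are not real DUCET values); the
    only properties that matter are that every level-1 weight is higher than
    every level-2 weight, and that 0 marks an element as ignorable at a level.

```python
# Hypothetical (level 1, level 2) weights, NOT real DUCET values.
# Every level-1 weight (101, 102) is higher than every level-2 weight (10, 11).
ELEMENTS = {
    "a": (101, 10), "A": (101, 11),
    "b": (102, 10), "B": (102, 11),
    "-": (0, 10),  # 0 = ignorable at level 1 in this toy table
}


def sort_key(text):
    """Concatenate the levels without a separator, skipping zero weights."""
    elements = [ELEMENTS[ch] for ch in text]
    key = []
    for level in range(2):
        key.extend(e[level] for e in elements if e[level] != 0)
    return key


ordered = sorted(["Aba", "AB", "aa", "Ab", "a-b"], key=sort_key)
print(ordered)  # ['aa', 'a-b', 'Ab', 'AB', 'Aba']
```

    When one string runs out of level-1 weights, its first level-2 weight is
    compared against the other string's next level-1 weight and always loses,
    which has the same effect as the explicit terminator.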

    It would have been better to use the values already assigned in the current
    version of DUCET... Note, however, that these values are arbitrary; only their
    relative order is important.

    This algorithm is also a bit more complex, because it allows tailoring the
    order in which the weights of a given level are written to the output collation
    key (they can be output in forward or backward order, notably for French
    ordering at level 2 for accents), and because it takes into account not only
    single characters but also groups of Unicode characters treated as single units
    for collation (for example digraphs that many languages sort as if they were
    one letter, such as in Spanish, or the recently discussed "gb" in Yoruba).
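
    Both complications can be shown with a small sketch. Everything here is an
    assumption made for illustration: the table is a toy one (not DUCET or any
    real tailoring), level 2 holds accent weights, "ch" is listed as a single
    unit between "c" and "d" as in traditional Spanish sorting, and the 0
    terminator from the quoted message is reused to keep the keys simple.

```python
# Hypothetical toy table: unit -> (primary weight, accent weight).
ELEMENTS = {
    "c": (1, 1), "ch": (2, 1), "d": (3, 1), "e": (4, 1), "é": (4, 2),
    "h": (5, 1), "o": (6, 1), "ô": (6, 3), "t": (7, 1),
}
MAX_UNIT = max(len(unit) for unit in ELEMENTS)


def collation_elements(text):
    """Split `text` into units, preferring the longest matching group."""
    i, result = 0, []
    while i < len(text):
        for size in range(MAX_UNIT, 0, -1):
            unit = text[i:i + size]
            if unit in ELEMENTS:
                result.append(ELEMENTS[unit])
                i += len(unit)
                break
        else:
            raise KeyError(text[i])  # character missing from the toy table
    return result


def sort_key(text, backward_accents=False):
    """Primary weights, a 0 terminator, then accent weights (optionally reversed)."""
    elements = collation_elements(text)
    primary = [p for p, _ in elements]
    accents = [a for _, a in elements]
    if backward_accents:
        accents.reverse()
    return primary + [0] + accents + [0]


words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=sort_key)
french = sorted(words, key=lambda w: sort_key(w, backward_accents=True))
print(forward)  # ['cote', 'coté', 'côte', 'côté']
print(french)   # ['cote', 'côte', 'coté', 'côté']
```

    With the accent level written backward, the last accent difference in the
    word decides first, which gives the French order. The digraph entry means
    that a toy string such as "chote" sorts after "cote" and before "dote",
    instead of being compared as "c" followed by "h".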
