Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Peter Kirk
Date: Tue Aug 19 2003 - 18:11:23 EDT


    Resending with the correct address...

    On 19/08/2003 14:23, Mark Davis wrote:

    >Three points.
    >First, while we try to make the UCA collation table (DUCET) as reasonable as
    >possible for the main languages of a given script, it is not guaranteed to
    >produce the correct sorting for any particular language. The UCA *is* designed
    >so that it provides a default base ordering for all of Unicode, and individual
    >languages can be given tailorings of the DUCET that handle the specifics of
    >their string comparison requirements.
    >Thus if there are changes that improve the handling of the UCA for the major
    >languages using a given script, and do not destabilize others, those are
    >candidates for change in a version. For example, if it turned out that a
    >particular Tamil character (or sequence of characters!) was not sorted correctly
    >according to the DUCET (e.g. on,
    >then it would be a candidate, and should be submitted on the form.
    Understood. On this basis, the DUCET sorting for the Hebrew block should
    be based on the requirements for modern Hebrew, with Yiddish, Ladino etc
    also being taken into account.

    >Second, we do and should favor modern language communities when making
    >incompatible tradeoffs. So if we have the choice between making French sort
    >correctly without tailoring, or have Latin sort correctly without tailoring, we
    >should choose the modern community. The Latin community can always use a
    >tailored UCA, in any event.
    Understood. I accept the primacy of the modern language in this case.
    There may be some issues on which the modern language has no
    preference, especially for characters only used in older Hebrew, and in
    such cases it would make sense to follow the preferences of ancient
    Hebrew scholars. If it becomes necessary to use a tailored UCA for
    biblical work, so be it, but I would prefer not to. We have come close
    to having to use a separate set of vowels for biblical Hebrew simply
    because decisions were rushed and then frozen on the basis of modern
    Hebrew requirements. I don't want any danger of falling into the same
    kind of trap with collation.

    >Third, there is often a serious confusion between sorting weight and canonical
    >ordering. The fact that a grave accent precedes a cedilla in canonical order is
    >*completely independent of* whatever collation weights each of them has, either
    >in a tailoring or in the DUCET. The only substantive issue is how each of these
    >sorts separately or in combination. And making the combination (sequence) of
    >grave and cedilla sort before grave, after grave, before cedilla, or after
    >cedilla are all possible; all of those can be handled by the UCA as
    >contractions. See for more
    Yes, I understand that the collation weights are quite independent of
    the canonical combining classes. But collation does become trickier
    when the canonical ordering is not the expected one, because of the
    assumption that collation follows the order of the string, i.e. that it
    is based on the first character, then the second, and so on.
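
    To illustrate how canonical ordering can differ from the order one
    might expect, here is a minimal sketch using Python's standard
    unicodedata module. The choice of shin with shin dot, dagesh and patah
    is only an example, and the variable names are my own.

```python
import unicodedata

SHIN, SHIN_DOT, DAGESH, PATAH = "\u05E9", "\u05C1", "\u05BC", "\u05B7"

def describe(text):
    """Return (character name, canonical combining class) per code point."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]

# An order a reader might expect: the marks that modify the consonant
# (shin dot, dagesh) first, then the vowel.
typed = SHIN + SHIN_DOT + DAGESH + PATAH

# Canonical reordering sorts the marks by combining class:
# patah (17), dagesh (21), shin dot (24).
canonical = unicodedata.normalize("NFD", typed)

for name, ccc in describe(canonical):
    print(f"{ccc:3d}  {name}")
```

    The vowel ends up between the consonant and the dagesh, so a collation
    rule that looks only at adjacent characters no longer sees
    <consonant, dagesh> as a unit.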

    Well, I am glad that contractions provide a way around that problem. So
    perhaps we ought to be looking at using them for Hebrew in DUCET. I
    guess we should consider defining contractions for each case of
    <consonant, dagesh> which differ from the consonant at the second level
    only, perhaps also the same for rafe, and similarly for each combination
    of shin, shin/sin dot and dagesh. The problem is that the vowels
    intrude between the consonant and the dagesh, and meteg comes before
    shin/sin dot, so there is a potential need for a rather large number of
    contractions, especially if we consider a shin with a right meteg which
    might come out as:

    <shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot |
    sin dot}, masora circle>

    with the CGJ inhibiting complete canonical reordering, while the
    shin/sin dot must still be contracted with the shin.
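
    As a rough sketch of what a <consonant, dagesh> contraction differing
    "at the second level only" would mean, here is a toy two-level sort
    key. The weights and the table are entirely hypothetical (they are not
    taken from the DUCET), and the matcher is a simple greedy
    longest-match over adjacent code points, not the full UCA algorithm.

```python
BET, GIMEL, DAGESH = "\u05D1", "\u05D2", "\u05BC"

# Hypothetical (primary, secondary) weights, for illustration only.
TABLE = {
    (BET,): (10, 1),
    (BET, DAGESH): (10, 2),  # contraction: same primary as bet, higher secondary
    (GIMEL,): (11, 1),
    (DAGESH,): (0, 3),       # a lone dagesh is ignorable at the primary level
}
MAX_LEN = max(len(key) for key in TABLE)

def sort_key(text):
    """Build a (primaries, secondaries) key using greedy longest match."""
    primaries, secondaries = [], []
    i = 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = tuple(text[i:i + n])
            if chunk in TABLE:
                primary, secondary = TABLE[chunk]
                if primary:
                    primaries.append(primary)
                secondaries.append(secondary)
                i += n
                break
        else:
            raise KeyError(f"no weight for U+{ord(text[i]):04X}")
    return (primaries, secondaries)

words = [GIMEL, BET + DAGESH, BET]
ordered = sorted(words, key=sort_key)
```

    Here bet with dagesh sorts after plain bet but before gimel, because
    the primary level is compared in full before the secondary level is
    consulted. Since this toy matcher sees only adjacent code points, each
    vowel or meteg that can intrude would need its own table entry, which
    is where the large number of contractions comes from.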

    Perhaps we need to specify that dagesh and shin/sin dot must always come
    BEFORE any CGJ in such combinations so that they don't get separated too
    far from the base character. In fact I think I will change my document
    to specify that.
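
    A small check of how the CGJ affects canonical reordering, again using
    Python's unicodedata module. The three sequences are my own
    simplified examples (one vowel, no masora circle), not text from the
    document mentioned above.

```python
import unicodedata

SHIN, DAGESH, METEG, PATAH, SHIN_DOT = "\u05E9", "\u05BC", "\u05BD", "\u05B7", "\u05C1"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def nfd(text):
    """Apply canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)

# No CGJ: the marks are fully reordered by combining class, so the
# patah (17) moves ahead of the dagesh (21) and meteg (22).
without_cgj = nfd(SHIN + DAGESH + METEG + PATAH + SHIN_DOT)

# CGJ after the meteg: reordering cannot cross the CGJ, so the meteg
# stays before the vowel, but the shin dot is stranded after the vowel.
dot_after_cgj = nfd(SHIN + DAGESH + METEG + CGJ + PATAH + SHIN_DOT)

# Dagesh and shin dot both placed before the CGJ: they stay in the first
# run of marks, next to the base character.
dot_before_cgj = nfd(SHIN + DAGESH + SHIN_DOT + METEG + CGJ + PATAH)

for label, text in [("without CGJ", without_cgj),
                    ("dot after CGJ", dot_after_cgj),
                    ("dot before CGJ", dot_before_cgj)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

    In the last case the meteg (22) is still reordered ahead of the shin
    dot (24) within the run before the CGJ, but neither the dagesh nor the
    shin dot can drift past the CGJ and the vowel.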

    PS Is there a problem with the Unicode Hebrew list? Nothing seems to
    have appeared on it today, including my previous posting on this thread
    and Mark's reply to it.

    Peter Kirk
