Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Mark Davis
Date: Tue Aug 19 2003 - 18:22:01 EDT

    I forgot the most important point of all:

    The goal for UCA 4.0 is to top it up to the Unicode 4.0 repertoire. The
    timeframe for that is quite short -- it was to have been done some time ago --
    and we don't want to make any changes that we would want to pull out later when
    we work with SC22/WG20. So we will only make "safe and obvious" changes in this

    Of course, you should still continue to work on any more extensive comments for
    a later version, so that they are prepared well in advance; after all, all of
    these issues are on collation features that have been in since 3.1 and before!

    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Peter Kirk"
    To: "Mark Davis"
    Cc: "Matitiahu Allouche"; "Joan Wardell"
    Sent: Tuesday, August 19, 2003 14:55
    Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

    > On 19/08/2003 14:23, Mark Davis wrote:
    > >Three points.
    > >
    > >First, while we try to make the UCA collation table (DUCET) as reasonable as
    > >possible for the main languages of a given script, it is not guaranteed to
    > >produce the correct sorting for any particular language. The UCA *is* designed
    > >so that it provides a default base ordering for all of Unicode, and
    > >languages can be given tailorings of the DUCET that handle the specifics of
    > >their string comparison requirements.
    > >
    > >Thus if there are changes that improve the handling of the UCA for the major
    > >languages using a given script, and do not destabilize others, those are
    > >candidates for change in a version. For example, if it turned out that a
    > >particular Tamil character (or sequence of characters!) was not sorted
    > >according to the DUCET (e.g. on,
    > >then it would be a candidate, and should be submitted on the form.
    > >
    > >
    > Understood. On this basis, the DUCET sorting for the Hebrew block should
    > be based on the requirements for modern Hebrew, with Yiddish, Ladino etc
    > also being taken into account.
    > >Second, we do and should favor modern language communities when making
    > >incompatible tradeoffs. So if we have the choice between making French sort
    > >correctly without tailoring, or have Latin sort correctly without tailoring,
    > >we should choose the modern community. The Latin community can always use a
    > >tailored UCA, in any event.
    > >
    > >
    > Understood. I accept the primacy of the modern language in this case.
    > There may be some issues on which the modern language has no
    > preference, especially for characters only used in older Hebrew, and in
    > such cases it would make sense to follow the preferences of ancient
    > Hebrew scholars. If it becomes necessary to use a tailored UCA for
    > biblical work, so be it, but I would prefer not to. We have come close
    > to having to use a separate set of vowels for biblical Hebrew simply
    > because decisions were rushed and then frozen on the basis of modern
    > Hebrew requirements. I don't want any danger of falling into the same
    > kind of trap with collation.
    > >Third, there is often a serious confusion between sorting weight and canonical
    > >ordering. The fact that a grave accent precedes a cedilla in canonical order is
    > >*completely independent of* whatever collation weights each of them has,
    > >in a tailoring or in the DUCET. The only substantive issue is how each of them
    > >sorts separately or in combination. And making the combination (sequence) of
    > >grave and cedilla sort before grave, after grave, before cedilla, or after
    > >cedilla are all possible; all of those can be handled by the UCA as
    > >contractions. See for more
    > >information.
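
As a rough illustration of the contraction idea in the third point, here is a minimal Python sketch. It is not the UCA itself: the weights are hypothetical toy values rather than DUCET data, only two levels are modelled, and the names `sort_key`, `BASE`, `BEFORE_GRAVE` and `AFTER_CEDILLA` are invented for this example. It matches contiguous sequences only, whereas the real algorithm also handles discontiguous contractions. The contraction is keyed on the two marks in whatever order NFD puts them, so the placement of the combination depends only on the weight assigned to it.

```python
import unicodedata

GRAVE = "\u0300"    # COMBINING GRAVE ACCENT
CEDILLA = "\u0327"  # COMBINING CEDILLA
# The two marks in whatever order canonical reordering (NFD) puts them.
BOTH = unicodedata.normalize("NFD", GRAVE + CEDILLA)


def sort_key(text, table):
    """Toy two-level key: all primary weights, then all secondary weights."""
    s = unicodedata.normalize("NFD", text)
    longest = max(len(k) for k in table)
    primary, secondary = [], []
    i = 0
    while i < len(s):
        # Try the longest candidate first so a contraction wins over its parts.
        for n in range(min(longest, len(s) - i), 0, -1):
            chunk = s[i:i + n]
            if chunk in table:
                p, sec = table[chunk]
                if p:  # primary weight 0 means "ignorable at level 1"
                    primary.append(p)
                secondary.append(sec)
                i += n
                break
        else:
            raise KeyError(f"no weight for U+{ord(s[i]):04X}")
    return tuple(primary), tuple(secondary)


# Hypothetical (primary, secondary) weights, NOT DUCET values.
BASE = {"a": (10, 1), GRAVE: (0, 5), CEDILLA: (0, 7)}
BEFORE_GRAVE = {**BASE, BOTH: (0, 3)}   # combination sorts before grave alone
AFTER_CEDILLA = {**BASE, BOTH: (0, 9)}  # combination sorts after cedilla alone

WORDS = ["a" + GRAVE, "a" + CEDILLA, "a" + GRAVE + CEDILLA]


def order(table):
    """Sort WORDS under the given toy weight table."""
    return sorted(WORDS, key=lambda w: sort_key(w, table))
```

Calling `order(BEFORE_GRAVE)` puts the doubly accented string first, while `order(AFTER_CEDILLA)` puts it last; the canonical order of the two marks never enters into the choice.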
    > >
    > >
    > Yes, I understand that the collation weights are quite independent of
    > the canonical combining classes. But collation does become trickier
    > when the canonical ordering is not the expected one, because of the
    > assumption that collation is based on the order of the string i.e. based
    > on the first character, then the second etc.
    > Well, I am glad that contractions provide a way around that problem. So
    > perhaps we ought to be looking at using them for Hebrew in DUCET. I
    > guess we should consider defining contractions for each case of
    > <consonant, dagesh> which differ from the consonant at the second level
    > only, perhaps also the same for rafe, and similarly for each combination
    > of shin, shin/sin dot and dagesh. The problem comes that the vowels
    > intrude between the consonant and the dagesh, and meteg comes before
    > shin/sin dot, so there is a potential need for a rather large number of
    > contractions, especially if we consider a shin with a right meteg which
    > might come out as:
    > <shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot |
    > sin dot}, masora circle>
    > with the CGJ inhibiting complete canonical reordering, and the shin/sin
    > dot must be contracted with the shin.
    > Perhaps we need to specify that dagesh and shin/sin dot must always come
    > BEFORE any CGJ in such combinations so that they don't get separated too
    > far from the base character. In fact I think I will change my document
    > to specify that.
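
The reordering described above can be checked with a short Python sketch using the standard `unicodedata` module. It assumes patah as a representative vowel and omits the masora circle; the helper names `classes` and `nfd` are made up for the example. It shows the vowel moving between the consonant and the dagesh under canonical reordering, and a CGJ (combining class 0) keeping the marks on either side of it from being reordered across it.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
PATAH = "\u05B7"     # HEBREW POINT PATAH (stands in for any vowel)
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
METEG = "\u05BD"     # HEBREW POINT METEG
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER


def classes(s):
    """Canonical combining class of each code point in s."""
    return [unicodedata.combining(ch) for ch in s]


def nfd(s):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


# Marks entered in the order dagesh, meteg, vowel, shin dot.
without_cgj = SHIN + DAGESH + METEG + PATAH + SHIN_DOT
# The same marks with a CGJ after the meteg, as in the sequence above.
with_cgj = SHIN + DAGESH + METEG + CGJ + PATAH + SHIN_DOT

print(classes(without_cgj))
# Without CGJ the vowel is reordered ahead of the dagesh and meteg.
print(nfd(without_cgj) == SHIN + PATAH + DAGESH + METEG + SHIN_DOT)
# With CGJ the marks cannot move across it, so the string is unchanged.
print(nfd(with_cgj) == with_cgj)
```

Any contraction defined for collation would have to match the sequences as they appear after this reordering, which is why the position of the CGJ relative to the dagesh and the shin/sin dot matters.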
    > PS Is there a problem with the Unicode Hebrew list? Nothing seems to
    > have appeared on it today, including my previous posting on this thread
    > and Mark's reply to it.
    > --
    > Peter Kirk
    > (personal)
    > (work)
