Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue Aug 19 2003 - 17:23:11 EDT

    Three points.

    First, while we try to make the UCA collation table (DUCET) as reasonable as
    possible for the main languages of a given script, it is not guaranteed to
    produce the correct sorting for any particular language. The UCA *is* designed
    so that it provides a default base ordering for all of Unicode, and individual
    languages can be given tailorings of the DUCET that handle the specifics of
    their string comparison requirements.

    Thus if there are changes that improve the handling of the UCA for the major
    languages using a given script, and do not destabilize others, those are
    candidates for change in a version. For example, if it turned out that a
    particular Tamil character (or sequence of characters!) was not sorted correctly
    according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
    then it would be a candidate, and should be submitted on the form.
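
As a rough illustration of a default ordering plus a language tailoring, here is a minimal Python sketch. The weights, the `DEFAULT` and `SWEDISH_LIKE` tables, and the `sort_key` helper are all hypothetical and far simpler than the real DUCET; the only outside fact assumed is that Swedish sorts a-umlaut after z.

```python
# Toy two-level collation: every weight here is invented for illustration
# and is NOT taken from the DUCET.
A_UML = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS

# Default table: (primary, secondary). By default a-umlaut shares the
# primary weight of "a" and differs only at the secondary level.
DEFAULT = {"a": (10, 0), A_UML: (10, 1), "o": (20, 0), "z": (30, 0)}

# Hypothetical tailoring: give a-umlaut its own primary weight after "z".
SWEDISH_LIKE = {A_UML: (31, 0)}

def sort_key(word, tailoring=None):
    """Build a (primaries, secondaries) key; tailoring entries override defaults."""
    table = {**DEFAULT, **(tailoring or {})}
    weights = [table[ch] for ch in word]
    return (tuple(p for p, _ in weights), tuple(s for _, s in weights))

words = ["zo", A_UML + "o", "az"]
default_order = sorted(words, key=sort_key)
tailored_order = sorted(words, key=lambda w: sort_key(w, SWEDISH_LIKE))
```

With the default table the a-umlaut word sorts among the "a" words; with the tailoring it moves after "zo", and the default table itself is left unchanged.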

    Second, we do and should favor modern language communities when making
    incompatible tradeoffs. So if we have the choice between making French sort
    correctly without tailoring and making Latin sort correctly without tailoring, we
    should choose the modern community. The Latin community can always use a
    tailored UCA, in any event.

    Third, there is often a serious confusion between sorting weight and canonical
    ordering. The fact that a cedilla precedes a grave accent in canonical order is
    *completely independent of* whatever collation weights each of them has, either
    in a tailoring or in the DUCET. The only substantive issue is how each of these
    sorts separately or in combination. Making the combination (sequence) of
    grave and cedilla sort before grave, after grave, before cedilla, or after
    cedilla is possible in each case; all of those can be handled by the UCA as
    contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
    information.
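
The separation between canonical order and collation weight can be sketched in Python. The standard `unicodedata` module supplies the normalization; the `SINGLE` weights, the contraction weights, and the `secondary_key` and `order` helpers are hypothetical, and the lookup is a much reduced stand-in for how the UCA matches contractions.

```python
import unicodedata

GRAVE, CEDILLA = "\u0300", "\u0327"  # combining grave accent, combining cedilla

# Either typing order normalizes to the same canonically ordered sequence.
nfd_a = unicodedata.normalize("NFD", "a" + GRAVE + CEDILLA)
nfd_b = unicodedata.normalize("NFD", "a" + CEDILLA + GRAVE)
PAIR = nfd_a[1:]  # the two marks, in whatever order normalization chose

# Hypothetical secondary weights for the marks on their own.
SINGLE = {GRAVE: 20, CEDILLA: 30}

def secondary_key(text, contractions):
    """Weights of the marks after one base letter, longest match first."""
    marks = unicodedata.normalize("NFD", text)[1:]
    key, i = [], 0
    while i < len(marks):
        for length in (2, 1):
            chunk = marks[i:i + length]
            table = contractions if length == 2 else SINGLE
            if chunk in table:
                key.append(table[chunk])
                i += length
                break
        else:
            raise KeyError(chunk)
    return tuple(key)

FORMS = {"grave": "a" + GRAVE, "cedilla": "a" + CEDILLA,
         "both": "a" + GRAVE + CEDILLA}

def order(contraction_weight):
    """Sort the three forms with the pair treated as one contraction."""
    contractions = {PAIR: contraction_weight}
    return sorted(FORMS, key=lambda name: secondary_key(FORMS[name], contractions))
```

Calling `order(10)`, `order(25)`, and `order(35)` places the combined form before grave, between grave and cedilla, and after cedilla respectively, while the canonical order of the two marks never changes.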

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Peter Kirk" <peter.r.kirk@ntlworld.com>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: "Matitiahu Allouche" <matial@il.ibm.com>; <unicode@unicode.org>;
    <bidi@unicode.org>; <hebrew@unicode.org>; "Joan Wardell" <Joan_Wardell@sil.org>
    Sent: Tuesday, August 19, 2003 13:59
    Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

    > On 19/08/2003 07:24, Mark Davis wrote:
    >
    > >B. Dagesh
    > >
    > >
    > >>2) There is something strange in the combinations of Shin with Dagesh and
    > >>dots: for all other letters, the form without Dagesh sorts before the form
    > >>with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
    > >>combinations with Dagesh. I cannot imagine a justification for that.
    > >>
    > >>
    > >
    > >We have currently in UCA the following (from UCA 4.0.0d1 (beta))
    > >05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
    > >05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
    > >05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
    > >05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
    > >05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
    > >05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
    > >05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
    > >05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
    > >05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
    > >05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
    > >05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
    > >05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
    > >05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
    > >05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
    > >05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
    > >FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
    > >
    > >To make this change, we would move Dagesh to after SIN DOT. Question: should it
    > >also go after VARIKA or not?
    > >
    > >Mark
    > >__________________________________
    > >http://www.macchiato.com
    > >► “Eppur si muove” ◄
    > >
    > >
    > >
    > Please, don't rush any changes to the UCA here. We need a proper review
    > of what is required for biblical as well as modern Hebrew (hopefully the
    > same but possibly not), not just a quick conclusion that we fix things
    > by reordering dagesh.
    >
    > A lot of the problem with dagesh etc comes from the highly inappropriate
    > canonical combining classes for U+05B0 to U+05C4. I was told not long
    > ago that the ordering of these didn't matter, only the distinctions do,
    > but the ordering sure does matter when it comes to collation. Shin with
    > dagesh and patah is logically <shin, shin dot, dagesh, patah> and
    > should probably be collated on the basis of that ordering, i.e. sort
    > first by the sin/shin dot, then by whether there is dagesh or not, then
    > by the vowel. But the canonically ordered NFD which is the input to
    > collation is <shin, patah, dagesh, shin dot>. So somehow the collation
    > algorithm has to be asked to undo the damage which normalisation did and
    > collate these things in the right order.
    >
    > And please don't discuss Hebrew here in isolation from the discussion of
    > the same subject on the Hebrew list - at least the discussion which I
    > was raising there on the understanding that matters of Hebrew were
    > supposed to be discussed there.
    >
    > --
    > Peter Kirk
    > peter@qaya.org (personal)
    > peterkirk@qaya.org (work)
    > http://www.qaya.org/
    >
    >
    >
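
The Hebrew points raised in the quoted message can be checked with a short Python sketch. The `BETA` secondary weights are copied from the quoted UCA 4.0.0d1 excerpt; the `MOVED` table is only one hypothetical renumbering with dagesh placed after sin dot, and `mark_key` is an illustrative helper that ignores the base letter and all other collation levels.

```python
import unicodedata

SHIN, PATAH, DAGESH = "\u05E9", "\u05B7", "\u05BC"
SHIN_DOT, SIN_DOT = "\u05C1", "\u05C2"

# Logical order from the quoted message, and what NFD makes of it.
logical = SHIN + SHIN_DOT + DAGESH + PATAH
canonical = unicodedata.normalize("NFD", logical)

# Secondary weights from the quoted beta table.
BETA = {PATAH: 0x00B9, DAGESH: 0x00BD, SHIN_DOT: 0x00C1, SIN_DOT: 0x00C2}

# Hypothetical renumbering: dagesh removed from its slot and placed
# after sin dot. These values are illustrative, not a published table.
MOVED = {PATAH: 0x00B9, SHIN_DOT: 0x00C0, SIN_DOT: 0x00C1, DAGESH: 0x00C2}

def mark_key(text, table):
    """Secondary weights of the points, taken in NFD order."""
    nfd = unicodedata.normalize("NFD", text)
    return tuple(table[ch] for ch in nfd if ch in table)

plain = SHIN + SHIN_DOT
with_dagesh = SHIN + SHIN_DOT + DAGESH

# True when the form with dagesh sorts before the form without it.
dagesh_first_beta = mark_key(with_dagesh, BETA) < mark_key(plain, BETA)
dagesh_first_moved = mark_key(with_dagesh, MOVED) < mark_key(plain, MOVED)
```

Under the beta weights the form with dagesh sorts first, which is the anomaly reported for shin; under the hypothetical renumbering it sorts after the plain form. The NFD result is the same in both cases: patah, then dagesh, then shin dot.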


