Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Peter Kirk
Date: Tue Aug 19 2003 - 18:10:18 EDT

    On 19/08/2003 07:24, Mark Davis wrote:

    >B. Dagesh
    >>2) There is something strange in the combinations of Shin with Dagesh and
    >>dots: for all other letters, the form without Dagesh sorts before the form
    >>with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
    >>combinations with Dagesh. I cannot imagine a justification for that.
    >We have currently in UCA the following (from UCA 4.0.0d1 (beta))
    >05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
    >05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
    >05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
    >05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
    >05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
    >05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
    >05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
    >05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
    >05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
    >05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
    >05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
    >05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
    >05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
    >05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
    >05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
    >To make this change, we would move Dagesh to after SIN DOT. Question: should it
    >also go after VARIKA or not?
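
    For reference, the anomaly is easy to reproduce from the secondary
    weights quoted above. This is only a rough sketch: it compares the
    marks' secondary weights in string order and ignores the base letter
    and everything else in the UCA. The name secondary_key and the moved
    weight 0x00C3 are made up for illustration and are not proposed values.

```python
# Secondary weights copied from the quoted UCA 4.0.0d1 (beta) table.
SECONDARY = {
    "\u05BC": 0x00BD,  # HEBREW POINT DAGESH OR MAPIQ
    "\u05C1": 0x00C1,  # HEBREW POINT SHIN DOT
    "\u05C2": 0x00C2,  # HEBREW POINT SIN DOT
}

DAGESH, SHIN_DOT = "\u05BC", "\u05C1"


def secondary_key(marks, weights=SECONDARY):
    """Simplified level-2 key: the marks' secondary weights in string order.

    Hypothetical helper; a real UCA key also carries the base letter's weight.
    """
    return tuple(weights[m] for m in marks)


# Any other letter, e.g. bet: no marks versus dagesh alone.
bet_plain = secondary_key("")
bet_dagesh = secondary_key(DAGESH)

# Shin with shin dot, without and with dagesh (marks in NFD order).
shin_plain = secondary_key(SHIN_DOT)
shin_dagesh = secondary_key(DAGESH + SHIN_DOT)

print(bet_plain < bet_dagesh)    # plain bet sorts first
print(shin_dagesh < shin_plain)  # but shin WITH dagesh sorts first

# Assumption: dagesh moved after SIN DOT, with an invented weight 0x00C3.
MOVED = dict(SECONDARY)
MOVED[DAGESH] = 0x00C3
moved_plain = secondary_key(SHIN_DOT, MOVED)
moved_dagesh = secondary_key(DAGESH + SHIN_DOT, MOVED)
print(moved_plain < moved_dagesh)  # plain shin now sorts first
```

    Under these simplifications the reordering does remove the anomaly
    for shin, which says nothing yet about whether it is the right fix.
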
    Please, don't rush any changes to the UCA here. We need a proper review
    of what is required for biblical as well as modern Hebrew (hopefully the
    same but possibly not), not just a quick conclusion that we fix things
    by reordering dagesh.

    A lot of the problem with dagesh etc. comes from the highly inappropriate
    canonical combining classes for U+05B0 to U+05C4. I was told not long
    ago that the ordering of these didn't matter, only the distinctions do,
    but the ordering sure does matter when it comes to collation. Shin with
    dagesh and patah is logically <shin, shin dot, dagesh, patah> and
    should probably be collated on the basis of that ordering, i.e. sort
    first by the sin/shin dot, then by whether there is dagesh or not, then
    by the vowel. But the canonically ordered NFD, which is the input to
    collation, is <shin, patah, dagesh, shin dot>. So somehow the collation
    algorithm has to be asked to undo the damage which normalisation did and
    collate these things in the right order.
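
    To make the reordering concrete, here is a minimal sketch using
    Python's standard unicodedata module. The functions mark_rank and
    logical_order are hypothetical helpers of my own, not part of the UCA
    or of any library, and the ranking (sin/shin dot, then dagesh, then
    everything else) is just the ordering suggested above.

```python
import unicodedata

SHIN = "\u05E9"
SHIN_DOT = "\u05C1"
SIN_DOT = "\u05C2"
DAGESH = "\u05BC"
PATAH = "\u05B7"

# Logical order: <shin, shin dot, dagesh, patah>.
logical = SHIN + SHIN_DOT + DAGESH + PATAH

# Canonical ordering sorts the marks by combining class.
nfd = unicodedata.normalize("NFD", logical)
print([f"U+{ord(c):04X}" for c in nfd])
print({unicodedata.name(c): unicodedata.combining(c) for c in nfd[1:]})


def mark_rank(ch):
    """Hypothetical ranking: sin/shin dot, then dagesh, then other marks."""
    if ch in (SHIN_DOT, SIN_DOT):
        return 0
    if ch == DAGESH:
        return 1
    return 2


def logical_order(text):
    """Reorder each run of combining marks by mark_rank (stable sort)."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch):
            run.append(ch)  # collect marks following the current base
        else:
            out.extend(sorted(run, key=mark_rank))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=mark_rank))  # flush the final run
    return "".join(out)


restored = logical_order(nfd)
print(restored == logical)
```

    A collation implementation would need some equivalent of this step
    (or contractions to the same effect) before assigning weights, if the
    marks are to be compared in their logical order.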

    And please don't discuss Hebrew here in isolation from the discussion of
    the same subject on the Hebrew list - at least the discussion which I
    was raising there on the understanding that matters of Hebrew were
    supposed to be discussed there.

    Peter Kirk

