Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Mark Davis
Date: Tue Aug 19 2003 - 10:24:12 EDT

    Ah, that explains it. You had filed this against ICU, not UCA; that explains why
    I couldn't find it in the Unicode reports.

    A. Final.
    > 1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
    > or absence of Dagesh is a Secondary difference, while Final/non-Final is a
    > Tertiary difference. This is relevant only for letters Kaf and Pe. My
    > gut feeling says that Final/non-Final should have precedence over
    > Dagesh/no-Dagesh.
    > Note that the number of actual cases where this would make a difference is
    > probably *very* small.

    So there are two issues for final vs non-final: strength and ordering.

    A1. Ordering is easy to change; in ICU or UCA we could put the final values
    before the independent letters. In ICU they are just rules, while in UCA they
    follow The
    easiest in UCA would be to give the 5 independent forms that have finals the
    value <isolated>.

    Note: there is one minor fallout in ICU: we optimize the sortkey compression of
    tertiary values of NONE; if we change the ordering then each instance of the
    <isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
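
    To make the ordering question in A1 concrete, here is a minimal sketch of how
    the tertiary weight decides between "zip" (Zayin Yod Pe) and "zif" (Zayin Yod
    Final Pe). The primary weights are invented for illustration; 0x0002 and
    0x0019 are the tertiary values quoted in this thread, 0x0020 is assumed as the
    common secondary weight, and 0x001A for <isolated> is an assumption (any value
    above 0x0019 shows the same effect). It is not ICU or UCA code.

```python
# Collation elements as (primary, secondary, tertiary) tuples.
# Primaries are invented; 0x001A for <isolated> is an assumption.
ZAYIN, YOD, PE = 0x1000, 0x1001, 0x1002

CURRENT = {
    "zayin": (ZAYIN, 0x0020, 0x0002),
    "yod": (YOD, 0x0020, 0x0002),
    "pe": (PE, 0x0020, 0x0002),
    "final_pe": (PE, 0x0020, 0x0019),  # <final> tertiary weight
}
# Proposed: the non-final letter that has a final form gets <isolated>.
PROPOSED = dict(CURRENT, pe=(PE, 0x0020, 0x001A))


def sort_key(word, table):
    """Build a key level by level: all primaries, then secondaries, then tertiaries."""
    elements = [table[letter] for letter in word]
    key = []
    for level in range(3):
        key.extend(ce[level] for ce in elements)
        key.append(0)  # level separator, lower than any real weight
    return tuple(key)


zip_word = ["zayin", "yod", "pe"]
zif_word = ["zayin", "yod", "final_pe"]

for name, table in (("current", CURRENT), ("proposed", PROPOSED)):
    ordered = sorted([zip_word, zif_word], key=lambda w: sort_key(w, table))
    print(name, [" ".join(w) for w in ordered])
```

    With the current weights the non-final spelling sorts first; with the
    assumed <isolated> weight the final spelling sorts first, which is the order
    Mati asked for.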

    A2. For Strength, it is not as clear cut. If Final vs non-Final is more
    important than dagesh, etc, the easiest thing is to make it a primary
    difference; but that would make

    Zayin Yod PeFinal

    sort before all words

    Zayin Yod Pe XXX

    But I'm guessing that is probably not desired for Hebrew.

    In ICU we could make Final vs non-Final be a secondary difference, and have
    Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
    expect the 2nd level to be 'accent-like', and there might be more
    inconsistencies in practice than you would gain by having the current situation.
    In Unicode, the UCA has more production restrictions as per, so it
    would be a bit harder to make that change.

    So if SII would like this change, I'd recommend that we make the ordering change
    in UCA (which will then affect ICU), but not make a strength change (it would
    have to be extremely exotic for that to make a difference).
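
    As a rough illustration of how exotic such a case has to be, the sketch
    below builds a pair of words where the two levels disagree, so that the
    relative strength of final/non-final and dagesh decides the order. The
    `Letter` type, the weights, and the two words are all hypothetical; only the
    level-by-level comparison follows the general UCA scheme.

```python
from typing import NamedTuple


class Letter(NamedTuple):
    primary: int   # invented primary weight
    finality: int  # 1 = final form, 2 = non-final (final sorts first)
    pointing: int  # 1 = unpointed, 2 = with dagesh


PE, KAF = 0x1002, 0x1003  # invented primaries


def key(word, level_order):
    """Primary level first, then the remaining levels in the given order."""
    out = []
    for field in ("primary",) + level_order:
        out.extend(getattr(letter, field) for letter in word)
        out.append(0)  # level separator
    return tuple(out)


# Same primaries; the words differ in dagesh on the first letter
# and in final/non-final on the last letter.
word_a = [Letter(PE, 2, 2), Letter(KAF, 1, 1)]  # Pe+dagesh, Final Kaf
word_b = [Letter(PE, 2, 1), Letter(KAF, 2, 1)]  # Pe, non-final Kaf

dagesh_stronger = ("pointing", "finality")  # current arrangement
final_stronger = ("finality", "pointing")   # arrangement Mati suggested

print(key(word_a, dagesh_stronger) < key(word_b, dagesh_stronger))
print(key(word_a, final_stronger) < key(word_b, final_stronger))
```

    Only when a dagesh difference and a final/non-final difference pull in
    opposite directions within otherwise identical words does the strength
    assignment change the result.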


    B. Dagesh
    > 2) There is something strange in the combinations of Shin with Dagesh and
    > dots: for all other letters, the form without Dagesh sorts before the form
    > with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
    > combinations with Dagesh. I cannot imagine a justification for that.

    We have currently in UCA the following (from UCA 4.0.0d1 (beta))
    05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
    05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
    05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
    05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
    05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
    05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
    05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
    05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
    05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
    05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
    05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
    05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
    05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
    05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
    05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT

    To make this change, we would move Dagesh to after SIN DOT. Question: should it
    also go after VARIKA or not?
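
    The effect of moving Dagesh can be sketched with the secondary weights
    listed above. The Shin primary weight and the relocated Dagesh weight
    0x00C3 are placeholders, not values from any real table; 0x0020 is assumed
    as the base letter's secondary weight, and the marks are taken in the order
    dagesh, then dot, as canonical ordering would place them.

```python
SHIN_PRIMARY = 0x1004  # invented primary weight for Shin
COMMON = 0x0020        # assumed common secondary weight of the base letter

# Secondary weights from the UCA 4.0.0d1 (beta) listing above.
BETA = {"dagesh": 0x00BD, "shin_dot": 0x00C1, "sin_dot": 0x00C2}
# Placeholder: Dagesh moved to just after SIN DOT.
MOVED = dict(BETA, dagesh=0x00C3)


def secondary_key(marks, table):
    """Key for Shin plus combining marks, down to the secondary level.

    The marks are primary-ignorable, so they contribute only secondary weights.
    """
    return (SHIN_PRIMARY, 0, COMMON, *[table[m] for m in marks], 0)


plain = ["shin_dot"]                  # Shin + shin dot
with_dagesh = ["dagesh", "shin_dot"]  # Shin + dagesh + shin dot

for name, table in (("beta", BETA), ("moved", MOVED)):
    first = min((plain, with_dagesh), key=lambda m: secondary_key(m, table))
    print(name, "sorts first:", "+".join(first))
```

    With the beta weights the Dagesh form sorts first, because 00BD is
    compared against 00C1 at the same position; once Dagesh has a weight above
    the dots, the form without Dagesh sorts first, as it does for other letters.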

    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Matitiahu Allouche" <>
    To: "Mark Davis" <>
    Cc: <>; <>; <>
    Sent: Tuesday, August 19, 2003 01:21
    Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

    > Hello, Mark!
    > There must be some hole in your email archive :-), since you yourself
    > expressed your personal take on the issues. On 04/05/03 (probably 4th of
    > May rather than 5th of April) you wrote me:
    > <QUOTE>
    > From: Mark Davis@IBMUS on 04/05/2003 03:22
    > To: Matitiahu Allouche/Israel/IBM@IBMIL
    > cc: Israel Gidali/Israel/IBM@IBMIL
    > From: Mark Davis/Cupertino/IBM@IBMUS
    > Subject: Bug on Hebrew Collation
    > Importance: Urgent
    > I am working through some collation bugs, and had a question about:
    > Mati, your comments look reasonable. I am, however, a little nervous since
    > as far as I know, the Israeli government committee had input into the
    > basic table for ISO 14651, which is reflected in the UCA. (We don't modify
    > it for Hebrew). Can you confirm with them that these tailorings should be
    > made?
    > Mark
    > </QUOTE>
    > I did not formally submit anything to the UTC, though, so I may be
    > responsible for my own misfortune. At that time, I had 4 remarks. It
    > seems that 2 of them have been implemented, and the 2 others have not.
    > I have second thoughts about the tertiary weight allocated to final
    > letters (0019) as compared to that allocated to non-final letters (0002).
    > That means that final letters are collated *after* the corresponding
    > non-final letters. This goes against accepted Hebrew usage. In normal
    > cases, the non-final letter will be followed by some more letters, so that
    > there will be a primary difference, but exotic cases will be sorted
    > improperly. An example that comes to mind is transliteration of
    > non-Hebrew words. For instance a "zip" file will be transliterated as
    > "Zayin Yod Pe" (Google gives 2840 hits for this spelling). There is a
    > Hebrew word pronounced "zif" (meaning "bristle") which is written
    > identically except that the last letter is a Final Pe. I expect the "zip"
    > file to be collated *after* the "bristle", but this will not happen with
    > the current collation table.
    > I would feel more comfortable if:
    > a) Final letters had a smaller weight than the corresponding non-final
    > letters (for some level >1).
    > b) The level associated with final/non-final was more significant than the
    > level associated with diacritics (Dagesh and/or other Hebrew points).
    > It is not that I have so many really convincing examples that would be
    > broken with the current collation definition, but I think that having
    > weights which reflect the linguistic guidelines is more likely to
    > successfully handle the cases that we have not considered.
    > Shalom (Regards), Mati
    > Bidi Architect
    > Globalization Center Of Competency - Bidirectional Scripts
    > IBM Israel
    > Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
    > To: Matitiahu Allouche/Israel/IBM@IBMIL
    > cc: <>, <>, <>
    > Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
    > I'm sorry that you haven't gotten responses before. I have searched
    > through my email archive, and can't find anything like the message, and
    > I don't think it was brought up to the UTC formally.
    > The first one seems odd, and as you say, it would seem to only affect a
    > vanishingly small number of characters; since these are final character,
    > one presumes there would be subsequent characters that would form a
    > larger difference anyway.
    > Mark

    This archive was generated by hypermail 2.1.5 : Tue Aug 19 2003 - 11:33:44 EDT