Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Thu Aug 21 2003 - 11:21:00 EDT

    OK, it sounds like we have clarity on two items:

    - changing the non-final forms of the five letters that have finals to
      <isolated>
    - moving dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA

    I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
    any event we can get them into ICU 2.8 for Hebrew.
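
A minimal sketch of the second item, in Python, for anyone who wants to see the resulting relative order. The names and the starting weight come from the UCA 4.0.0d1 (beta) excerpt quoted further down; the helper functions and the contiguous renumbering are assumptions made for illustration, and the values in a released table may well differ.

```python
# Relative order of the last five Hebrew points in the beta table quoted below.
CURRENT_ORDER = [
    "DAGESH OR MAPIQ",
    "RAFE",
    "SHIN DOT",
    "SIN DOT",
    "JUDEO-SPANISH VARIKA",
]
FIRST_WEIGHT = 0x00BD  # secondary weight of DAGESH OR MAPIQ in the beta table


def move_after(order, item, anchor):
    """Return a copy of `order` with `item` placed directly after `anchor`."""
    reordered = [name for name in order if name != item]
    reordered.insert(reordered.index(anchor) + 1, item)
    return reordered


def assign_weights(order, first_weight):
    """Hypothetical renumbering: consecutive secondary weights in list order."""
    return {name: first_weight + i for i, name in enumerate(order)}


proposed = move_after(CURRENT_ORDER, "DAGESH OR MAPIQ", "JUDEO-SPANISH VARIKA")
weights = assign_weights(proposed, FIRST_WEIGHT)

for name in proposed:
    print(f"{weights[name]:04X}  HEBREW POINT {name}")
```

Only the relative order matters here: RAFE, SHIN DOT, SIN DOT and VARIKA each move up one slot and dagesh takes the last one.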

    As far as the strength issue of final vs dagesh, I don't think we should take
    any immediate action. The collation strength also affects matching. If a user
    sets the sorting or matching level to "ignore accents", for example, they
    probably expect the dots to be ignored then, as well as graves, acutes, etc. If
    the difference showed up in a lot of words, it would still be worth doing, I
    suspect. But because the number of cases where a combination of dageshes and
    finals would make a difference is so very small, I would recommend that SII
    approach this very carefully. If we are going to do anything,
    it should be in the next version of UCA so that we have time to consider all of
    the ramifications. I would not recommend it for ICU 2.8 either, even though we
    have more time (and flexibility) there.
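
To make the point about strength and matching concrete, here is a rough Python sketch of level-by-level sort keys. It is not ICU or UCA code: the table, the function names and every weight in it are invented for illustration, with dagesh as a secondary difference and final vs non-final as a tertiary difference, as in the current chart.

```python
# Toy collation elements: (primary, secondary, tertiary).
# All weights are invented for illustration; they are not DUCET values.
TOY_TABLE = {
    "\u05E4": [(0x10, 0x20, 0x02)],  # PE (non-final)
    "\u05E3": [(0x10, 0x20, 0x19)],  # FINAL PE: same primary, different tertiary
    "\u05BC": [(0x00, 0xBD, 0x02)],  # DAGESH: ignorable at the primary level
    "e": [(0x30, 0x20, 0x02)],
    "\u0301": [(0x00, 0x32, 0x02)],  # COMBINING ACUTE ACCENT
}


def sort_key(text, strength=3):
    """Build a key of the first `strength` levels, one tuple per level."""
    elements = [ce for ch in text for ce in TOY_TABLE[ch]]
    key = []
    for level in range(strength):
        # A zero weight is ignorable at that level, so it is dropped.
        key.append(tuple(ce[level] for ce in elements if ce[level] != 0))
    return tuple(key)


def matches(a, b, strength):
    """Two strings match at a strength if their truncated keys are equal."""
    return sort_key(a, strength) == sort_key(b, strength)


PE, FINAL_PE, DAGESH = "\u05E4", "\u05E3", "\u05BC"

print(matches(PE, PE + DAGESH, 1))  # primary only ("ignore accents")
print(matches(PE, PE + DAGESH, 2))  # secondary level included
print(matches(PE, FINAL_PE, 2))     # final vs non-final is tertiary here
print(matches(PE, FINAL_PE, 3))
```

With these toy weights, a primary-only match treats PE with and without dagesh as equal, just as it treats "e" with and without an acute as equal. Promoting final vs non-final above dagesh would change which of these pairs still match at the second level.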

    You raise one other issue in the following:

    > I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
    > of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
    > so Dagesh should go after Varika.

    From that, it would also appear that VARIKA should either have the same weight
    as RAFE or at least be adjacent to it. This would only be an issue for users
    of that character, so it is probably difficult to establish the right behavior,
    and thus one we would not even try to get into UCA this round.

    We should probably take this discussion off of unicode@unicode.org, and just
    have it on bidi@unicode.org and hebrew@unicode.org. Any people interested in
    this topic should be on those groups anyway.

    Mark
    __________________________________
    http://www.macchiato.com
    ► “Eppur si muove” ◄

    ----- Original Message -----
    From: "Matitiahu Allouche" <matial@il.ibm.com>
    To: "Mark Davis" <mark.davis@jtcsv.com>
    Cc: <unicode@unicode.org>; <bidi@unicode.org>
    Sent: Thursday, August 21, 2003 00:55
    Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

    > Hello, Mark!
    >
    > In order to address your points in order, I will put excerpts of your note
    > within <MARK> . . . </MARK> tags, and my comments as untagged text.
    >
    > <MARK>
    > A. Final.
    > > 1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
    > > or absence of Dagesh is a Secondary difference, while Final/non-Final is a
    > > Tertiary difference. This is relevant only for letters Kaf and Pe. My
    > > gut feeling says that Final/non-Final should have precedence over
    > > Dagesh/no-Dagesh.
    > > Note that the number of actual cases where this would make a difference is
    > > probably *very* small.
    >
    > So there are two issues for final vs non-final: strength and ordering.
    >
    > A1. Ordering is easy to change; in ICU or UCA we could put the final values
    > before the independent letters. In ICU they are just rules, while in UCA they
    > follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table.
    > The easiest in UCA would be to give the 5 independent forms that have finals
    > the value <isolated>.
    >
    > Note: there is one minor fallout in ICU: we optimize the sortkey compression
    > of tertiary values of NONE; if we change the ordering then each instance of
    > the <isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
    > </MARK>
    >
    > I like giving the value <isolated> to the 5 independent forms that have
    > finals. As for the increase in sort-key sizes, this is what cheap memory
    > is made for :-)
    >
    > <MARK>
    > A2. For Strength, it is not as clear cut. If Final vs non-Final is more
    > important than dagesh, etc, the easiest thing is to make it a primary
    > difference; but that would make
    >
    > Zayin Yod PeFinal
    >
    > sort before all words
    >
    > Zayin Yod Pe XXX
    >
    > But I'm guessing that is probably not desired for Hebrew.
    > </MARK>
    >
    > Why? This is exactly what I desire for Hebrew. But I am afraid that
    > making primary differences for Final vs non-Final will make searches using
    > a Final form not match a non-Final form and vice-versa, which is bad:
    > in most cases, the difference between Final vs non-Final must be ignored
    > for searches.
    >
    > <MARK>
    > In ICU we could make Final vs non-Final be a secondary difference, and have
    > Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
    > expect the 2nd level to be 'accent-like', and there might be more
    > inconsistencies in practice than you would gain by having the current
    > situation.
    > </MARK>
    >
    > I don't think that there is enough experience accumulated to create people's
    > expectations. If this is the right thing (and I think it is), it is still
    > early enough to do it now.
    >
    > <MARK>
    > In Unicode, the UCA has more production restrictions as per
    > http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
    > would be a bit harder to make that change.
    >
    > So if SII would like this change, I'd recommend that we make the ordering
    > change in UCA (which will then affect ICU), but not make a strength change (it
    > would have to be extremely exotic for that to make a difference).
    > </MARK>
    >
    > Personally, I would go for the strength change, but I understand the
    > adverse considerations. I will have to take the matter to SII.
    >
    > <MARK>
    > B. Dagesh
    > > 2) There is something strange in the combinations of Shin with Dagesh and
    > > dots: for all other letters, the form without Dagesh sorts before the form
    > > with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
    > > combinations with Dagesh. I cannot imagine a justification for that.
    >
    > We have currently in UCA the following (from UCA 4.0.0d1 (beta))
    > 05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
    > 05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
    > 05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
    > 05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
    > 05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
    > 05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
    > 05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
    > 05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
    > 05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
    > 05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
    > 05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
    > 05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
    > 05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
    > 05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
    > 05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
    > FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
    >
    > To make this change, we would move Dagesh to after SIN DOT. Question: should
    > it also go after VARIKA or not?
    > </MARK>
    >
    > I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
    > of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
    > so Dagesh should go after Varika.
    >
    >
    > Shalom (Regards), Mati
    > Bidi Architect
    > Globalization Center Of Competency - Bidirectional Scripts
    > IBM Israel
    > Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
    >
    >
    >


