Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Aug 21 2003 - 13:05:23 EDT


    On 21/08/2003 08:21, Mark Davis wrote:

    >OK, it sounds like we have clarity on two items:
    >
    >- change the non-final characters to <isolated>
    >- moving dagesh to after RAFE, SHIN DOT, SIN DOT, VARIKA
    >
    >I'll talk to Ken about whether we have time to get them into UCA 4.0.0, and in
    >any event we can get them into ICU 2.8 for Hebrew.
    >
    >
    These changes are certainly a move in the right direction, but only part
    of the way. If we can get these in quickly, that would be good. But we
    mustn't let things rest there.
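
    To make the second change concrete, here is a minimal sketch in Python.
    It uses the secondary weights from the UCA 4.0.0d1 beta table quoted
    further down in Mark's message. The helper names (move_after,
    secondary_key) are hypothetical, and reusing the same weight values in a
    new order is only my assumption about how the move might be done, not
    how the UCA data is actually generated.

```python
# Secondary weights for some Hebrew points, as quoted from the
# UCA 4.0.0d1 beta table (code point -> secondary weight).
BETA_SECONDARY = {
    0x05BB: 0x00BC,  # HEBREW POINT QUBUTS
    0x05BC: 0x00BD,  # HEBREW POINT DAGESH OR MAPIQ
    0x05BF: 0x00C0,  # HEBREW POINT RAFE
    0x05C1: 0x00C1,  # HEBREW POINT SHIN DOT
    0x05C2: 0x00C2,  # HEBREW POINT SIN DOT
    0xFB1E: 0x00C3,  # HEBREW POINT JUDEO-SPANISH VARIKA
}


def move_after(weights, moved, anchor):
    """Return new weights with `moved` ordered immediately after `anchor`.

    Assumption: the existing weight values are reused; only their
    assignment to code points changes.
    """
    order = sorted(weights, key=weights.get)  # code points in weight order
    order.remove(moved)
    order.insert(order.index(anchor) + 1, moved)
    return dict(zip(order, sorted(weights.values())))


def secondary_key(points, weights):
    """Secondary weights of a sequence of points, in string order."""
    return [weights[p] for p in points]


DAGESH, SHIN_DOT, VARIKA = 0x05BC, 0x05C1, 0xFB1E
PROPOSED = move_after(BETA_SECONDARY, DAGESH, VARIKA)

# Points on a shin; the base letter's own weight is the same in both
# cases, so it is left out. Dagesh precedes shin dot in canonical order.
plain = [SHIN_DOT]
with_dagesh = [DAGESH, SHIN_DOT]

for name, table in (("beta", BETA_SECONDARY), ("proposed", PROPOSED)):
    first = "without dagesh" if (
        secondary_key(plain, table) < secondary_key(with_dagesh, table)
    ) else "with dagesh"
    print(f"{name}: the form {first} sorts first")
```

    With the beta weights the form with dagesh sorts first, which is the
    anomaly Mati pointed out; with dagesh moved after VARIKA the form
    without dagesh sorts first, as for the other letters.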

    >As far as the strength issue of final vs dagesh, I don't think we should take
    >any immediate action. The collation strength also affects matching. If a user
    >sets the sorting or matching level to "ignore accents", for example, they
    >probably expect the dots to be ignored then, as well as graves, acutes, etc. If
    >this showed up in a lot of words, then it would still be worth doing, I suspect.
    >But because the number of cases is so very small where you would have a
    >combination of dageshes and finals that would make a difference, ...
    >
    Yes, the number of cases where the relative ordering of dagesh and final
    forms is important is vanishingly small, because final forms are nearly
    always predictable anyway.

    Nevertheless, this is an important issue. It is important, certainly in
    the biblical context, that the difference between regular and final
    forms is ignored in a basic "ignore accents" type of search. And Mati
    seems to agree: he wrote: "in most cases, the difference between Final
    vs non-Final must be ignored for searches". Compare for example ignoring
    upper and lower case differences in English. I would propose putting
    the final/non-final difference at the same level as that one.
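
    As a rough illustration of what I mean, here is a minimal sketch in
    Python of matching at three strengths, with final/non-final treated
    like case (third level) and the points treated like accents (second
    level). The names match_key and matches are hypothetical, and this is
    not the UCA sort-key format or the ICU API, just a folding of the text
    before comparison.

```python
import unicodedata

# Final form -> regular form for the five Hebrew letters that have finals.
FINAL_TO_REGULAR = {
    "\u05DA": "\u05DB",  # FINAL KAF   -> KAF
    "\u05DD": "\u05DE",  # FINAL MEM   -> MEM
    "\u05DF": "\u05E0",  # FINAL NUN   -> NUN
    "\u05E3": "\u05E4",  # FINAL PE    -> PE
    "\u05E5": "\u05E6",  # FINAL TSADI -> TSADI
}


def match_key(text, strength):
    """Hypothetical comparison key for matching.

    strength 1: letters only (points and final forms ignored)
    strength 2: letters and points (final forms ignored)
    strength 3: everything distinguished
    """
    text = unicodedata.normalize("NFD", text)
    if strength < 3:
        # Fold final forms to regular forms, as case would be folded.
        text = "".join(FINAL_TO_REGULAR.get(ch, ch) for ch in text)
    if strength < 2:
        # Drop nonspacing marks, which covers the Hebrew points.
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    return text


def matches(a, b, strength):
    """True if a and b are equal when compared at the given strength."""
    return match_key(a, strength) == match_key(b, strength)


FINAL_KAF, KAF, DAGESH = "\u05DA", "\u05DB", "\u05BC"
for strength in (1, 2, 3):
    print(strength,
          matches(FINAL_KAF, KAF, strength),        # final vs regular
          matches(KAF + DAGESH, KAF, strength))     # dagesh vs none
```

    In this arrangement an "ignore accents" search (strength 1 here) also
    ignores the final/non-final difference, and so does a search that
    distinguishes the points (strength 2).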

    >... I would
    >recommend that SII approach this very carefully. If we are going to do anything,
    >it should be in the next version of UCA so that we have time to consider all of
    >the ramifications. I would not recommend it for ICU 2.8 either, even though we
    >have more time (and flexibility) there.
    >
    >
    Indeed. The issue is a lot more complex than it seems here.

    >You raise one other issue in the following:
    >
    >
    >
    >>I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
    >>of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
    >>so Dagesh should go after Varika.
    >>
    >>
    >
    >From that, it would also appear that VARIKA should either have the same weight
    >as RAFE or at least be adjacent to it. This would only be an issue for users
    >of that character, so probably difficult to establish the right behavior, and
    >thus one we would not even try to get into UCA this round.
    >
    >We should probably take this discussion off of unicode@unicode.org, and just
    >have it on bidi@unicode.org and hebrew@unicode.org. Any people interested in
    >this topic should be on those groups anyway.
    >
    Agreed. But it seems, Mark, that you are not on the Hebrew list, as your
    posting has not reached there. So I am copying your whole posting, plus
    my additions, to the Hebrew list.

    By the way, I am not on the bidi group because I am interested mainly
    in the kinds of Hebrew issues which are independent of specific bidi
    matters. Am I in fact missing out on important discussion of Hebrew?

    >
    >Mark
    >__________________________________
    >http://www.macchiato.com
    >► “Eppur si muove” ◄
    >
    >----- Original Message -----
    >From: "Matitiahu Allouche" <matial@il.ibm.com>
    >To: "Mark Davis" <mark.davis@jtcsv.com>
    >Cc: <unicode@unicode.org>; <bidi@unicode.org>
    >Sent: Thursday, August 21, 2003 00:55
    >Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
    >
    >
    >
    >
    >> Hello, Mark!
    >>
    >>In order to address your points in order, I will put excerpts of your note
    >>within <MARK> . . . </MARK> tags, and my comments as untagged text.
    >>
    >><MARK>
    >>A. Final.
    >>
    >>
    >>>1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
    >>>or absence of Dagesh is a Secondary difference, while Final/non-Final is a
    >>>Tertiary difference. This is relevant only for letters Kaf and Pe. My
    >>>gut feeling says that Final/non-Final should have precedence over
    >>>Dagesh/no-Dagesh.
    >>>Note that the number of actual cases where this would make a difference is
    >>>probably *very* small.
    >>>
    >>So there are two issues for final vs non-final: strength and ordering.
    >>
    >>A1. Ordering is easy to change; in ICU or UCA we could put the final values
    >>before the independent letters. In ICU they are just rules, while in UCA they
    >>follow http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table.
    >>The easiest in UCA would be to give the 5 independent forms that have finals
    >>the value <isolated>.
    >>
    >>Note: there is one minor fallout in ICU: we optimize the sortkey compression
    >>of tertiary values of NONE; if we change the ordering then each instance of
    >>the <isolated> letters will mean about a 2-3 byte increase in sort-key sizes.
    >></MARK>
    >>
    >>I like giving the value <isolated> to the 5 independent forms that have
    >>finals. As for the increase in sort-key sizes, this is what cheap memory
    >>is made for :-)
    >>
    >><MARK>
    >>A2. For Strength, it is not as clear cut. If Final vs non-Final is more
    >>important than dagesh, etc, the easiest thing is to make it a primary
    >>difference; but that would make
    >>
    >>Zayin Yod PeFinal
    >>
    >>sort before all words
    >>
    >>Zayin Yod Pe XXX
    >>
    >>But I'm guessing that is probably not desired for Hebrew.
    >></MARK>
    >>
    >>Why? This is exactly what I desire for Hebrew. But I am afraid that
    >>making primary differences for Final vs non-Final will make searches using
    >>a Final form not match a non-Final form and vice-versa, which is bad:
    >>in most cases, the difference between Final vs non-Final must be ignored
    >>for searches.
    >>
    >><MARK>
    >>In ICU we could make Final vs non-Final be a secondary difference, and have
    >>Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
    >>expect the 2nd level to be 'accent-like', and there might be more
    >>inconsistencies in practice than you would gain by having the current
    >>situation.
    >></MARK>
    >>
    >>I don't think that there is enough experience accumulated to create people's
    >>expectations. If this is the right thing (and I think it is), it is still
    >>early enough to do it now.
    >>
    >><MARK>
    >>In Unicode, the UCA has more production restrictions as per
    >>http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
    >>would be a bit harder to make that change.
    >>
    >>So if SII would like this change, I'd recommend that we make the ordering
    >>change in UCA (which will then affect ICU), but not make a strength change (it
    >>would have to be extremely exotic for that to make a difference).
    >></MARK>
    >>
    >>Personally, I would go for the strength change, but I understand the
    >>adverse considerations. I will have to take the matter to SII.
    >>
    >><MARK>
    >>B. Dagesh
    >>
    >>
    >>>2) There is something strange in the combinations of Shin with Dagesh and
    >>>dots: for all other letters, the form without Dagesh sorts before the form
    >>>with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
    >>>combinations with Dagesh. I cannot imagine a justification for that.
    >>>
    >>We have currently in UCA the following (from UCA 4.0.0d1 (beta))
    >>05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
    >>05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
    >>05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
    >>05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
    >>05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
    >>05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
    >>05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
    >>05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
    >>05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
    >>05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
    >>05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
    >>05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
    >>05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
    >>05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
    >>05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
    >>FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA
    >>
    >>To make this change, we would move Dagesh to after SIN DOT. Question: should
    >>it also go after VARIKA or not?
    >></MARK>
    >>
    >>I am no expert in Judeo-Spanish, but since FB1E Varika is a glyph variant
    >>of 05BF Rafe, it makes sense that Dagesh be in the same relation to both,
    >>so Dagesh should go after Varika.
    >>
    >>
    >>Shalom (Regards), Mati
    >> Bidi Architect
    >> Globalization Center Of Competency - Bidirectional Scripts
    >> IBM Israel
    >> Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
    >>
    >>
    >>
    >>
    >>
    >
    >
    >
    >
    >
    >
    >

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    


    This archive was generated by hypermail 2.1.5 : Thu Aug 21 2003 - 14:04:08 EDT