Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue Aug 19 2003 - 10:24:12 EDT

Next message: Marco Cimarosti: "RE: [Way OT] Beer measurements (was: Re: Handwritten EURO sign)"

Previous message: Rick McGowan: "Re: Hexadecimal again (was RE: Clones)"
Maybe in reply to: Mark Davis: "Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)"
Next in thread: Peter Kirk: "Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)"
Reply: Peter Kirk: "Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)"

Ah, that explains it: you had filed this against ICU, not UCA, which is why I
couldn't find it in the Unicode reports.

A. Final.
> 1) Precedence of Dagesh over Final/non-Final: in the chart, the presence
> or absence of Dagesh is a Secondary difference, while Final/non-Final is a
> Tertiary difference. This is relevant only for letters Kaf and Pe. My
> gut feeling says that Final/non-Final should have precedence over
> Dagesh/no-Dagesh.
> Note that the number of actual cases where this would make a difference is
> probably *very* small.

So there are two issues for final vs non-final: strength and ordering.

A1. Ordering is easy to change; in ICU or UCA we could put the final values
before the independent letters. In ICU they are just rules, while in UCA they
follow
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table. The
easiest change in UCA would be to give the 5 independent forms that have finals
the tertiary value <isolated>.
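
As a rough illustration of A1, here is a minimal Python sketch of a simplified
three-level sort key, applied to the "zip" / "bristle" pair from Mati's message
quoted below. The tertiary values NONE (0002) and final (0019) are the ones
mentioned in this thread; the value used for <isolated>, the primary and
secondary weights, and all function names are assumptions made for the example,
not actual UCA data.

```python
# Tertiary weights. NONE and FINAL are the values quoted in this thread;
# ISOLATED is an assumed placeholder that only needs to be greater than FINAL.
NONE, FINAL, ISOLATED = 0x0002, 0x0019, 0x001A

# Hypothetical primary weights; only their relative order matters here.
PRIMARY = {"zayin": 0x10, "yod": 0x13, "pe": 0x1A}
COMMON_SECONDARY = 0x0020  # assumed: no points, so all secondaries are equal
HAS_FINAL = {"kaf", "mem", "nun", "pe", "tsadi"}  # letters with a final form


def tertiary(letter, is_final, proposed):
    """Tertiary weight of one letter under the current or proposed scheme."""
    if is_final:
        return FINAL
    if proposed and letter in HAS_FINAL:
        return ISOLATED  # proposed: non-final form now sorts after the final
    return NONE


def sort_key(word, proposed=False):
    """word is a list of (letter, is_final) pairs; returns a 3-level key."""
    primaries = tuple(PRIMARY[letter] for letter, _ in word)
    secondaries = tuple(COMMON_SECONDARY for _ in word)
    tertiaries = tuple(tertiary(l, f, proposed) for l, f in word)
    return (primaries, secondaries, tertiaries)


ZIP = [("zayin", False), ("yod", False), ("pe", False)]  # "zip" file
ZIF = [("zayin", False), ("yod", False), ("pe", True)]   # "bristle"

current_order = sorted([ZIP, ZIF], key=lambda w: sort_key(w, proposed=False))
proposed_order = sorted([ZIP, ZIF], key=lambda w: sort_key(w, proposed=True))
```

With the current weights the word ending in Pe sorts first; with the assumed
<isolated> weight on the non-final form, the word ending in Final Pe sorts
first. The primary and secondary levels are identical in both cases, so only
the tertiary level decides.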

Note: there is one minor fallout in ICU: we optimize the sortkey compression of
tertiary values of NONE; if we change the ordering then each instance of the
<isolated> letters will mean about a 2-3 byte increase in sort-key sizes.

A2. For Strength, it is not as clear cut. If Final vs non-Final is more
important than dagesh, etc, the easiest thing is to make it a primary
difference; but that would make

Zayin Yod PeFinal

sort before all words

Zayin Yod Pe XXX

But I'm guessing that is probably not desired for Hebrew.

In ICU we could make Final vs non-Final be a secondary difference, and have
Dagesh, etc. be tertiary differences. The disadvantage is that people tend to
expect the 2nd level to be 'accent-like', and the change might introduce more
inconsistencies in practice than it would remove.
In Unicode, the UCA has more production restrictions as per
http://www.unicode.org/reports/tr10/tr10-10.html#Tertiary_Weight_Table, so it
would be a bit harder to make that change.
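
To show when the strength question actually matters, here is a small Python
sketch comparing a plain Pe with a Final Pe carrying a Dagesh. It is a
deliberate simplification: the Dagesh is folded into the letter's key rather
than treated as its own collation element, finals are assumed to be ordered
before non-finals, and the weight and the function names are made up for the
example.

```python
P_PE = 0x1A  # hypothetical primary weight shared by Pe and Final Pe


def key(is_final, has_dagesh, final_outranks_dagesh):
    """Three-level key for a single Pe; lower ranks sort first."""
    form_rank = 0 if is_final else 1      # assumes finals are ordered first
    dagesh_rank = 1 if has_dagesh else 0  # plain before Dagesh
    if final_outranks_dagesh:
        return (P_PE, form_rank, dagesh_rank)  # Final is the 2nd level
    return (P_PE, dagesh_rank, form_rank)      # Dagesh is the 2nd level


plain_pe = dict(is_final=False, has_dagesh=False)
final_pe_dagesh = dict(is_final=True, has_dagesh=True)


def final_pe_dagesh_first(final_outranks_dagesh):
    """True if Final Pe + Dagesh sorts before plain Pe under this scheme."""
    return (key(**final_pe_dagesh, final_outranks_dagesh=final_outranks_dagesh)
            < key(**plain_pe, final_outranks_dagesh=final_outranks_dagesh))
```

In this model the two schemes disagree only when the two strings differ in both
Final-ness and Dagesh, and in opposite directions; if only one of the two
properties differs, both schemes give the same order.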

So if SII would like this change, I'd recommend that we make the ordering change
in UCA (which will then affect ICU), but not make a strength change (a case
would have to be extremely exotic for that to make a difference).

Cf. http://www.unicode.org/charts/collation/chart_Hebrew.html

B. Dagesh
> 2) There is something strange in the combinations of Shin with Dagesh and
> dots: for all other letters, the form without Dagesh sorts before the form
> with Dagesh. But Shin with Sin/Shin dot sort after their corresponding
> combinations with Dagesh. I cannot imagine a justification for that.

We currently have the following in UCA (from UCA 4.0.0d1 (beta)):
05B0 ; [.0000.00B2.0002.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.00B3.0002.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.00B4.0002.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.00B5.0002.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.00B6.0002.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.00B7.0002.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.00B8.0002.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.00B9.0002.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.00BA.0002.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.00BB.0002.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.00BC.0002.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT
FB1E ; [.0000.00C3.0002.FB1E] # HEBREW POINT JUDEO-SPANISH VARIKA

To make this change, we would move Dagesh to after SIN DOT. Question: should it
also go after VARIKA or not?
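
A sketch of why the move would help, in Python. It compares only the secondary
level, using the Dagesh, SHIN DOT and SIN DOT weights from the beta table above.
The base-letter secondary weight, the "moved" Dagesh weight, and the function
name are assumptions for the example; the moved value is just a placeholder
above SIN DOT and says nothing about where VARIKA should end up. The sketch
relies on canonical ordering placing Dagesh before the Shin/Sin dots.

```python
import unicodedata

DAGESH, SHIN_DOT, SIN_DOT = "\u05BC", "\u05C1", "\u05C2"
SHIN = "\u05E9"

# Secondary weights copied from the UCA 4.0.0d1 (beta) excerpt above.
BETA = {DAGESH: 0x00BD, SHIN_DOT: 0x00C1, SIN_DOT: 0x00C2}
# Hypothetical reassignment: Dagesh gets a placeholder weight above SIN DOT.
MOVED = {SHIN_DOT: 0x00C1, SIN_DOT: 0x00C2, DAGESH: 0x00C3}

BASE_SECONDARY = 0x0020  # assumed secondary weight of an unmarked base letter


def secondary_key(text, mark_weights):
    """Secondary weights of text, taken after canonical (NFD) reordering."""
    normalized = unicodedata.normalize("NFD", text)
    return tuple(mark_weights.get(ch, BASE_SECONDARY) for ch in normalized)


shin_with_dot = SHIN + SHIN_DOT
shin_with_dagesh_and_dot = SHIN + DAGESH + SHIN_DOT

# Beta weights: the Dagesh weight is compared against SHIN DOT and is
# smaller, so the form with Dagesh sorts first.
dagesh_first_in_beta = (secondary_key(shin_with_dagesh_and_dot, BETA)
                        < secondary_key(shin_with_dot, BETA))

# Moved weights: Dagesh is now larger than SHIN DOT, so the form without
# Dagesh sorts first, matching the other letters.
dagesh_first_after_move = (secondary_key(shin_with_dagesh_and_dot, MOVED)
                           < secondary_key(shin_with_dot, MOVED))
```

For a letter with no dot, the form without Dagesh has a key that is a prefix of
the form with Dagesh, so it sorts first under either set of weights.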

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Matitiahu Allouche" <matial@il.ibm.com>
To: "Mark Davis" <mark.davis@jtcsv.com>
Cc: <unicode@unicode.org>; <bidi@unicode.org>; <indic@unicode.org>
Sent: Tuesday, August 19, 2003 01:21
Subject: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

> Hello, Mark!
>
> There must be some hole in your email archive :-), since you yourself
> expressed your personal take on the issues. On 04/05/03 (probably 4th of
> May rather than 5th of April) you wrote me:
> <QUOTE>
> From: Mark Davis@IBMUS on 04/05/2003 03:22
> To: Matitiahu Allouche/Israel/IBM@IBMIL
> cc: Israel Gidali/Israel/IBM@IBMIL
>
> From: Mark Davis/Cupertino/IBM@IBMUS
> Subject: Bug on Hebrew Collation
> Importance: Urgent
>
>
> I am working through some collation bugs, and had a question about:
>
> http://www.jtcsv.com/cgibin/icu-bugs/collation?id=1489;user=guest
>
> Mati, your comments look reasonable. I am, however, a little nervous since
> as far as I know, the Israeli government committee had input into the
> basic table for ISO 14651, which is reflected in the UCA. (We don't modify
> it for Hebrew). Can you confirm with them that these tailorings should be
> made?
>
> Mark
> </QUOTE>
>
> I did not formally submit anything to the UTC, though, so I may be
> responsible for my own misfortune. At that time, I had 4 remarks. It
> seems that 2 of them have been implemented, and the 2 others have not.
>
> I have second thoughts about the tertiary weight allocated to final
> letters (0019) as compared to that allocated to non-final letters (0002).
> That means that final letters are collated *after* the corresponding
> non-final letters. This goes against accepted Hebrew usage. In normal
> cases, the non-final letter will be followed by some more letters, so that
> there will be a primary difference, but exotic cases will be sorted
> improperly. An example that comes to mind is transliteration of
> non-Hebrew words. For instance a "zip" file will be transliterated as
> "Zayin Yod Pe" (Google gives 2840 hits for this orthograph). There is a
> Hebrew word pronounced "zif" (meaning "bristle") which is written
> identically except that the last letter is a Final Pe. I expect the "zip"
> file to be collated *after* the "bristle", but this will not happen with
> the current collation table.
>
> I would feel more comfortable if:
> a) Final letters had a smaller weight than the corresponding non-final
> letters (for some level >1).
> b) The level associated with final/non-final was more significant than the
> level associated with diacritics (Dagesh and/or other Hebrew points).
> It is not that I have so many really convincing examples that would be
> broken with the current collation definition, but I think that having
> weights which reflect the linguistic guidelines is more likely to
> successfully handle the cases that we have not considered.
>
> Shalom (Regards), Mati
> Bidi Architect
> Globalization Center Of Competency - Bidirectional Scripts
> IBM Israel
> Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160
>
>
> To: Matitiahu Allouche/Israel/IBM@IBMIL
> cc: <unicode@unicode.org>, <bidi@unicode.org>, <indic@unicode.org>
> Subject: Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)
>
>
> I'm sorry that you haven't gotten responses before. I have searched through
> my email archive, and can't find anything like the message, and I don't
> think it was brought up to the UTC formally.
>
> The first one seems odd, and as you say, it would seem to only affect a
> vanishingly small number of characters; since these are final characters,
> one presumes there would be subsequent characters that would form a larger
> difference anyway.
>
> Mark
>
>
>


This archive was generated by hypermail 2.1.5 : Tue Aug 19 2003 - 11:33:44 EDT