Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Aug 19 2003 - 18:11:23 EDT

Next message: Michael Everson: "Re: Last Resort Font"

Previous message: Peter Kirk: "Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)"
In reply to: Mark Davis: "Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)"
Next in thread: Mark Davis: "Re: [bidi] Re: Unicode Collation Algorithm: 4.0 Update (beta)"

Resending with the correct address...

On 19/08/2003 14:23, Mark Davis wrote:

>Three points.
>
>First, while we try to make the UCA collation table (DUCET) as reasonable as
>possible for the main languages of a given script, it is not guaranteed to
>produce the correct sorting for any particular language. The UCA *is* designed
>so that it provides a default base ordering for all of Unicode, and individual
>languages can be given tailorings of the DUCET that handle the specifics of
>their string comparison requirements.
>
>Thus if there are changes that improve the handling of the UCA for the major
>languages using a given script, and do not destabilize others, those are
>candidates for change in a version. For example, if it turned out that a
>particular Tamil character (or sequence of characters!) was not sorted correctly
>according to the DUCET (e.g. on http://www.unicode.org/charts/collation/beta/),
>then it would be a candidate, and should be submitted on the form.
>
>
Understood. On this basis, the DUCET sorting for the Hebrew block should
be based on the requirements for modern Hebrew, with Yiddish, Ladino etc.
also being taken into account.

>Second, we do and should favor modern language communities when making
>incompatible tradeoffs. So if we have the choice between making French sort
>correctly without tailoring, or have Latin sort correctly without tailoring, we
>should choose the modern community. The Latin community can always use a
>tailored UCA, in any event.
>
>
Understood. I accept the primacy of the modern language in this case.
There may be some issues on which the modern language has no
preference, especially for characters only used in older Hebrew, and in
such cases it would make sense to follow the preferences of scholars of
ancient Hebrew. If it becomes necessary to use a tailored UCA for
biblical work, so be it, but I would prefer not to. We have come close
to having to use a separate set of vowels for biblical Hebrew simply
because decisions were rushed and then frozen on the basis of modern
Hebrew requirements. I don't want any danger of falling into the same
kind of trap with collation.

>Third, there is often a serious confusion between sorting weight and canonical
>ordering. The fact that a grave accent precedes a cedilla in canonical order is
>*completely independent of* whatever collation weights each of them has, either
>in a tailoring or in the DUCET. The only substantive issue is how each of these
>sorts separately or in combination. And making the combination (sequence) of
>grave and cedilla sort before grave, after grave, before cedilla, or after
>cedilla are all possible; all of those can be handled by the UCA as
>contractions. See http://www.unicode.org/reports/tr10/tr10-10.html for more
>information.
>
>
Yes, I understand that the collation weights are quite independent of
the canonical combining classes. But collation does become trickier
when the canonical ordering is not the expected one, because of the
assumption that collation is based on the order of the string, i.e. based
on the first character, then the second, etc.
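
To make the point about canonical ordering concrete, here is a minimal
sketch using Python's standard unicodedata module (Python is an
assumption made purely for illustration; the constant names and the
describe() helper are invented for this example). It takes a shin whose
marks are typed in one plausible order, normalizes it to NFD, and lists
the combining class of each character in the result. The classes
reported are whatever the Unicode data bundled with the Python build
contains.

```python
import unicodedata

# Code points for one Hebrew consonant and four of its marks.
SHIN = "\u05E9"
SHIN_DOT = "\u05C1"
DAGESH = "\u05BC"
METEG = "\u05BD"
PATAH = "\u05B7"


def describe(text):
    """Return (character name, canonical combining class) for each character."""
    return [(unicodedata.name(ch), unicodedata.combining(ch)) for ch in text]


# Marks in an order a typist might choose: shin dot, dagesh, vowel, meteg.
typed = SHIN + SHIN_DOT + DAGESH + PATAH + METEG

# NFD sorts each run of non-zero combining classes into ascending order.
canonical = unicodedata.normalize("NFD", typed)

for name, ccc in describe(canonical):
    print(f"{ccc:3d}  {name}")
```

After normalization the vowel sits between the shin and the dagesh, and
the shin dot ends up last, which is the order a collation table
actually has to deal with.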

Well, I am glad that contractions provide a way around that problem. So
perhaps we ought to be looking at using them for Hebrew in DUCET. I
guess we should consider defining contractions for each case of
<consonant, dagesh> which differ from the consonant at the second level
only, perhaps also the same for rafe, and similarly for each combination
of shin, shin/sin dot and dagesh. The problem is that the vowels
intrude between the consonant and the dagesh, and meteg comes before
shin/sin dot, so there is a potential need for a rather large number of
contractions, especially if we consider a shin with a right meteg which
might come out as:

<shin, dagesh, meteg, CGJ, {any one of 11 vowels}, {optional shin dot |
sin dot}, masora circle>

with the CGJ inhibiting complete canonical reordering, and the shin/sin
dot must be contracted with the shin.
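
As a rough illustration of what such a contraction would do, and of why
an intruding vowel is awkward, here is a toy sketch in Python. Every
weight in the table is hypothetical and has nothing to do with the real
DUCET values; the function names are invented for this example; and the
matcher only handles contiguous sequences, whereas the UCA itself
specifies more elaborate matching rules.

```python
SHIN = "\u05E9"
DAGESH = "\u05BC"
PATAH = "\u05B7"

# Hypothetical (primary, secondary) weights; NOT taken from the DUCET.
TABLE = {
    SHIN: [(0x200, 0x20)],
    PATAH: [(0, 0x28)],            # ignorable at the primary level
    DAGESH: [(0, 0x30)],           # ignorable at the primary level
    SHIN + DAGESH: [(0x200, 0x21)],  # contraction: same primary as shin
}
MAX_LEN = max(len(key) for key in TABLE)


def collation_elements(text):
    """Greedy longest-match lookup over contiguous substrings only."""
    elements, i = [], 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in TABLE:
                elements.extend(TABLE[chunk])
                i += n
                break
        else:
            raise KeyError(f"no toy weight for U+{ord(text[i]):04X}")
    return elements


def sort_key(text):
    """Primary weights first, then secondary weights; zero weights are skipped."""
    ces = collation_elements(text)
    return ([p for p, _ in ces if p], [s for _, s in ces if s])


plain = sort_key(SHIN)
contracted = sort_key(SHIN + DAGESH)
# In canonical order the vowel sits between shin and dagesh.
interrupted = sort_key(SHIN + PATAH + DAGESH)
```

With this table, plain and contracted share a primary weight and differ
only at the second level, as intended. In interrupted the contiguous
lookup never sees <shin, dagesh>, so the dagesh falls back to its
stand-alone weight. Covering such cases by listing longer sequences in
the table is what makes the number of contractions grow.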

Perhaps we need to specify that dagesh and shin/sin dot must always come
BEFORE any CGJ in such combinations so that they don't get separated too
far from the base character. In fact I think I will change my document
to specify that.
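
The effect of CGJ on reordering, and of keeping the shin dot before it,
can be checked with a short Python sketch using the standard
unicodedata module. The variable names are invented for the
illustration, patah stands in for any one of the vowels, and the
behaviour shown depends on the Unicode data shipped with the Python
build.

```python
import unicodedata

SHIN = "\u05E9"
DAGESH = "\u05BC"
METEG = "\u05BD"
SHIN_DOT = "\u05C1"
PATAH = "\u05B7"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0


def nfd(text):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)


# No CGJ: the vowel is reordered in front of dagesh and meteg.
no_cgj = SHIN + DAGESH + METEG + PATAH + SHIN_DOT

# CGJ after meteg: reordering cannot cross it, so meteg stays before the
# vowel, but the shin dot is stranded on the far side of the CGJ.
dot_after_cgj = SHIN + DAGESH + METEG + CGJ + PATAH + SHIN_DOT

# Shin dot placed before the CGJ: still stable, and closer to the shin.
dot_before_cgj = SHIN + DAGESH + METEG + SHIN_DOT + CGJ + PATAH

for label, text in [("no CGJ", no_cgj),
                    ("dot after CGJ", dot_after_cgj),
                    ("dot before CGJ", dot_before_cgj)]:
    result = nfd(text)
    print(f"{label:15} unchanged by NFD: {result == text}, "
          f"shin dot at index {result.index(SHIN_DOT)}")
```

Both CGJ variants survive normalization unchanged; the difference is
only how far the shin dot sits from its base character, which matters
if it has to be contracted with the shin.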

PS Is there a problem with the Unicode Hebrew list? Nothing seems to
have appeared on it today, including my previous posting on this thread
and Mark's reply to it.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Tue Aug 19 2003 - 18:54:20 EDT