Hebrew collation, was: Merging combining classes

From: Peter Kirk (peterkirk@qaya.org)
Date: Tue Oct 28 2003 - 05:25:35 CST


On 27/10/2003 16:39, Philippe Verdy wrote:

>From: "Peter Kirk" <peterkirk@qaya.org>
>
>
>
>>On 27/10/2003 12:28, Mark Davis wrote:
>>
>>
>>
>>>Collation is very different, and already has mechanisms for dealing with
>>>sequences. So no CGJ is needed there (except for case 2).
>>>
>>>Mark
>>>
>>>
>>>
>>>
>>>
>>Mark, can you outline what these mechanisms are or point me to a
>>definition e.g. in a section of UTR #10? As I had understood it, the
>>only way to deal with sequences of the sort I have in mind is to list
>>each possible sequence individually as a contraction. The Logical_Order_Exception
>>property (see http://www.unicode.org/reports/tr10/ section 3.1.3) just
>>might be useful, but doesn't seem to have the necessary flexibility as
>>it causes a character to be swapped with ANY following character, not
>>just with any of a restricted list of such characters. The backwards
>>marking used for French accents (section 3.1.2) seems to apply over too
>>long a string.
>>
>>
>
>The backwards marking is not restricted to French accents in collation
>level 2. You can use reverse ordering at any tailored level to fit other
>needs, and you can also insert an extra collation level.
>
>So I think that Mark is right here, as it gives you full control over the
>length of the collating sequence at each level of the collation keys. Case 2
>is effectively an exception.
>
>The bad thing is that the current default UCA ordering table does not create
>such collation keys with intermediate levels for Hebrew vowels, and you
>need tailoring to create a base level with consonants, one level with
>vowels, a third level for sin/shin dots, a fourth for meteg, a fifth for
>accents... unless the text is encoded in logical order using the
>CCO-convention.
>
>Philippe.
>
>
I know there was quite a lot of discussion of collation of Hebrew in
August, confused partly because it was spread over three lists (unicode,
bidi and hebrew). I don't think we found a good solution then except to
define as contractions each of several hundred possible combinations
following a shin.

I wonder if it might work (either in DUCET or in a tailored collation)
to make the Hebrew vowel distinctions a third level sort, with the
consonant modifiers dagesh, rafe and sin and shin dot at the second
level, and accents at the fourth level. Contractions could then be made
for dagesh, rafe and sin/shin dot so that the latter, which follows in
the canonical order, will be collated as if coming first; and there are
not many combinations, although we do have to allow for intervening
meteg, which has fourth level significance.
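
To make the canonical-order point concrete, here is a small Python sketch
using the standard unicodedata module. The names MARKS, canonical_classes
and canonical_order are mine, purely for illustration. It only shows that
normalization puts dagesh before meteg, rafe and the sin/shin dots; it says
nothing yet about collation weights.

```python
import unicodedata

# The consonant modifiers and meteg discussed above (names are illustrative).
MARKS = {
    "dagesh": "\u05BC",
    "meteg": "\u05BD",
    "rafe": "\u05BF",
    "shin dot": "\u05C1",
    "sin dot": "\u05C2",
}


def canonical_classes():
    """Return the canonical combining class of each mark."""
    return {name: unicodedata.combining(ch) for name, ch in MARKS.items()}


def canonical_order(text):
    """Return the code points of text after canonical reordering (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


if __name__ == "__main__":
    print(canonical_classes())
    # Shin typed with shin dot first, then meteg, then dagesh.
    print(canonical_order("\u05E9\u05C1\u05BD\u05BC"))
```

Because the sin and shin dots have the highest classes of the group, they
always end up last after normalization, and only meteg or rafe can come
between them and a dagesh.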

Thus we might need something like the following data, with some of the
values chosen arbitrarily (i.e. for what was least editing from my source!):

05B0 ; [.0000.0000.00B2.05B0] # HEBREW POINT SHEVA
05B1 ; [.0000.0000.00B3.05B1] # HEBREW POINT HATAF SEGOL
05B2 ; [.0000.0000.00B4.05B2] # HEBREW POINT HATAF PATAH
05B3 ; [.0000.0000.00B5.05B3] # HEBREW POINT HATAF QAMATS
05B4 ; [.0000.0000.00B6.05B4] # HEBREW POINT HIRIQ
05B5 ; [.0000.0000.00B7.05B5] # HEBREW POINT TSERE
05B6 ; [.0000.0000.00B8.05B6] # HEBREW POINT SEGOL
05B7 ; [.0000.0000.00B9.05B7] # HEBREW POINT PATAH
05B8 ; [.0000.0000.00BA.05B8] # HEBREW POINT QAMATS
05B9 ; [.0000.0000.00BB.05B9] # HEBREW POINT HOLAM
05BB ; [.0000.0000.00BC.05BB] # HEBREW POINT QUBUTS
05BC ; [.0000.00BD.0002.05BC] # HEBREW POINT DAGESH OR MAPIQ
05BC 05C1 ; [.0000.00C1.0002.05C1] [.0000.00BD.0002.05BC] # dagesh and shin dot
05BC 05C2 ; [.0000.00C2.0002.05C2] [.0000.00BD.0002.05BC] # dagesh and sin dot
05BC 05BD 05C1 ; [.0000.00C1.0002.05C1] [.0000.00BD.0002.05BC] [.0000.0000.0000.05BD] # dagesh, meteg and shin dot
05BC 05BD 05C2 ; [.0000.00C2.0002.05C2] [.0000.00BD.0002.05BC] [.0000.0000.0000.05BD] # dagesh, meteg and sin dot
05BF ; [.0000.00C0.0002.05BF] # HEBREW POINT RAFE
05BF 05C1 ; [.0000.00C1.0002.05C1] [.0000.00C0.0002.05BF] # rafe and shin dot
05BF 05C2 ; [.0000.00C2.0002.05C2] [.0000.00C0.0002.05BF] # rafe and sin dot
05C1 ; [.0000.00C1.0002.05C1] # HEBREW POINT SHIN DOT
05C2 ; [.0000.00C2.0002.05C2] # HEBREW POINT SIN DOT

plus in principle some extra contractions with both dagesh and rafe.
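
As a rough check of how such data would behave, here is a Python sketch
that builds four-level sort keys from the entries above. It is not an
implementation of the UCA: it matches only contiguous contractions, and
several things in it are my own assumptions for illustration, namely the
standalone meteg entry, the fallback weights given to consonants
(primary = code point, secondary 0020, tertiary 0002), and the names
parse_table, elements and sort_key.

```python
import re
import unicodedata

# Entries copied from the table above. The standalone 05BD line is an
# assumption, added so that a lone meteg matters only at the fourth level.
TABLE_TEXT = """
05B0 ; [.0000.0000.00B2.05B0]
05B1 ; [.0000.0000.00B3.05B1]
05B2 ; [.0000.0000.00B4.05B2]
05B3 ; [.0000.0000.00B5.05B3]
05B4 ; [.0000.0000.00B6.05B4]
05B5 ; [.0000.0000.00B7.05B5]
05B6 ; [.0000.0000.00B8.05B6]
05B7 ; [.0000.0000.00B9.05B7]
05B8 ; [.0000.0000.00BA.05B8]
05B9 ; [.0000.0000.00BB.05B9]
05BB ; [.0000.0000.00BC.05BB]
05BC ; [.0000.00BD.0002.05BC]
05BC 05C1 ; [.0000.00C1.0002.05C1] [.0000.00BD.0002.05BC]
05BC 05C2 ; [.0000.00C2.0002.05C2] [.0000.00BD.0002.05BC]
05BC 05BD 05C1 ; [.0000.00C1.0002.05C1] [.0000.00BD.0002.05BC] [.0000.0000.0000.05BD]
05BC 05BD 05C2 ; [.0000.00C2.0002.05C2] [.0000.00BD.0002.05BC] [.0000.0000.0000.05BD]
05BD ; [.0000.0000.0000.05BD]
05BF ; [.0000.00C0.0002.05BF]
05BF 05C1 ; [.0000.00C1.0002.05C1] [.0000.00C0.0002.05BF]
05BF 05C2 ; [.0000.00C2.0002.05C2] [.0000.00C0.0002.05BF]
05C1 ; [.0000.00C1.0002.05C1]
05C2 ; [.0000.00C2.0002.05C2]
"""

ELEMENT = re.compile(
    r"\[\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\.([0-9A-F]{4})\]"
)


def parse_table(text):
    """Map each character sequence to its list of 4-weight elements."""
    table = {}
    for line in text.strip().splitlines():
        chars, _, rest = line.partition(";")
        key = "".join(chr(int(cp, 16)) for cp in chars.split())
        table[key] = [tuple(int(w, 16) for w in m) for m in ELEMENT.findall(rest)]
    return table


TABLE = parse_table(TABLE_TEXT)
MAX_LEN = max(len(key) for key in TABLE)


def elements(text):
    """Collation elements for text, longest contiguous match first."""
    text = unicodedata.normalize("NFD", text)
    out, i = [], 0
    while i < len(text):
        for n in range(MAX_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in TABLE:
                out.extend(TABLE[chunk])
                i += len(chunk)
                break
        else:
            # Assumed fallback for consonants and anything else not listed.
            cp = ord(text[i])
            out.append((cp, 0x20, 0x02, cp))
            i += 1
    return out


def sort_key(text):
    """Concatenate the non-zero weights level by level, 0 as separator."""
    ces = elements(text)
    key = []
    for level in range(4):
        key.extend(ce[level] for ce in ces if ce[level])
        key.append(0)
    return tuple(key)


if __name__ == "__main__":
    shin = "\u05E9"
    words = [shin + "\u05BC\u05C2", shin + "\u05C1", shin + "\u05BC\u05C1"]
    for word in sorted(words, key=sort_key):
        print([f"U+{ord(ch):04X}" for ch in word])
```

With the contractions in place the second-level weights of a shin with
dagesh come out as shin dot then dagesh, so the sin/shin distinction
outranks the dagesh; without them a sin with dagesh would sort before a
plain shin, because the dagesh weight 00BD would be compared first.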

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

