L2/03-235

Source: Ken Whistler
Date: August 5, 2003

Rick, Lisa, Cathy,

I would also like this posted as an L2 document and added to
the agenda for this topic. This is my summary of the
collation consequences of CGJ in Biblical Hebrew pointing
contexts.

I'm sorting through the thread to find the email where I
summarized the argument for CGJ as an alternative to
cloning points or changing combining classes, and I'll
forward that, momentarily, too.

--Ken

------------- Begin Forwarded Message -------------

Date: Mon, 28 Jul 2003 19:05:54 -0700 (PDT)
Subject: Re: Yerushala(y)im - or Biblical Hebrew
To: peter.r.kirk@ntlworld.com
Cc: unicode@unicode.org, kenw@sybase.com
X-archive-position: 7497
X-list: unicode

Peter Kirk asked:


> One question arises. If CGJ is used as proposed, so we have sequences 
> such as patah CGJ hiriq and perhaps meteg CGJ vowel, does this imply 
> that these sequences will necessarily be treated in collation as 
> distinct from simple patah hiriq and meteg vowel sequences (the latter 
> would of course be reversed by normalisation)? This is a simple 
> question. 


Yes. But perhaps not quite in the way you expect.

CGJ is specified as being ignored by default, and will be
weighted as such in the allkeys.txt table for the Unicode Collation
Algorithm.

To get combinations with CGJ to weight differently, you have to
take some positive action to tailor those combinations distinctly.


> I'm not yet sure if this would be desirable or not. Well, it 
> would probably be better for meteg CGJ vowel to be collated the same as 
> vowel meteg, as the distinction here is graphical but not semantic. As 
> for patah CGJ hiriq, an advantage of collating this sequence the same as 
> hiriq patah would be that existing texts which do not have CGJ here 
> would be collated together with ones which do, and perhaps that users 
> doing searches would not have to type the CGJ. But is this perhaps 
> something for which specific collation rules can be tailored?


Yes.

However, there are some subtle considerations in collation
weighting in the UCA, which may not be evident at first.
For example, let's consider the weighting, by the default
UCA table, allkeys.txt, for a sequence in question, assuming
the future version of allkeys.txt, in which U+034F CGJ will
be given a completely ignorable collation weight (i.e.
[.0000.0000.0000.0000]).

<lamed, patah, hiriq, finalmem>

is canonically equivalent to:

<lamed, hiriq, patah, finalmem>

(which is the problem, in the first place, of course)

It is *not* canonically equivalent to:

<lamed, patah, CGJ, hiriq, finalmem>
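
The equivalences above can be checked mechanically. What follows
is only a minimal sketch, assuming Python's standard unicodedata
module; the variable names and the nfd() helper are mine, purely
for illustration.

```python
import unicodedata

LAMED, HIRIQ, PATAH, FINAL_MEM, CGJ = "\u05DC", "\u05B4", "\u05B7", "\u05DD", "\u034F"

patah_hiriq = LAMED + PATAH + HIRIQ + FINAL_MEM
hiriq_patah = LAMED + HIRIQ + PATAH + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM


def nfd(s):
    """Return the NFD (canonical decomposition) form of s."""
    return unicodedata.normalize("NFD", s)


# Canonical combining classes: hiriq sorts before patah, and CGJ has
# class 0, so it blocks reordering across it.
classes = [unicodedata.combining(c) for c in (HIRIQ, PATAH, CGJ)]

# True: both orders normalize to <lamed, hiriq, patah, finalmem>.
same_without_cgj = nfd(patah_hiriq) == nfd(hiriq_patah)

# False: the CGJ sequence is not canonically equivalent to either.
same_with_cgj = nfd(with_cgj) == nfd(hiriq_patah)
```

The sequence with the CGJ is already in NFD, so normalization
leaves it untouched.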

One of the requirements of the UCA is that canonically equivalent
sequences *must* be given equivalent collation weights. The
converse is not true, since of course one of the points of the
exercise for collation is that different sequences *may* be
given equivalent collation weights.

The Unicode Collation Algorithm ensures that canonically equivalent
sequences are given equivalent collation weights by *requiring*
that NFD normalization be applied *before* looking up collation weights.

So for the above examples, what weights would we end up with?

<lamed, patah, hiriq, finalmem> --NFD--> <lamed, hiriq, patah, finalmem>

and then weights to:

05DC  ; [.0EC2.0020.0002.05DC] # HEBREW LETTER LAMED
05B4  ; [.0000.00AF.0002.05B4] # HEBREW POINT HIRIQ
05B7  ; [.0000.00B2.0002.05B7] # HEBREW POINT PATAH
05DD  ; [.0EC3.0020.0019.05DD] # HEBREW LETTER FINAL MEM; QQK

and for that sequence you would construct the weighted key as:

0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019

(before application of any of the techniques for key compression).

Now for the newly suggested representation, we would get:

<lamed, patah, CGJ, hiriq, finalmem>

05DC  ; [.0EC2.0020.0002.05DC] # HEBREW LETTER LAMED
05B7  ; [.0000.00B2.0002.05B7] # HEBREW POINT PATAH
034F  ; [.0000.0000.0000.0000] # COMBINING GRAPHEME JOINER
05B4  ; [.0000.00AF.0002.05B4] # HEBREW POINT HIRIQ
05DD  ; [.0EC3.0020.0019.05DD] # HEBREW LETTER FINAL MEM; QQK

and the weighted key would end up as:

0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019

Comparing the two, you see:

0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019
0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019
                    ^^^^^^^^^
                    
They aren't equal, at the secondary level. And why? Well,
precisely because the first two strings (the ones without
the CGJ) both normalize to the <hiriq, patah> order, while the
last string, with the CGJ, is the representation for the desired
<patah, hiriq> sequence.

This is, of course, precisely the desired result -- the CGJ is
ignored for weighting, but its presence prevents the reordering
of the vowels into the undesired sequence by normalization.
And the resultant weighted key weights the vowels in the correct
order.
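
The key construction walked through above can be sketched in a
few lines. This is a deliberately simplified illustration, not
the full UCA: it assumes one collation element per code point,
no contractions, no variable weighting, and only three levels.
The WEIGHTS table holds just the entries quoted above (with the
assumed future all-zero entry for CGJ), and the names sort_key()
and show() are hypothetical.

```python
import unicodedata

# (primary, secondary, tertiary) weights copied from the entries quoted above.
WEIGHTS = {
    "\u05DC": (0x0EC2, 0x0020, 0x0002),  # HEBREW LETTER LAMED
    "\u05B4": (0x0000, 0x00AF, 0x0002),  # HEBREW POINT HIRIQ
    "\u05B7": (0x0000, 0x00B2, 0x0002),  # HEBREW POINT PATAH
    "\u05DD": (0x0EC3, 0x0020, 0x0019),  # HEBREW LETTER FINAL MEM
    "\u034F": (0x0000, 0x0000, 0x0000),  # COMBINING GRAPHEME JOINER (assumed)
}


def sort_key(text):
    """Normalize to NFD first, then build an uncompressed three-level key."""
    elements = [WEIGHTS[ch] for ch in unicodedata.normalize("NFD", text)]
    key = []
    for level in range(3):
        if level:
            key.append(0x0000)  # level separator
        # Zero weights are ignorable at that level and are left out.
        key.extend(e[level] for e in elements if e[level] != 0)
    return key


def show(key):
    return " ".join(f"{w:04X}" for w in key)


key_plain = sort_key("\u05DC\u05B7\u05B4\u05DD")      # lamed patah hiriq finalmem
key_cgj = sort_key("\u05DC\u05B7\u034F\u05B4\u05DD")  # lamed patah CGJ hiriq finalmem
```

Under those assumptions, show(key_plain) and show(key_cgj)
reproduce the two keys compared above, and the first weight at
which they differ falls in the secondary level.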

Tailoring of the collation table could modify any of this, but
the above example is what you get just using the default table.

But it is important that people implementing searching and sorting
for Hebrew understand why and how the CGJ is "ignored" in this
context, in order to get correct results. For example, if you
strip the CGJ and *then* hand the string to the collation weighting
algorithm, normalization will again rearrange the points into
the wrong order for weighting.
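
The pitfall is easy to reproduce. Again a minimal sketch,
assuming Python's unicodedata module; the points() helper and
the variable names are hypothetical, and the second pipeline
strips the CGJ by hand only to mimic the effect of its ignorable
weight at lookup time.

```python
import unicodedata

CGJ = "\u034F"
stored = "\u05DC\u05B7\u034F\u05B4\u05DD"  # lamed, patah, CGJ, hiriq, final mem


def points(text):
    """Names of the combining marks in text, in the order they occur."""
    return [unicodedata.name(c) for c in text if unicodedata.combining(c)]


# Wrong: strip the CGJ first, then normalize. With nothing left to
# block it, NFD reorders patah and hiriq by combining class.
stripped_first = unicodedata.normalize("NFD", stored.replace(CGJ, ""))

# Right: normalize with the CGJ in place, and only ignore it afterwards.
normalized_first = unicodedata.normalize("NFD", stored).replace(CGJ, "")

order_stripped_first = points(stripped_first)      # hiriq, then patah
order_normalized_first = points(normalized_first)  # patah, then hiriq
```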

--Ken


------------- End Forwarded Message -------------