L2/03-235

Source: Ken Whistler
Date: August 5, 2003

Rick, Lisa, Cathy,

I would also like this posted as an L2 document and added to
the agenda for this topic. This is my summary of the collation
consequences of CGJ in Biblical Hebrew pointing contexts.

I'm sorting through the thread to find the email where I
summarized the argument for CGJ as an alternative to cloning
points or changing combining classes, and I'll forward that,
momentarily, too.

--Ken

------------- Begin Forwarded Message -------------

Date: Mon, 28 Jul 2003 19:05:54 -0700 (PDT)
Subject: Re: Yerushala(y)im - or Biblical Hebrew
To: peter.r.kirk@ntlworld.com
Cc: unicode@unicode.org, kenw@sybase.com
X-archive-position: 7497
X-list: unicode

Peter Kirk asked:

> One question arises. If CGJ is used as proposed, so we have sequences
> such as patah CGJ hiriq and perhaps meteg CGJ vowel, does this imply
> that these sequences will necessarily be treated in collation as
> distinct from simple patah hiriq and meteg vowel sequences (the latter
> would of course be reversed by normalisation)? This is a simple
> question.

Yes. But perhaps not quite in the way you expect.

CGJ is specified as being ignored by default, and will be
weighted as such in the allkeys.txt table for the Unicode
Collation Algorithm. To get combinations with CGJ to weight
differently, you have to take some positive action to tailor
those combinations distinctly.

> I'm not yet sure if this would be desirable or not. Well, it
> would probably be better for meteg CGJ vowel to be collated the same as
> vowel meteg, as the distinction here is graphical but not semantic. As
> for patah CGJ hiriq, an advantage of collating this sequence the same as
> hiriq patah would be that existing texts which do not have CGJ here
> would be collated together with ones which do, and perhaps that users
> doing searches would not have to type the CGJ. But is this perhaps
> something for which specific collation rules can be tailored?

Yes.
However, there are some subtle considerations in collation
weighting in the UCA, which may not be evident at first.

For example, let's consider the weighting, by the default UCA
table, allkeys.txt, for a sequence in question, assuming the
future version of allkeys.txt, in which U+034F CGJ will be
given an ignorable collation weight (i.e. [0000.0000.0000]).

<lamed, patah, hiriq, final mem>

is canonically equivalent to:

<lamed, hiriq, patah, final mem>

(which is the problem, in the first place, of course)

It is *not* canonically equivalent to:

<lamed, patah, CGJ, hiriq, final mem>

One of the requirements of the UCA is that canonically equivalent
sequences *must* be given equivalent collation weights. The converse
is not true, since of course one of the points of the exercise
for collation is that different sequences *may* be given equivalent
collation weights.

The Unicode Collation Algorithm ensures that canonically equivalent
sequences are given equivalent collation weights by *requiring*
that NFD normalization be applied *before* looking up collation
weights.

So for the above examples, what weights would we end up with?

<lamed, patah, hiriq, final mem> --NFD--> <lamed, hiriq, patah, final mem>

and then weights to:

05DC ; [.0EC2.0020.0002.05DC] # HEBREW LETTER LAMED
05B4 ; [.0000.00AF.0002.05B4] # HEBREW POINT HIRIQ
05B7 ; [.0000.00B2.0002.05B7] # HEBREW POINT PATAH
05DD ; [.0EC3.0020.0019.05DD] # HEBREW LETTER FINAL MEM; QQK

and for that sequence you would construct the weighted key as:

0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019

(before application of any of the techniques for key compression).
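
The normalize-then-weight step can be tried out with a small
sketch. This is only an illustration, not a UCA implementation: the
weight table holds just the four allkeys.txt entries quoted above,
the fourth weight level is left out, and sort_key and show are
hypothetical helper names. Python's standard unicodedata module
supplies the NFD step.

```python
import unicodedata

# Collation elements copied from the allkeys.txt excerpt quoted above,
# as (primary, secondary, tertiary). The fourth level is omitted.
WEIGHTS = {
    "\u05DC": (0x0EC2, 0x0020, 0x0002),  # HEBREW LETTER LAMED
    "\u05B4": (0x0000, 0x00AF, 0x0002),  # HEBREW POINT HIRIQ
    "\u05B7": (0x0000, 0x00B2, 0x0002),  # HEBREW POINT PATAH
    "\u05DD": (0x0EC3, 0x0020, 0x0019),  # HEBREW LETTER FINAL MEM
}


def sort_key(text):
    """Hypothetical helper: apply NFD, then build a three-level key."""
    nfd = unicodedata.normalize("NFD", text)
    elements = [WEIGHTS[ch] for ch in nfd]
    key = []
    for level in range(3):
        if level:
            key.append(0x0000)  # separator between weight levels
        # Zero weights are ignorable at this level and are skipped.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key


def show(key):
    """Format a key as space-separated four-digit hex weights."""
    return " ".join(f"{w:04X}" for w in key)


patah_hiriq = "\u05DC\u05B7\u05B4\u05DD"  # lamed, patah, hiriq, final mem
hiriq_patah = "\u05DC\u05B4\u05B7\u05DD"  # lamed, hiriq, patah, final mem

print(show(sort_key(patah_hiriq)))
# 0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019
print(sort_key(patah_hiriq) == sort_key(hiriq_patah))
# True
```

Both input orders produce the same key because NFD puts hiriq
(combining class 14) ahead of patah (combining class 17) before any
weights are looked up.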
Now for the newly suggested representation, we would get:

05DC ; [.0EC2.0020.0002.05DC] # HEBREW LETTER LAMED
05B7 ; [.0000.00B2.0002.05B7] # HEBREW POINT PATAH
034F ; [.0000.0000.0000.0000] # COMBINING GRAPHEME JOINER
05B4 ; [.0000.00AF.0002.05B4] # HEBREW POINT HIRIQ
05DD ; [.0EC3.0020.0019.05DD] # HEBREW LETTER FINAL MEM; QQK

and the weighted key would end up as:

0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019

Comparing the two, you see:

0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019
0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019
                    ^^^^^^^^^

They aren't equal, at the secondary level. And why? Well,
precisely because the first two strings are canonically
equivalent to the <hiriq, patah> sequence, while the last
string, with the CGJ, is the representation for the desired
<patah, hiriq> sequence.

This is, of course, precisely the desired result -- the CGJ
is ignored for weighting, but its presence prevents the
reordering of the vowels into the undesired sequence by
normalization. And the resultant weighted key weights the
vowels in the correct order.

Tailoring of the collation table could modify any of this,
but the above example is what you get just using the default
table.

But it is important that people implementing searching and
sorting for Hebrew understand why and how the CGJ is "ignored"
in this context, in order to get correct results. For example,
if you strip the CGJ and *then* hand the string to the collation
weighting algorithm, normalization will again rearrange the
points into the wrong order for weighting.

--Ken

------------- End Forwarded Message -------------
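
The caution about stripping the CGJ before weighting can be checked
with a short sketch. As before, this is an illustration under stated
assumptions rather than a real UCA implementation: the table
contains only the entries quoted in the message (including the
all-zero CGJ entry assumed for the future allkeys.txt), the fourth
level is omitted, and sort_key and show are hypothetical helper
names.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# (primary, secondary, tertiary) weights from the entries quoted above.
WEIGHTS = {
    "\u05DC": (0x0EC2, 0x0020, 0x0002),  # HEBREW LETTER LAMED
    "\u05B4": (0x0000, 0x00AF, 0x0002),  # HEBREW POINT HIRIQ
    "\u05B7": (0x0000, 0x00B2, 0x0002),  # HEBREW POINT PATAH
    "\u05DD": (0x0EC3, 0x0020, 0x0019),  # HEBREW LETTER FINAL MEM
    CGJ: (0x0000, 0x0000, 0x0000),       # completely ignorable
}


def sort_key(text):
    """Hypothetical helper: apply NFD, then build a three-level key."""
    nfd = unicodedata.normalize("NFD", text)
    elements = [WEIGHTS[ch] for ch in nfd]
    key = []
    for level in range(3):
        if level:
            key.append(0x0000)  # separator between weight levels
        # Zero weights (such as every CGJ weight) contribute nothing.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key


def show(key):
    """Format a key as space-separated four-digit hex weights."""
    return " ".join(f"{w:04X}" for w in key)


with_cgj = "\u05DC\u05B7" + CGJ + "\u05B4\u05DD"

# Correct: the CGJ stays in the string through normalization and is
# ignored only at weighting time.
good = sort_key(with_cgj)

# Wrong: once the CGJ is removed, NFD reorders patah and hiriq.
bad = sort_key(with_cgj.replace(CGJ, ""))

print(show(good))
# 0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019
print(show(bad))
# 0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019
```

The two keys differ only in the order of the secondary weights 00B2
and 00AF, which is the difference marked in the comparison in the
message.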