Re: Yerushala(y)im - or Biblical Hebrew

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 28 2003 - 22:05:54 EDT

    Peter Kirk asked:

    > One question arises. If CGJ is used as proposed, so we have sequences
    > such as patah CGJ hiriq and perhaps meteg CGJ vowel, does this imply
    > that these sequences will necessarily be treated in collation as
    > distinct from simple patah hiriq and meteg vowel sequences (the latter
    > would of course be reversed by normalisation)? This is a simple
    > question.

    Yes. But perhaps not quite in the way you expect.

    CGJ is specified as being ignored by default, and will be
    weighted as such in the allkeys.txt table for the Unicode Collation
    Algorithm.

    To get combinations with CGJ to weight differently, you have to
    take some positive action to tailor those combinations distinctly.

    > I'm not yet sure if this would be desirable or not. Well, it
    > would probably be better for meteg CGJ vowel to be collated the same as
    > vowel meteg, as the distinction here is graphical but not semantic. As
    > for patah CGJ hiriq, an advantage of collating this sequence the same as
    > hiriq patah would be that existing texts which do not have CGJ here
    > would be collated together with ones which do, and perhaps that users
    > doing searches would not have to type the CGJ. But is this perhaps
    > something for which specific collation rules can be tailored?

    Yes.

    However, there are some subtle considerations in collation
    weighting in the UCA, which may not be evident at first.
    For example, let's consider the weighting, by the default
    UCA table, allkeys.txt, for a sequence in question, assuming
    the future version of allkeys.txt, in which U+034F CGJ will
    be given an ignorable collation weight (i.e. [0000.0000.0000]).

    <lamed, patah, hiriq, finalmem>

    is canonically equivalent to:

    <lamed, hiriq, patah, finalmem>

    (which is the problem, in the first place, of course)

    It is *not* canonically equivalent to:

    <lamed, patah, CGJ, hiriq, finalmem>

    One of the requirements of the UCA is that canonically equivalent
    sequences *must* be given equivalent collation weights. The
    converse is not true, since of course one of the points of the
    exercise for collation is that different sequences *may* be
    given equivalent collation weights.

    The Unicode Collation Algorithm ensures that canonically equivalent
    sequences are given equivalent collation weights by *requiring*
    that NFD normalization be applied *before* looking up collation weights.
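
    The reordering done by that normalization step can be checked with a
    minimal sketch, assuming Python's standard unicodedata module (the
    variable and function names are illustrative only). It prints the
    canonical combining classes that drive the reordering, then the NFD
    form of the sequence with and without CGJ:

```python
import unicodedata

LAMED, HIRIQ, PATAH, FINAL_MEM = "\u05DC", "\u05B4", "\u05B7", "\u05DD"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def codepoints(text):
    """Return the code points of text as uppercase hex strings."""
    return [f"{ord(ch):04X}" for ch in text]


plain = LAMED + PATAH + HIRIQ + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Canonical combining classes: marks with a lower nonzero class are moved
# first; a class of 0 (as for CGJ) blocks reordering across it.
print(unicodedata.combining(HIRIQ), unicodedata.combining(PATAH),
      unicodedata.combining(CGJ))

# Without CGJ, hiriq is moved in front of patah.
print(codepoints(unicodedata.normalize("NFD", plain)))

# With CGJ between the two points, the order is left alone.
print(codepoints(unicodedata.normalize("NFD", with_cgj)))
```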

    So for the above examples, what weights would we end up with?

    <lamed, patah, hiriq, finalmem> --NFD--> <lamed, hiriq, patah, finalmem>

    which is then weighted as:

    05DC ; [.0EC2.0020.0002.05DC] # HEBREW LETTER LAMED
    05B4 ; [.0000.00AF.0002.05B4] # HEBREW POINT HIRIQ
    05B7 ; [.0000.00B2.0002.05B7] # HEBREW POINT PATAH
    05DD ; [.0EC3.0020.0019.05DD] # HEBREW LETTER FINAL MEM; QQK

    and for that sequence you would construct the weighted key as:

    0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019

    (before application of any of the techniques for key compression).
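
    The key construction above can be sketched roughly as follows. This
    is a simplified illustration, not a full UCA implementation: the
    WEIGHTS table holds only the four entries quoted above, the fourth
    field of each collation element is dropped, and the names sort_key
    and show are made up for the example. The input is assumed to be in
    NFD order already.

```python
# Collation elements copied from the allkeys.txt lines quoted in the message:
# code point -> (primary, secondary, tertiary). The fourth field is omitted.
WEIGHTS = {
    0x05DC: (0x0EC2, 0x0020, 0x0002),  # HEBREW LETTER LAMED
    0x05B4: (0x0000, 0x00AF, 0x0002),  # HEBREW POINT HIRIQ
    0x05B7: (0x0000, 0x00B2, 0x0002),  # HEBREW POINT PATAH
    0x05DD: (0x0EC3, 0x0020, 0x0019),  # HEBREW LETTER FINAL MEM
}


def sort_key(code_points):
    """Build an uncompressed three-level key from already-normalized input."""
    elements = [WEIGHTS[cp] for cp in code_points]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0x0000)  # separator between levels
        # A zero weight is ignorable at that level and is left out of the key.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return key


def show(key):
    """Format a key as space-separated four-digit hex weights."""
    return " ".join(f"{w:04X}" for w in key)


# <lamed, hiriq, patah, finalmem>, i.e. the sequence after NFD reordering
print(show(sort_key([0x05DC, 0x05B4, 0x05B7, 0x05DD])))
```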

    Now for the newly suggested representation, we would get:

    <lamed, patah, CGJ, hiriq, finalmem>

    05DC ; [.0EC2.0020.0002.05DC] # HEBREW LETTER LAMED
    05B7 ; [.0000.00B2.0002.05B7] # HEBREW POINT PATAH
    034F ; [.0000.0000.0000.0000] # COMBINING GRAPHEME JOINER
    05B4 ; [.0000.00AF.0002.05B4] # HEBREW POINT HIRIQ
    05DD ; [.0EC3.0020.0019.05DD] # HEBREW LETTER FINAL MEM; QQK

    and the weighted key would end up as:

    0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019

    Comparing the two, you see:

    0EC2 0EC3 0000 0020 00AF 00B2 0020 0000 0002 0002 0002 0019
    0EC2 0EC3 0000 0020 00B2 00AF 0020 0000 0002 0002 0002 0019
                        ^^^^^^^^^
                        
    They aren't equal, at the secondary level. And why? Well,
    precisely because the first two strings are canonically
    equivalent to the <hiriq, patah> sequence, while the last
    string, with the CGJ, is the representation for the desired
    <patah, hiriq> sequence.

    This is, of course, precisely the desired result -- the CGJ is
    ignored for weighting, but its presence prevents the reordering
    of the vowels into the undesired sequence by normalization.
    And the resultant weighted key weights the vowels in the correct
    order.

    Tailoring of the collation table could modify any of this, but
    the above example is what you get just using the default table.

    But it is important that people implementing searching and sorting
    for Hebrew understand why and how the CGJ is "ignored" in this
    context, in order to get correct results. For example, if you
    strip the CGJ and *then* hand the string to the collation weighting
    algorithm, normalization will again rearrange the points into
    the wrong order for weighting.
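
    That pitfall can be reproduced with a small sketch, again assuming
    Python's unicodedata module and a toy weight table limited to the
    entries discussed here. The all-zero entry for CGJ is the proposed
    future weighting, not something taken from a released table, and
    sort_key is a hypothetical helper, not a real collation API. It
    compares the key for the string with CGJ intact against the key
    obtained when CGJ is stripped before normalization:

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# (primary, secondary, tertiary) weights from the message. The CGJ entry is
# the proposed completely ignorable weighting, assumed for this sketch.
WEIGHTS = {
    "\u05DC": (0x0EC2, 0x0020, 0x0002),  # HEBREW LETTER LAMED
    "\u05B4": (0x0000, 0x00AF, 0x0002),  # HEBREW POINT HIRIQ
    "\u05B7": (0x0000, 0x00B2, 0x0002),  # HEBREW POINT PATAH
    "\u05DD": (0x0EC3, 0x0020, 0x0019),  # HEBREW LETTER FINAL MEM
    CGJ: (0x0000, 0x0000, 0x0000),
}


def sort_key(text):
    """Normalize to NFD first, then build an uncompressed three-level key."""
    elements = [WEIGHTS[ch] for ch in unicodedata.normalize("NFD", text)]
    key = []
    for level in range(3):
        if level > 0:
            key.append(0x0000)  # separator between levels
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


plain = "\u05DC\u05B7\u05B4\u05DD"                 # lamed, patah, hiriq, finalmem
with_cgj = "\u05DC\u05B7" + CGJ + "\u05B4\u05DD"   # patah CGJ hiriq

right = sort_key(with_cgj)                   # CGJ still present at NFD time
wrong = sort_key(with_cgj.replace(CGJ, ""))  # CGJ stripped before NFD

print(right != sort_key(plain))  # the CGJ form keeps <patah, hiriq> order
print(wrong == sort_key(plain))  # stripping first collapses it to <hiriq, patah>
```

    The design point is only that the ignoring has to happen at weight
    lookup, after normalization, rather than as a preprocessing step on
    the string.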

    --Ken


