Re: Yerushala(y)im - or Biblical Hebrew

From: Peter Kirk
Date: Tue Jul 08 2003 - 09:18:33 EDT

    On 08/07/2003 02:23, Peter Kirk wrote:

    > Would it work to define a new character, for example, for patah-hiriq
    > which has a canonical decomposition into patah plus hiriq, or even
    > into hiriq plus patah? Would normalisation compose a patah-hiriq
    > sequence into this character and so get round the reordering problem?
    > Remember that the reverse sequence is actually not attested, as far as
    > I can tell for any of the sequences in question.

    A couple of off-list comments have made it clear to me that this
    proposal needs some clarification and adjustment, but I think it can
    still be made to work. It is a nasty kludge, but then, as someone pointed
    out, any solution to this problem is bound to be a nasty kludge. In some
    ways it is less nasty than others that have been suggested, and it
    avoids some of the disadvantages that have been mentioned. It also
    has the advantage that no recoding of existing text is required. That
    doesn't make it my preferred solution (that is still the CGJ solution),
    but it is at least worth considering.

    This solution requires adding a new character for each vowel sequence
    found in Hebrew texts. Currently six such sequences have been identified
    in the WTS Bible text, though one of these (sheva-hiriq) is already in
    canonical order and so is not a problem. So this needs fewer new
    characters than the earlier proposal, though there may be other
    sequences in other texts. It relies on the fact that none of these
    sequences is found in reverse, although we cannot guarantee that this is
    true for all texts. I will use the patah-hiriq sequence as an example;
    all the other sequences would be handled separately in the same way.
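
    To make the canonical order point concrete, here is a minimal Python
    sketch using the standard unicodedata module. It only inspects the
    combining classes of sheva, hiriq and patah and shows what normalisation
    does to a typed patah-hiriq sequence. The helper name
    in_canonical_order and the choice of lamed as the base letter are mine,
    for illustration only.

```python
import unicodedata

SHEVA, HIRIQ, PATAH = "\u05b0", "\u05b4", "\u05b7"  # Hebrew points
LAMED = "\u05dc"                                      # base letter (illustrative choice)


def in_canonical_order(marks: str) -> bool:
    """True if the combining classes of the marks are non-decreasing,
    i.e. canonical reordering would leave the sequence untouched."""
    classes = [unicodedata.combining(ch) for ch in marks]
    return classes == sorted(classes)


# Combining classes that drive the reordering.
for name, ch in (("sheva", SHEVA), ("hiriq", HIRIQ), ("patah", PATAH)):
    print(name, unicodedata.combining(ch))

# sheva-hiriq is already in canonical order; patah-hiriq is not.
print(in_canonical_order(SHEVA + HIRIQ))
print(in_canonical_order(PATAH + HIRIQ))

# Normalising typed patah-hiriq swaps the two points.
typed = LAMED + PATAH + HIRIQ
normalised = unicodedata.normalize("NFC", typed)
print([f"U+{ord(ch):04X}" for ch in normalised])
```

    The last line shows the reordering problem: the text comes back with
    hiriq stored before patah, whatever order was typed.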

    The solution for this sequence is as follows: Define a new combining
    character, something like HEBREW LIGATURE PATAH HIRIQ, with a canonical
    decomposition of hiriq - patah (yes, that way round) and a glyph with a
    hiriq to the left of a patah. How does this help? Well, it will not
    affect users who type patah then hiriq, in non-canonical order, into an
    application which does not immediately normalise the text, as the
    renderer will still render hiriq to the left of patah, as typed. But
    when this text is normalised into NFC, the sequence will first be
    reordered as hiriq - patah, and then this combination will be composed
    into the new ligature. That is correct, isn't it? So an application
    which renders the NFC text will see the new character and should render
    it according to its glyph. In NFD text, the hiriq - patah sequence
    remains, but it is, I think, customary if not required for the renderer
    to combine the glyphs into the defined ligature before rendering. So in
    every case the end user sees hiriq to the left of patah, although the
    underlying encoding is in fact reversed.
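
    The intended behaviour can be sketched as a toy model in Python. This is
    not the real Unicode composition algorithm, and it does not show that
    real NFC would compose two combining marks in this way; that is the
    question asked above. The ligature is represented by a private-use code
    point (U+E000), and the names LIGATURE, toy_nfc and toy_nfd are
    hypothetical, chosen only for this illustration.

```python
import unicodedata

HIRIQ, PATAH, LAMED = "\u05b4", "\u05b7", "\u05dc"

# Hypothetical stand-in for the proposed HEBREW LIGATURE PATAH HIRIQ.
# U+E000 is a private-use code point, not an assigned Hebrew character.
LIGATURE = "\ue000"

# Toy composition table: decomposition (hiriq - patah) -> proposed ligature.
TOY_COMPOSITIONS = {HIRIQ + PATAH: LIGATURE}


def toy_nfc(text: str) -> str:
    """Canonically reorder with the real NFD, then apply the toy composition."""
    result = unicodedata.normalize("NFD", text)
    for pair, composed in TOY_COMPOSITIONS.items():
        result = result.replace(pair, composed)
    return result


def toy_nfd(text: str) -> str:
    """Expand the toy ligature back to hiriq - patah, then apply the real NFD."""
    for pair, composed in TOY_COMPOSITIONS.items():
        text = text.replace(composed, pair)
    return unicodedata.normalize("NFD", text)


typed = LAMED + PATAH + HIRIQ     # order as typed by the user
composed = toy_nfc(typed)         # lamed + toy ligature
decomposed = toy_nfd(composed)    # lamed + hiriq + patah
print([f"U+{ord(ch):04X}" for ch in composed])
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

    In this model both typing orders end up as the same ligature, which is
    why the proposal depends on the reverse sequence never being needed.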

    Have I missed anything vital here? I know that more study may be needed
    of interaction with cantillation marks, some of which can appear between
    the patah and the hiriq.
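
    As a starting point for that study, this small Python sketch shows what
    canonical reordering does when a mark stands between the two vowels. The
    choice of etnahta (U+0591) is an assumption made only for illustration;
    I am not claiming that this particular accent is the one attested in
    this position.

```python
import unicodedata

LAMED, HIRIQ, PATAH = "\u05dc", "\u05b4", "\u05b7"
ETNAHTA = "\u0591"  # illustrative accent; an assumption, not taken from the texts

# Typed order: patah, then the accent, then hiriq.
typed = LAMED + PATAH + ETNAHTA + HIRIQ
normalised = unicodedata.normalize("NFD", typed)

# Show each character with its combining class after reordering.
for ch in normalised:
    print(f"U+{ord(ch):04X}", unicodedata.combining(ch), unicodedata.name(ch))
```

    Because the accent has a higher combining class than either vowel, it is
    moved after both of them, so hiriq and patah end up adjacent in the
    normalised text.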

    Of course we could simply store the reversed order without defining a
    new character. But renderers would then need clear instruction somewhere
    in the Unicode text that, as an exception to the normal rules for
    rendering multiple diacritics, the hiriq should be positioned to the
    left of the patah and similarly for the other attested sequences.

    Peter Kirk
