Re: Hebrew composition model, with cantillation marks

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 28 2003 - 20:49:46 CST


I just finished an Excel spreadsheet that shows the Hebrew composition model,
and all the problems caused by the canonical order of Hebrew diacritics.

In summary, most problems come from consonant modifiers that have a
combining class higher than that of vowels or vowel modifiers.

If vowels had been assigned a null combining class, such problems would not
have appeared. The idea of generating a CGJ before all vowels in input
methods (and then letting a prenormalization process remove unnecessary CGJs
in composed strings) seems interesting, as it forces vowels to behave like
base characters. However, it does not solve every problem: it only fixes the
ordering problem caused by the wrong combining classes 21, 24 and 25
assigned respectively to DAGESH/MAPIQ, SHIN DOT and SIN DOT, which logically
come before the vowels (classes 10 to 20) and the vowel modifiers (classes
22, 23 and 26).
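
The reordering can be checked with a minimal sketch, assuming Python's
standard unicodedata module (the variable names are illustrative only). It
takes BET, DAGESH, PATAH in logical order, normalizes the string, and then
repeats the test with a CGJ before the vowel:

```python
import unicodedata

BET, DAGESH, PATAH, CGJ = "\u05D1", "\u05BC", "\u05B7", "\u034F"

# Combining classes as recorded in the Unicode Character Database.
dagesh_class = unicodedata.combining(DAGESH)  # 21
patah_class = unicodedata.combining(PATAH)    # 17

# Logical order: consonant, consonant modifier, vowel.
logical = BET + DAGESH + PATAH
# Canonical ordering sorts the marks by class, so the vowel moves first.
reordered = unicodedata.normalize("NFC", logical)

# CGJ has class 0, so it splits the run of marks and blocks the swap.
guarded = BET + DAGESH + CGJ + PATAH
guarded_nfc = unicodedata.normalize("NFC", guarded)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in guarded_nfc])
```

The first list shows the vowel ahead of the DAGESH; the second keeps the
order that was typed, at the cost of one extra character.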

We could specify a rule for inserting CGJ only when it is useful:

    - before (any vowel, vowel modifier or cantillation mark) if it follows
(DAGESH/MAPIQ, SHIN DOT or SIN DOT),

    - just before a second vowel on the same consonant (in that case it
plays the role of the "missing consonant" in Yerushala(y)im). But this
requires some more specific rules to remove other superfluous CGJs, as it is
not always needed: this depends on the relative combining classes of the
corresponding vowels (classes 10 to 20). See below for when it is needed.
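
One way to sketch such a rule, assuming Python's unicodedata module, is to
collapse both cases into a single generic test: insert a CGJ wherever a
mark has a lower nonzero class than the mark just before it, since that is
exactly where canonical ordering would swap them. This is a simplification
of the two rules above, not a replacement for them, and the function name
is hypothetical. Note that under this test a cantillation mark (classes 220
and higher) following a DAGESH never gets a CGJ, because it already sorts
after it.

```python
import unicodedata

CGJ = "\u034F"

def insert_cgj_where_needed(text):
    """Insert CGJ only where canonical ordering would swap two marks.

    Assumes `text` is in logical (typed) order and not yet normalized.
    """
    out = []
    prev_class = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # A lower nonzero class after a higher one would be moved forward.
        if ccc != 0 and prev_class > ccc:
            out.append(CGJ)
        out.append(ch)
        prev_class = ccc
    return "".join(out)

# BET + DAGESH + PATAH: the vowel (17) follows the DAGESH (21).
with_dagesh = insert_cgj_where_needed("\u05D1\u05BC\u05B7")
# LAMED + PATAH + HIRIQ: second vowel (14) has a lower class than the first.
two_vowels = insert_cgj_where_needed("\u05DC\u05B7\u05B4")
# LAMED + HIRIQ + PATAH: already in class order, no CGJ required.
no_cgj = insert_cgj_where_needed("\u05DC\u05B4\u05B7")
```

The last two cases show why the second rule is "not always needed": whether
a CGJ is required depends only on the relative classes of the two vowels.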

Another solution could be to duplicate these 3 consonant modifiers, so:
    NEW DAGESH/MAPIQ: class 10 (central position)
    NEW SHIN DOT: class 11 (above-right position)
    NEW SIN DOT: class 12 (above-left position)
    and remap all vowels and vowel signs starting at class 13 and higher...
It would be appropriate in that case not to name them "POINT", but "LETTER
MODIFIER".
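
The effect of such a reassignment can be simulated with a rough sketch,
assuming Python's unicodedata module. Everything in it is hypothetical: the
proposal concerns new code points, whereas the sketch simply pretends that
the existing ones carry the proposed classes, and the helper names are
invented for illustration. It re-implements the canonical ordering step as
a stable sort of each run of marks, once with the real classes and once
with the proposed ones:

```python
import unicodedata

BET, DAGESH, PATAH = "\u05D1", "\u05BC", "\u05B7"

# HYPOTHETICAL classes from the proposal, modeled on the existing
# code points: DAGESH/MAPIQ 10, SHIN DOT 11, SIN DOT 12.
PROPOSED_MODIFIERS = {"\u05BC": 10, "\u05C1": 11, "\u05C2": 12}

def proposed_class(ch):
    """Hypothetical class: modifiers first, other Hebrew points shifted by 3."""
    if ch in PROPOSED_MODIFIERS:
        return PROPOSED_MODIFIERS[ch]
    ccc = unicodedata.combining(ch)
    if "\u0591" <= ch <= "\u05C7" and 10 <= ccc <= 23:
        return ccc + 3  # vowels and vowel signs now start at 13
    return ccc

def canonical_reorder(text, class_of):
    """Stable-sort every run of nonzero-class marks by class."""
    out, run = [], []
    for ch in text:
        if class_of(ch) == 0:
            out.extend(sorted(run, key=class_of))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=class_of))
    return "".join(out)

logical = BET + DAGESH + PATAH
with_real_classes = canonical_reorder(logical, unicodedata.combining)
with_proposed_classes = canonical_reorder(logical, proposed_class)
```

With the real classes the vowel is moved before the DAGESH; with the
proposed ones the logical order survives without any CGJ.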

Also I note that most uses of dagesh/mapiq, shin dot and sin dot with base
consonants were mapped in Unicode by encoding precomposed consonants.
The bad thing is that they are canonically decomposed and prohibited from
recomposition.

If one had encoded and used the original text with a legacy Hebrew encoding
which included these precomposed characters, without considering
Unicode-specific normalization, there would have been no such problem. So
the problem has been introduced by Unicode, which gave them canonical
decompositions that apply in all normalization forms, instead of
decompositions that apply only in NFK*. Worse, the canonical composition
exclusions prevent us from using these precomposed characters in NFC text.
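
This is easy to observe with a short sketch, assuming Python's unicodedata
module and taking U+FB31 HEBREW LETTER BET WITH DAGESH as an arbitrary
representative of the precomposed characters:

```python
import unicodedata

BET_WITH_DAGESH = "\uFB31"  # HEBREW LETTER BET WITH DAGESH

# A canonical mapping is listed without a "<compat>"-style tag.
decomposition = unicodedata.decomposition(BET_WITH_DAGESH)

# NFD decomposes the character, as expected.
nfd = unicodedata.normalize("NFD", BET_WITH_DAGESH)
# NFC decomposes it too and, because of the composition exclusion,
# never puts it back together.
nfc = unicodedata.normalize("NFC", BET_WITH_DAGESH)

print(decomposition)
print([f"U+{ord(c):04X}" for c in nfc])
```

Both normalization forms yield BET followed by DAGESH, so the precomposed
character cannot survive in normalized text.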

Assigning new codepoints could ease the transcription of texts from legacy
Hebrew encodings (including the Windows Hebrew and ISO Hebrew character
sets) to Unicode without running into all these common problems: this would
affect the mapping to Unicode of these legacy charsets, but it would
certainly be beneficial in the long term.

There are however two more subtle problems:

1) Within the set of vowels U+05B0 to U+05BB (classes 10 to 20):

    They all combine at position "below", except U+05B9 POINT HOLAM
(class 19).

    The U+05BB POINT QUBUTS vowel (class 20) is not grouped with the other
"below" vowels. In fact the canonical ordering attempts to force a unique
order for all vowels, which does not take into account their real layout
combining properties.

    In reality, these vowels should have been given only 2 possible
combining classes, such as 13 (position below) for all vowels, except POINT
HOLAM which would have class 14 (or could be kept at its existing class 19,
position above-left).

    The ordering problem can be solved using a CGJ before the U+05B9 POINT
HOLAM vowel (class 19, above-left) if it needs to follow the U+05BB POINT
QUBUTS vowel (class 20, below).

    The alternative would be to encode a new POINT HOLAM or a new POINT
QUBUTS with a more correct class that respects the combining groups in
the Hebrew script.
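
    The vowel classes and the HOLAM/QUBUTS case can be inspected with a
minimal sketch, assuming Python's unicodedata module. ALEF is used only as
an arbitrary base letter, and the listing prints whatever the local Unicode
database assigns in that range:

```python
import unicodedata

ALEF, HOLAM, QUBUTS, CGJ = "\u05D0", "\u05B9", "\u05BB", "\u034F"

# List the combining class of every assigned code point in U+05B0..U+05BB.
for cp in range(0x05B0, 0x05BC):
    ch = chr(cp)
    name = unicodedata.name(ch, None)
    if name is not None:
        print(f"U+{cp:04X} {unicodedata.combining(ch):3d} {name}")

# QUBUTS (20) typed before HOLAM (19): normalization swaps them.
typed = ALEF + QUBUTS + HOLAM
swapped = unicodedata.normalize("NFC", typed)

# A CGJ before HOLAM keeps the typed order.
guarded = ALEF + QUBUTS + CGJ + HOLAM
guarded_nfc = unicodedata.normalize("NFC", guarded)
```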

2) Within cantillation marks (U+0591 to U+05AF, plus U+05C4 MARK UPPER DOT):

    The accents coded with class 220 (below), 222 (below-right), 228
(above-left) have no problem.

    However, the remaining 19 accents and marks at class 230 (above) do not
all belong to the same combining category. The problems concern these 5
characters:
        U+059D ACCENT GERESH MUQDAM (alias "gerich mouqdam")
        U+05A0 ACCENT TELISHA GEDOLA (alias "talchah")
    which are combining at position above-right, and
        U+0599 ACCENT PASHTA (alias "qadma")
        U+05A1 ACCENT PAZER
        U+05A9 ACCENT TELISHA QETANA (alias "tarsa")
    which are combining at position above-left.

    This is not strictly a problem for preserving the semantics of the
text, as they share the same combining class, and so the normalization
process will not reorder them. But it still prevents a more complete
normalization that would take into account these 5 accents, which may
(should?) be reordered. However, the case of multiple cantillation marks on
the same vowel may be quite rare even in historic texts (but I don't have a
copy of the large liturgical Hebrew texts to verify this).
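
    A small sketch, assuming Python's unicodedata module, illustrates the
point with two of the accents listed above (BET is an arbitrary base
letter, and the pairing of these two accents is only for demonstration):

```python
import unicodedata

BET = "\u05D1"
TELISHA_GEDOLA = "\u05A0"  # class 230, said above to sit above-right
PAZER = "\u05A1"           # class 230, said above to sit above-left

classes = (unicodedata.combining(TELISHA_GEDOLA),
           unicodedata.combining(PAZER))

# Marks with equal classes are never exchanged by canonical ordering,
# so each typed order is already normalized.
order_a = BET + TELISHA_GEDOLA + PAZER
order_b = BET + PAZER + TELISHA_GEDOLA
a_nfc = unicodedata.normalize("NFC", order_a)
b_nfc = unicodedata.normalize("NFC", order_b)

# The two strings therefore remain distinct after normalization.
still_distinct = a_nfc != b_nfc
```

    The two orders are not canonically equivalent, even though the accents
occupy different positions around the letter.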

If anyone is interested, I attach my Excel sheet, which makes all this more
visual, and which also includes the existing decompositions of compatibility
characters (U+FBxx), shown in italic rows.
The table is ordered by logical semantics and grouping. The combining
classes that cause problems are shown in bold white on red cells, and the
positioning constraints partly explained in the Unicode reference chapter
can be better understood by looking at the positioning columns in the table.

If errors remain in this table, please don't shout at me too much...




