Re: Hebrew composition model, with cantillation marks

From: Peter Kirk
Date: Wed Oct 29 2003 - 06:13:14 CST

Thank you, Philippe. I include the full text of your posting plus the
attachment for the benefit of those on the Unicode Hebrew list who have
missed out on this. Some of the issues here have already been discussed
on that list. Also I wonder if you have seen,
which includes a good analysis of the issues and proposes new characters
with more suitable combining classes. The problem with that proposal was
not with the technical details but that the principle of using a
separate encoding for biblical Hebrew is unacceptable. That problem
would be avoided if these new characters (with names adjusted) were used
for all pointed Hebrew and the existing characters deprecated. But there
are other reasons which make that suggestion difficult to accept -
although there is probably not much existing pointed modern Hebrew text.

See some further comments on the details below.

On 28/10/2003 18:49, Philippe Verdy wrote:

>I just finished an Excel spreadsheet that shows the Hebrew composition model,
>and all the problems caused by the canonical order of Hebrew diacritics.
>In summary, most problems come from consonant modifiers which have a
>combining class higher than vowels or vowel modifiers.
>If vowels had been assigned a null combining class, such problems would not
>have appeared. The idea of generating a CGJ before all vowels in input
>methods (and then letting a prenormalization process remove unnecessary CGJs in
>composed strings) seems interesting, as it forces vowels to behave like base
>characters, but it solves only part of the problem, namely the ordering
>problem caused by the wrong combining classes 21, 24 and 25 assigned
>respectively to DAGESH/MAPIQ, SHIN DOT and SIN DOT, that come logically
>before the vowels (in classes 10 to 20), or vowel modifiers (classes 22, 23
>and 26).
Actually rafe, in class 23, and varika, class 26 but not used in Hebrew,
should be considered consonant modifiers. Rafe basically indicates the
absence of dagesh, and so these two fit in the same logical class. The
only vowel modifier in this sense is meteg. But meteg is best considered
as an accent, although it is sometimes used in texts which are not
otherwise accented. Typographically, the only ways in which meteg
differs from other accents are that it can appear to the right of a low
vowel or in the middle of one of the hataf vowels.
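
The combining classes mentioned above can be checked directly. The following is only a minimal sketch, assuming Python's standard unicodedata module (whose values come from the Unicode data shipped with the interpreter); the sample sequence shin, shin dot, dagesh, qamats is an illustrative choice, not one taken from the spreadsheet.

```python
import unicodedata

# Marks discussed above; the labels are informal, not official character names.
MARKS = {
    "DAGESH/MAPIQ": "\u05BC",
    "SHIN DOT": "\u05C1",
    "SIN DOT": "\u05C2",
    "METEG": "\u05BD",
    "RAFE": "\u05BF",
    "VARIKA": "\uFB1E",
}
classes = {label: unicodedata.combining(ch) for label, ch in MARKS.items()}

# Illustrative sequence in logical order: shin, shin dot, dagesh, qamats.
logical = "\u05E9\u05C1\u05BC\u05B8"
nfd = unicodedata.normalize("NFD", logical)

for label, ccc in classes.items():
    print(f"{label}: class {ccc}")
print("logical:", [f"U+{ord(c):04X}" for c in logical])
print("NFD:    ", [f"U+{ord(c):04X}" for c in nfd])
```

Because qamats is in class 18, canonical ordering moves it ahead of dagesh (21) and shin dot (24), so the normalised string no longer follows the logical order consonant, consonant modifiers, vowel.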

Note that, as an exception to the neat rules here, when vav with shuruq
is used as a vowel, i.e. the vowel is a separate base character, any
meteg or accent is attached not to the vav with shuruq but to the
consonant. This is because the accents are syllable modifiers more than vowel
modifiers. But it is most sensible for a number of reasons to continue
to order them after the combining vowels.

>We could specify a rule for inserting CGJ only when it is useful:
> - before (any vowel, vowel modifier or cantillation mark) if it follows
A good rule. This will greatly simplify rendering and collation. But it
does need to be agreed by all.

> - just before a second vowel on the same consonant (in that case it
>plays the role of the "missing consonant" in Yerushala(y)im). But this
>requires some more specific rules to remove other superfluous CGJ, as it is
>not always needed: this depends on the relative combining classes of the
>corresponding vowels (in classes 10 to 20). See below when this is needed.
Well, if this CGJ is inserted by a keyboard utility or a code conversion
routine, these more specific rules can be programmed. But in practice
only a very small number of superfluous CGJs will be added if this rule
is used unmodified; I found only one case in the whole Hebrew Bible of
two vowel points on one base character which happen to be in canonical

Another use for CGJ which you have not specified is to ensure proper
positioning of meteg, to the right or left of vowels or accents. (Medial
meteg needs a different mechanism.)
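
The "more specific rules" for two marks on one base character can be sketched as follows. This is a hedged illustration only, assuming Python's standard unicodedata module; the helper names needs_cgj and protect are hypothetical, and the meteg example assumes that a meteg meant to sit to the right of a vowel is typed before that vowel.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def needs_cgj(first: str, second: str) -> bool:
    """True if canonical reordering would swap these two adjacent characters."""
    a, b = unicodedata.combining(first), unicodedata.combining(second)
    return a > b > 0

def protect(text: str) -> str:
    """Insert CGJ only between adjacent marks that would otherwise be reordered."""
    out = []
    for ch in text:
        if out and needs_cgj(out[-1], ch):
            out.append(CGJ)
        out.append(ch)
    return "".join(out)

# Lamed with patah (class 17) then hiriq (class 14), as in Yerushala(y)im.
two_vowels = "\u05DC\u05B7\u05B4"
# Alef with meteg (class 22) typed before qamats (class 18).
meteg_first = "\u05D0\u05BD\u05B8"

for sample in (two_vowels, meteg_first):
    guarded = protect(sample)
    print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", sample)],
          [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", guarded)])
```

A CGJ is added only when the second mark has the lower non-zero class, so pairs that are already in canonical order are left alone.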

>Another solution could be to duplicate these 3 consonant modifiers, so:
> NEW DAGESH/MAPIQ: class 10 (central position)
> NEW SHIN DOT: class 11 (above-right position)
> NEW SIN DOT: class 12 (above-left position)
> and remap all vowels and vowel signs starting at class 13 and higher...
>It would be appropriate in that case to not name them "POINT", but "LETTER
>Also I note that most uses of dagesh/mapiq, shin dots and sin dots with
>base consonants were mapped in Unicode by encoding precomposed consonants.
>The bad thing is that they are canonically decomposed and prohibited from
>If one had encoded and used the original text with a legacy Hebrew encoding
>which included these precomposed characters, without considering the case of
>Unicode-specific normalizations, there was no such problems. So the problem
>has been introduced by Unicode, which made them canonical decompositions in
>all NF forms, instead of defining them only for NFK*. Worse, the canonical
>composition exclusions are blocking us from using these precomposed
>characters in a NFC text.
There would indeed have been fewer problems if these precomposed forms
had not been specified as composition exclusions. Well, it would have
enabled rendering of NFC Hebrew by non-compliant rendering engines, i.e.
ones which don't render all canonically equivalent sequences the same.
It would not have simplified the collation issue as collation is based
on NFD.
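
The behaviour of the precomposed forms under normalisation can be observed with a short check. This is a minimal sketch, assuming Python's standard unicodedata module and taking U+FB2A (shin with shin dot) as a representative example.

```python
import unicodedata

precomposed = "\uFB2A"  # HEBREW LETTER SHIN WITH SHIN DOT

# A canonical decomposition is reported without a <compat>-style tag.
mapping = unicodedata.decomposition(precomposed)

nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", precomposed)

print("decomposition:", mapping)
print("NFD:", [f"U+{ord(c):04X}" for c in nfd])
print("NFC:", [f"U+{ord(c):04X}" for c in nfc])
```

Both forms come out as the base letter followed by the combining dot: the character decomposes canonically and, being excluded from composition, is not rebuilt by NFC.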

>Assigning new codepoints could ease the transcription of texts from legacy
>Hebrew encodings (including the Windows Hebrew and ISO Hebrew character
>sets) to Unicode without experiencing all these common problems: this would
>affect the mapping to Unicode of these legacy charsets, but certainly it
>would be beneficial in the long term.
>There are however two more subtle problems:
>1) Within the set of vowels U+05B0 to U+05BB (classes 10 to 20):
> They are all combining with a position "below", except U+05B9 POINT
>HOLAM (class 19).
> The U+05BB POINT QUBUTS vowel (class 20) is not grouped along other
>"below" vowels. In fact the canonical ordering attempts to force a unique
>order for all vowels, which does not take their real layout combining
> In reality, these vowels should have been given only 2 possible
>combining classes, such as 13 (position below) for all vowels, except POINT
>HOLAM which would have class 14 (or could be kept at its existing class 19,
>position above-left).
> The ordering problem can be solved using a CGJ before the U+05B9 POINT
>HOLAM vowel (class 19, above-left) if it needs to follow the U+05BB POINT
>QUBUTS vowel (class 20, below).
This would never be necessary. As holam really does not interact
typographically with low vowels, there can be no significance in their
relative ordering, and it is appropriate that they have different
combining classes. The problem is that the various low vowels have
different combining classes although they do interact typographically,
in breach of the standard itself. That is why CGJ often needs to be
inserted between vowel pairs.
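
That every low vowel has its own combining class can be seen by listing them. The following is a minimal sketch, assuming Python's standard unicodedata module; the grouping into LOW_VOWELS and HOLAM simply follows the discussion above.

```python
import unicodedata

# Vowel points positioned below the consonant, plus holam (above left).
LOW_VOWELS = "\u05B0\u05B1\u05B2\u05B3\u05B4\u05B5\u05B6\u05B7\u05B8\u05BB"
HOLAM = "\u05B9"

for ch in LOW_VOWELS + HOLAM:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: class {unicodedata.combining(ch)}")

low_classes = [unicodedata.combining(ch) for ch in LOW_VOWELS]
holam_class = unicodedata.combining(HOLAM)

# No two low vowels share a class, so any pair written in descending
# class order is swapped by canonical reordering.
all_distinct = len(set(low_classes)) == len(low_classes)
print("all low vowels in distinct classes:", all_distinct)
```

Since no two low vowels share a class, any two of them on one consonant have exactly one canonical order, whichever order the text intended.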

> The alternative would be to encode a new POINT HOLAM or a new POINT
>QUBUTS with a more correct class that respects the combining groups in
>the Hebrew script.
>2) Within cantillation marks (U+0591 to U+05AF, plus U+05C4 MARK UPPER DOT):
> The accents coded with class 220 (below), 222 (below-right), 228
>(below-left) have no problem.
228 is actually above-left, surely.

> However the remaining 19 accents and marks at class 230 (above) do not
>belong to the same combining category, the problems are for these 5
> U+059D ACCENT GERESH MUQDAM (alias "gerich mouqdam")
> U+05A0 ACCENT TELISHA GEDOLA (alias "talchah")
> which are combining at position above-right, and
> U+0599 ACCENT PASHTA (alias "qadma")
> U+05A9 ACCENT TELISHA QETANA (alias "tarsa")
> which are combining at position above-left.
There is also a problem with U+0592 which is also positioned above-left,
at least in many texts, but is in class 230. But U+05A1 is usually
centred above. See for
a useful summary of accent positions. But these positions vary - and
accent names vary even more.

> This is not strictly a problem for preserving the semantics of the text, as they
>share the same combining class, and so the normalization process will not
>reorder them. ...
There is a potential problem in that U+05AE, also positioned above left
(although wrongly shown as below left in your chart), is in class 228,
and does interact typographically with the other above left accents
which are in class 230. But this is probably of theoretical importance
only, and CGJ can be used if really necessary.
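
The interaction between U+05AE and the class 230 accents can be illustrated as follows. This is only a sketch, assuming Python's standard unicodedata module; the combination of U+0599 followed by U+05AE on an alef is a made-up test sequence, not one attested in any text.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# Accents named in the discussion above.
ACCENTS = ["\u0592", "\u0599", "\u059D", "\u05A0", "\u05A9", "\u05AE"]
accent_classes = {f"U+{ord(ch):04X}": unicodedata.combining(ch) for ch in ACCENTS}
for code, ccc in accent_classes.items():
    print(code, "class", ccc)

# Hypothetical sequence: alef, U+0599 (class 230), then U+05AE (class 228).
plain = "\u05D0\u0599\u05AE"
guarded = "\u05D0\u0599" + CGJ + "\u05AE"

plain_nfd = unicodedata.normalize("NFD", plain)
guarded_nfd = unicodedata.normalize("NFD", guarded)
print([f"U+{ord(c):04X}" for c in plain_nfd])
print([f"U+{ord(c):04X}" for c in guarded_nfd])
```

Without the CGJ the class 228 accent is moved in front of the class 230 accent; with it, the typed order survives normalisation.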

>... But it still prevents a more complete normalization that
>considers the case of these 5 accents which may (should?) be reordered.
>However, the case of multiple cantillation marks on the same vowel may be
>quite rare even in historic texts (but I don't have a copy of the large
>liturgical Hebrew texts to verify this.)
I posted before the results of my analysis of the rare cases of multiple
accents in the Hebrew Bible. The only cases in which there was a
potential normalisation issue were combinations of meteg and other
accents. Accents are very rarely used in other texts.

>If anyone is interested, I attach my Excel sheet which makes all this more
>visual, and which also includes the existing decompositions of compatibility
>characters (U+FBxx), shown in italic rows.
>The table is ordered by logical semantic and grouping. The combining classes
>that cause problems are shown with bold white on red squares, and the
>positioning constraints partly explained in the Unicode reference chapter
>can better be explained by looking at the positioning columns in the table.
>If errors remain in this table, please don't shout at me too much...
Thank you. I have pointed out above a few small errors of detail, but
the principle is good.

Peter Kirk
