Re: [hebrew] Re: Hebrew composition model, with cantillation marks

From: Philippe Verdy (
Date: Fri Oct 31 2003 - 12:47:19 CST

After careful analysis of the rendering versus canonical ordering problem of
dagesh/rafe/varika after shin/sin dots and before vowels, I conclude
that the proposal is not needed at all for rendering, as the
existing encoding already complies with Biblical Hebrew with exactly the
same rendering issues as the existing code points for shin/sin dots and

What the proposal solves is just the logical (semantic) order of
analysis of text; in fact it has exactly the same rendering capabilities
for the possible graphical interactions when positioning diacritics around
the base letter.

In the existing assignment of codepoints, all cantillation marks are already
given a high combining class that this proposal does not change (even though
this is the only place where interaction problems can exist, and only
in the possible case of multiple cantillation marks on the same base
consonant letter).

If you just look at the positioning properties of the other Hebrew diacritics,
and if you just classify them into one of the 7 positioning areas:

    central, above, above-left, above-right, below, below-left, below-right

Then possible interpretation problems will occur only if two diacritics that
share the same positioning must be ordered. However, such collisions of
positioning only occur in these cases:

1) There are several vowel groups on the same consonant (a vowel group is a
single vowel, with its possible meteg and cantillation marks): the
assignment of new codepoints with different combining classes for vowels
and for the sin/shin dot (which belongs to the consonant group) will not
change anything in this encoding issue. This is the case where a CGJ may be
needed to separate the vowel groups, and it would still be needed with the
proposal if there is any meteg-sillouq or cantillation mark to keep
together with the correct vowel. This case appears for example in
Yerushala(y)im, as a consequence of a missing consonant Yod (implied by the
reader but not actually written), but the suggestion of a CGJ between vowel
groups for which the canonical ordering would cause a problem solves it.
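
As a minimal sketch of case 1 (not part of the original analysis), the
effect of CGJ on canonical ordering can be checked with Python's standard
unicodedata module. The choice of lamed with patah followed by hiriq is an
assumption made for illustration; the actual pointing of the word varies.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

# Two vowels on one consonant, typed in the intended reading order.
# (Illustrative choice of vowels, not taken from the original message.)
plain = LAMED + PATAH + HIRIQ
# The same sequence with a CGJ separating the two vowel groups.
joined = LAMED + PATAH + CGJ + HIRIQ

def codepoints(s):
    """Return the string as a list of U+XXXX labels for easy inspection."""
    return [f"U+{ord(c):04X}" for c in s]

# Canonical ordering sorts marks by combining class, so the two vowels
# are swapped when nothing separates them.
plain_nfd = unicodedata.normalize("NFD", plain)
# Marks are never moved across a class-0 character such as CGJ.
joined_nfd = unicodedata.normalize("NFD", joined)

print(codepoints(plain_nfd))
print(codepoints(joined_nfd))
```

Without the CGJ the typed order of the two vowels does not survive
normalization; with it, the sequence is left untouched.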

2.a) Between sin dot (currently class 25) which is a consonant modifier that
belongs to the consonant group, and point holam (currently class 19) which
is a vowel modifier: they share the same positioning area ("above-left"), so
this may create an ambiguity. However, they use exactly the same glyph, so
the rendered order is not significant, and the interpretation of either
rendered dot glyph will not matter. It is as if sin dot and point
holam were duplicate code points, and it is the reader who
interprets the dot above-left as a sin-dot consonant modifier (if above a
base shin letter) or as a holam-point vowel. It is the writer who chooses
to encode one code or the other according to this interpretation. In
practice, the sin-dot is entered together with the shin base letter as a
precombined character (even if a Unicode normalization decomposes it), and
the holam is entered separately above another base consonant letter. The
distinction between sin-dot and point-holam in Unicode code points is not
relevant for rendering, but only for semantic analysis. It would be possible
to use a sin-dot/point-holam folding without affecting the effective
rendering. For this reason, I suggest that Unicode recommend that font
authors make no distinction in the rendered glyph for the two code points,
which is a single dot positioned above-left.
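
A short sketch, assuming only Python's standard unicodedata module, that
prints the combining classes of the two dots and shows in which order
normalization leaves them on a shin base letter. The variable names are
illustrative only.

```python
import unicodedata

SHIN, SIN_DOT, HOLAM = "\u05E9", "\u05C2", "\u05B9"

# Report the canonical combining class of each of the two dots.
for ch in (SIN_DOT, HOLAM):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))

# Typed in logical order: consonant, consonant modifier, then vowel.
typed = SHIN + SIN_DOT + HOLAM
normalized = unicodedata.normalize("NFD", typed)

# Holam has the lower class, so it ends up before the sin dot.
print([f"U+{ord(c):04X}" for c in normalized])
```

Since both marks are drawn as the same dot in the same area, this swap
changes the stored order but not what the reader sees.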

2.b) Within the set of additional accents positioned above-left, most are
incorrectly mapped to class 230 (segol, pashta, pazer, telisha qetana) and
only U+05AE accent tzinor is correctly mapped to class 228. Unless it is
demonstrated that multiple occurrences of these marks are needed for the
same vowel (or implied vowel), this is not a problem. They are correctly
ordered in normalized form after the sin-dot consonant modifier and the
point-holam vowel. Note that the proposal does not change the
canonical reordering between sin-dot (or point-holam) and these above-left
accents, but it still keeps the inconsistent classes used for these accents.
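
The classes quoted in 2.b can be listed with a few lines of Python. This is
only a sketch: the code points other than U+05AE are taken from the Unicode
Hebrew chart rather than from this message, and the labels follow the
spelling used here.

```python
import unicodedata

# Accents positioned above-left, keyed by code point.
above_left = {
    "\u0592": "accent segol",
    "\u0599": "accent pashta",
    "\u05A1": "accent pazer",
    "\u05A9": "accent telisha qetana",
    "\u05AE": "accent tzinor",
}

# Look up the canonical combining class of each accent.
classes = {label: unicodedata.combining(ch) for ch, label in above_left.items()}

for label, ccc in classes.items():
    print(f"{label:24} {ccc}")
```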

3) For the position area "above", between U+05BF rafe (currently class 23)
or U+FB1E varika (currently class 26), which are consonant modifiers, and any
mark that alters the vowel (+meteg) in the following set:
    U+0593 accent chalshelet
    U+0594 accent zaqef qatan
    U+0595 accent zaqef gadol
    U+0597 accent ravia (alias "revia")
    U+0598 accent zarqa (alias "tsinorit")
    U+059C accent geresh (alias "gerich")
    U+059E accent gershaym (alias "shene grishin")
    U+059F accent qarney para (alias "karne farah")
    U+05A8 accent qadma (alias "azla")
    U+05AB accent oleh
    U+05AC accent iluy (alias "ilouz")
    U+05AF mark massora circle
    U+05C4 mark upper dot
which are all currently assigned to combining class 230. The
canonical ordering does not cause any problem here, and the proposal does
not reassign any of them. There could be a problem if
rafe/varika needed to be rendered in a significant order on the right,
left or top of the above accents, but I have not found any occurrences of
this in the proposal. There would also be a possible problem if
both rafe and varika needed to be rendered; however, their usage patterns
seem distinct, as they are considered variants of each other, one for
Judeo-Spanish, the other for traditional Hebrew.

4) For the position area "above-right", between the U+05C1 shin-dot
consonant modifier (currently in class 24, and proposed for class 10 in the
PDF) and the two following marks:
    U+059D accent geresh muqdam
    U+05A0 accent telisha gedola (alias "talsha")
which alter the vowel (+meteg) and are mapped to class 230. The PDF
proposal does not modify their mutual order in the case of
normalization, so there is no need to reencode shin-dot.

5) For the position "below", between all vowels (except holam) currently
mapped to classes 10 to 20 (and that the proposal wants to duplicate with
new codepoints in class 220), the meteg vowel modifier (currently mapped to
class 22, and duplicated in class 220 in the proposal), and any of
the following marks (which the proposal does not change, as they are
already in class 220):
    U+0591 accent etnahta (alias "atnah")
    U+0596 accent tipeha (alias "tarha")
    U+059B accent tevir
    U+05A3 accent munah (alias "chofar holekh")
    U+05A4 accent mahapakh (alias "chofar mehouppak")
    U+05A5 accent merkha (alias "yored" or "marikh")
    U+05A6 accent merkha kefoula (alias "tere tame")
    U+05A7 accent darga
    U+05AA accent yerah ben yomo (alias "gagal" or "yareah ben yomo")
The proposed reencoding of vowels (except holam) and of meteg has the effect
of making the encoding order of all these characters significant. However,
the existing encoding in Unicode already orders the meteg vowel modifier
after all vowels, and the other marks above after meteg. Worse, the proposal
removes the canonical ordering, which is what best helps to unify
equivalent strings (or strings that are incorrectly encoded with a mark
before the vowel). The only merit of this proposal is that it makes it
possible to distinguish the case where the meteg vowel modifier needs to be
positioned before the vowel mark (right meteg) instead of after it (left
meteg). But it still does not solve the problem of the correct way to encode
the medial position meteg.
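
The behaviour described in case 5 can be observed directly. The following is
a minimal sketch using Python's standard unicodedata module; the base letter
(bet) and the vowel (qamats) are arbitrary choices made for illustration.

```python
import unicodedata

BET, QAMATS, METEG, ETNAHTA = "\u05D1", "\u05B8", "\u05BD", "\u0591"

def nfd(s):
    """Apply canonical decomposition, which includes canonical ordering."""
    return unicodedata.normalize("NFD", s)

left_meteg = BET + QAMATS + METEG    # meteg typed after the vowel
right_meteg = BET + METEG + QAMATS   # meteg typed before the vowel

# Both spellings normalize to the same string, so the existing code
# points cannot distinguish a right meteg from a left meteg.
same_after_nfd = nfd(left_meteg) == nfd(right_meteg)
print(same_after_nfd)

# A mark typed before the vowel is moved back after vowel and meteg.
scrambled = BET + ETNAHTA + METEG + QAMATS
repaired = nfd(scrambled)
print([f"U+{ord(c):04X}" for c in repaired])
```

This shows both sides of the argument: canonical ordering unifies
differently typed strings, and for the same reason it cannot carry the
right/left meteg distinction without some extra mechanism.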

I think that the Unicode standard should specify clearly what happens when
multiple Hebrew diacritics share the same positioning area above
or below the letter: they must be rendered side-by-side, right-to-left,
except for the position below (vowels, meteg and a few accents), which would
preferably be rendered left-to-right in case of collision of diacritics.

In summary, the PDF document signed by Microsoft and a few
others does not solve any significant rendering problem, only the case
of the right meteg. It is not worth the effort of reencoding all these
vowels and sin/shin dots, because the problem to solve is not there. The
only merit is that it adds rendering support for the right meteg. I think
that we can keep the existing vowels and sin/shin dots as they are,
and instead specify how the right or medial meteg should be encoded (for
example by adding only the right and medial meteg as new code points).

The idea behind this document rests on performance concerns about encoding
Hebrew in logical order, but a renderer can simply use its own normalizer,
based on the positioning classes I document here, to help render the strings
as they should be, or to collate them correctly (base consonant, consonant
modifiers dagesh/rafe/varika, vowels, meteg, accents and marks) according
to their visual interactions.

All this analysis is clearer if one reads the last Excel sheet I posted,
which shows the relative visual interactions between Hebrew characters. What
I mean here is that Unicode combining classes are still relevant for
normalization, and that string collation or rendering should instead use its
own combining classes, according to one of the 7 position areas for Hebrew
diacritics, with an algorithm exactly copied from the standard normalization.
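
To make this idea concrete, here is a rough sketch of such a private
reordering pass. It copies the shape of canonical ordering (a stable sort of
each run of marks between class-0 characters) but uses its own class table.
The four class values, the names private_class and reorder, and the grouping
of code points are all assumptions made for illustration; the table is
partial and ranks marks only by group, not yet by position area.

```python
# Hypothetical private classes, NOT Unicode combining classes.
# 0 means "base or barrier": marks are never moved across it.
CONSONANT_MODIFIER, VOWEL, METEG, ACCENT = 1, 2, 3, 4

def private_class(ch):
    """Return the illustrative rendering/collation class of a character."""
    cp = ord(ch)
    # dagesh, rafe, shin dot, sin dot, varika
    if cp in (0x05BC, 0x05BF, 0x05C1, 0x05C2, 0xFB1E):
        return CONSONANT_MODIFIER
    # Hebrew points sheva .. qubuts (includes holam)
    if 0x05B0 <= cp <= 0x05BB:
        return VOWEL
    if cp == 0x05BD:
        return METEG
    # cantillation accents, masora circle, upper dot
    if 0x0591 <= cp <= 0x05AF or cp == 0x05C4:
        return ACCENT
    return 0

def reorder(text):
    """Stable-sort each run of marks by private class, as canonical
    ordering does with combining classes."""
    out, run = [], []
    for ch in text:
        if private_class(ch) == 0:
            # A base (or CGJ, or anything unknown) closes the current run.
            out.extend(sorted(run, key=private_class))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=private_class))
    return "".join(out)

# shin + qamats + dagesh + shin dot, i.e. the canonical order of the marks
canonical = "\u05E9\u05B8\u05BC\u05C1"
logical = reorder(canonical)
print([f"U+{ord(c):04X}" for c in logical])
```

Because Python's sorted() is stable, marks of the same private class keep
their relative input order, and a CGJ still acts as a barrier since it falls
into class 0 here. A fuller table could add the position area as a secondary
sort key.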

All that is needed, then, is a way to encode the right and medial meteg, and
I think it would be better to just add codepoints for this case, so that the
set of diacritics in position "below" can be ordered correctly for rendering
and semantic needs. In the interim, a CGJ control can be used to force the
correct order, or to build a medial meteg by splitting the compound hataf
vowel into its two parts and using CGJ after the first vowel part plus meteg
and before the second vowel part.

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:25 CST