Re: Merging combining classes, was: New contribution N2676

From: Peter Kirk
Date: Mon Oct 27 2003 - 09:49:47 CST

On 27/10/2003 06:54, Philippe Verdy wrote:

>Thanks a lot for these clarifications on Hebrew usages that need those
>combining order overrides.
>This demonstrates that this occurs relatively infrequently, and so
>introducing an ignorable "combining order override" control makes sense,
>without needing to add duplicate codepoints with corrected properties.
>What is important here is whether the lack of this override or separate
>codepoint makes the text ambiguous. With your comments, I see that the
>Hebrew logical order may not always need to be respected in the encoded
>string, provided that the character identity (for example the sin letter) is
>preserved, according to users' expectations (notably if a combined character
>is mapped on the common keyboard).
>I would then say that the Hebrew language should need to represent grapheme
>clusters as:
>- a logical combining sequence for the initial consonant and its modifier
>(like shin dot)
>- then the logical combining sequences for each extra vowel sign with their
>The problem here is that consonant modifiers, vowels and accents in Hebrew
>are all encoded as combining characters, but each subgroup belongs to
>combining classes whose value ranges are overlapping. With the current
>model, only 1 combining sequence can be encoded, without sub-hierarchy. If
>only the Hebrew vowels had been encoded as separate base characters instead
>of combining characters, we would not have this problem, as they would
>initiate their own combining sequence.
>That's where a CCO (combining class override) control character (CGJ or
>other) can help: it can be used to force a missing and separate base
>character for vowels, notably for the second vowel group, but also for the
>consonant modifier (shin dot) if it is followed by a vowel group.
>We won't change the combining classes. And we won't reform the normalization
>rules as defined for NF* conformance. But we can add further normalization
>steps for Hebrew, describing the correct use of the combining order
>overrides, and that correctly reorders all the combining characters after
>the initial consonant, to generate the correct logical order. And we can
>make font renderers accept this new encoding, by letting them recognize the

Thank you for the interesting thoughts. As I understand your suggestion,
and bearing in mind that dagesh (and the rare rafe) are also consonant
modifiers, you are effectively suggesting an order (already normalised):

consonant dagesh rafe shin/sin-dot CGJ right-meteg CGJ vowel accent CGJ
vowel2 accent2

with each element being optional, and CGJ being omitted where it would
fall at the beginning or the end of the string of combining marks, or
would be doubled.
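
To illustrate why a CGJ is needed at all in this order, here is a minimal Python sketch using the standard unicodedata module. The variable and helper names are only illustrative, and the example covers just one consonant with shin dot and one vowel, not the full order above. Without CGJ, canonical reordering moves the vowel (class 18) in front of the shin dot (class 24); with CGJ (class 0) between them, the order is kept.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN, combining class 0
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT, combining class 24
QAMATS = "\u05B8"    # HEBREW POINT QAMATS, combining class 18
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER, combining class 0


def codepoints(s):
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


# Logical order: consonant, its modifier, then the vowel.
logical = SHIN + SHIN_DOT + QAMATS
# Same sequence with CGJ separating the modifier from the vowel.
with_cgj = SHIN + SHIN_DOT + CGJ + QAMATS

# Canonical reordering sorts adjacent marks by combining class,
# so the vowel moves ahead of the shin dot.
print(codepoints(unicodedata.normalize("NFD", logical)))   # 05E9 05B8 05C1
# CGJ has class 0, so marks on either side of it are never swapped.
print(codepoints(unicodedata.normalize("NFD", with_cgj)))  # 05E9 05C1 034F 05B8
```

The same holds for NFC here, since the precomposed shin with shin dot is excluded from composition.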

This would, I think, work, and at least come close to being rendered
correctly with current fonts modified to ignore CGJ (which actually they
should do anyway as CGJ is default ignorable). The downside is the
large number of CGJs required. Dagesh occurs 171701 times in the Hebrew
Bible (eBHS), shin dot 46277 times, and sin dot 12128 times. As this
proposal would require CGJ to be added after any group of one or more of
these together, followed by a vowel (nearly always present) or an
accent, the effect of this proposal is that CGJ would have to be used
nearly 200,000 times in the Hebrew Bible, instead of just over 1000
times. This is not in itself a reason to reject the idea, but it does
undermine your initial argument in favour of CGJ.
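
For anyone who wants to check such a figure against their own text, the counting rule can be sketched as below. This is only a rough sketch: the function name cgjs_needed is hypothetical, the input is assumed to be in logical (not normalised) order, and "vowel or accent" is approximated as "any other mark with a non-zero combining class", which is slightly broader since it also catches meteg.

```python
import unicodedata

# Consonant modifiers: dagesh, rafe, shin dot, sin dot.
MODIFIERS = {"\u05BC", "\u05BF", "\u05C1", "\u05C2"}


def cgjs_needed(text):
    """Count modifier groups that are directly followed by another mark.

    Each such group would need one CGJ after it under the proposal.
    A group of several modifiers on one consonant is counted once.
    """
    count = 0
    in_group = False
    for ch in text:
        if ch in MODIFIERS:
            in_group = True
            continue
        # A non-modifier ends the group; it needs a CGJ only if the
        # character ending it is itself a combining mark.
        if in_group and unicodedata.combining(ch) != 0:
            count += 1
        in_group = False
    return count


# shin + shin dot + qamats, lamed, vav + holam, final mem: one CGJ.
print(cgjs_needed("\u05E9\u05C1\u05B8\u05DC\u05D5\u05B9\u05DD"))  # 1
# shin + dagesh + shin dot + hiriq: both modifiers form one group.
print(cgjs_needed("\u05E9\u05BC\u05C1\u05B4"))                    # 1
# bet + dagesh with no vowel or accent after it: no CGJ needed.
print(cgjs_needed("\u05D1\u05BC"))                                # 0
```

The three counts quoted above sum to 230106, which is only an upper bound, since a dagesh and a shin or sin dot on the same letter form a single group.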

I am not sure what you mean by "further normalization steps for Hebrew".
If this means that users will be expected to input Hebrew in this order,
perhaps with a keyboard driver which inserts the necessary CGJs, this is
good. But I don't think it is reasonable to expect software producers to
add an extra layer to their software specifically for Hebrew, especially
when they are already refusing to add a layer of more general
applicability that the standard specifically requires.

Peter Kirk