Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Oct 27 2003 - 08:54:13 CST


Thanks a lot for these clarifications on the Hebrew usages that need
combining order overrides.
They show that such cases occur relatively infrequently, so
introducing an ignorable "combining order override" control makes sense,
without needing to add duplicate codepoints with corrected properties.

What is important here is whether the lack of this override or of a separate
codepoint makes the text ambiguous. From your comments, I see that the
Hebrew logical order may not always need to be respected in the encoded
string, provided that the character identity (for example the sin letter) is
preserved, according to users' expectations (notably if a combined character
is mapped on the common keyboard).

I would then say that Hebrew text should represent grapheme
clusters as:
- a logical combining sequence for the initial consonant and its modifier
(like shin dot)
- then the logical combining sequences for each extra vowel sign with its
accentuation.

The problem here is that consonant modifiers, vowels and accents in Hebrew
are all encoded as combining characters, but each subgroup belongs to
combining classes whose value ranges overlap. With the current
model, only one combining sequence can be encoded, without sub-hierarchy. If
the Hebrew vowels had been encoded as separate base characters instead
of combining characters, we would not have this problem, as they would
initiate their own combining sequences.
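
To make the class values concrete, here is a minimal sketch, assuming
Python's standard unicodedata module. The MARKS table and the helper name
combining_classes are only illustrative; the code points are those of the
Unicode Hebrew block. It lists the canonical combining class of a few marks
and shows canonical reordering swapping two vowels typed on a lamed.

```python
import unicodedata

# Illustrative selection of Hebrew marks (informal names, real code points).
MARKS = {
    "hiriq": "\u05B4",
    "patah": "\u05B7",
    "qamats": "\u05B8",
    "dagesh": "\u05BC",
    "meteg": "\u05BD",
    "shin dot": "\u05C1",
    "sin dot": "\u05C2",
    "etnahta": "\u0591",
    "geresh": "\u059C",
}

def combining_classes(marks=MARKS):
    """Map each mark name to its canonical combining class."""
    return {name: unicodedata.combining(ch) for name, ch in marks.items()}

LAMED = "\u05DC"

# Vowels in the order a user might type them: patah first, then hiriq.
typed = LAMED + MARKS["patah"] + MARKS["hiriq"]

# Canonical ordering sorts adjacent marks by combining class, so the
# hiriq (lower class) moves in front of the patah.
normalized = unicodedata.normalize("NFD", typed)

if __name__ == "__main__":
    for name, ccc in combining_classes().items():
        print(f"{name:10} ccc={ccc}")
    print([f"U+{ord(c):04X}" for c in typed])
    print([f"U+{ord(c):04X}" for c in normalized])
```

All of these marks live in the single combining sequence that follows the
consonant, which is why the typed order of the two vowels is lost.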

That's where a CCO (combining class override) control character (CGJ or
other) can help: it can be used to supply the missing separate base
character for vowels, notably for the second vowel group, but also for the
consonant modifier (shin dot) if it is followed by a vowel group.
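
A small sketch of that effect, assuming CGJ (U+034F) is the control used and
relying only on Python's standard unicodedata module; the helper name
survives_normalization is hypothetical. Because CGJ has combining class 0,
canonical reordering cannot move a mark across it.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, canonical combining class 0
LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"

def survives_normalization(text, form="NFD"):
    """Return True if normalizing `text` to `form` leaves it unchanged."""
    return unicodedata.normalize(form, text) == text

# Two vowels in non-canonical order, without and with an intervening CGJ.
plain = LAMED + PATAH + HIRIQ
with_cgj = LAMED + PATAH + CGJ + HIRIQ

if __name__ == "__main__":
    print("plain survives NFD:   ", survives_normalization(plain))
    print("with CGJ survives NFD:", survives_normalization(with_cgj))
```

The price is the one already noted in this thread: the two spellings are no
longer canonically equivalent, so where the CGJ goes has to be fixed by a
spelling convention.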

We won't change the combining classes. And we won't reform the normalization
rules as defined for NF* conformance. But we can add further normalization
steps for Hebrew that describe the correct use of the combining order
overrides and that correctly reorder all the combining characters after
the initial consonant, to generate the correct logical order. And we can
make font renderers accept this new encoding, by letting them recognize the
CCO.
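
One possible shape for such an extra step, as a rough sketch only: the
function name protect_mark_order is hypothetical, CGJ is assumed to be the
override, and the input is assumed to be already decomposed and to carry its
marks in the intended logical order. It inserts a CGJ wherever a mark has a
lower combining class than the mark before it, which is exactly where
canonical reordering would otherwise swap them.

```python
import unicodedata

CGJ = "\u034F"  # assumed override character, combining class 0

def protect_mark_order(text):
    """Insert CGJ before any mark that canonical ordering would move left.

    Marks with equal classes are never reordered, so only a strict
    decrease in combining class needs protecting.
    """
    out = []
    prev_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc and prev_ccc > ccc:
            out.append(CGJ)
        out.append(ch)
        prev_ccc = ccc
    return "".join(out)

if __name__ == "__main__":
    # <base, vowel1, accent1, vowel2, accent2> with illustrative marks:
    # lamed, qamats, etnahta, patah, geresh.
    logical = "\u05DC\u05B8\u0591\u05B7\u059C"
    protected = protect_mark_order(logical)
    print([f"U+{ord(c):04X}" for c in protected])
    print(unicodedata.normalize("NFD", protected) == protected)
```

This sketch cannot tell a deliberate order from a typing accident, so it does
not replace a per-language description of which sequences are legitimate.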

----- Original Message -----
From: "Peter Kirk" <peterkirk@qaya.org>
To: "John Hudson" <tiro@tiro.com>
Cc: <unicode@unicode.org>
Sent: Monday, October 27, 2003 1:48 PM
Subject: Re: Merging combining classes, was: New contribution N2676

> On 26/10/2003 19:58, John Hudson wrote:
>
> > ...
> > Functionally, inserting a CGJ here resolves the problem fine. I'm just
> > not convinced that CGJ is a good general solution to the normalisation
> > problem: it works, but it requires deliberate insertion in every place
> > where unwanted mark re-ordering may occur. If I have some free time
> > over the next while, I'll try to figure out just how many places in
> > the Bible text this would be needed: I suspect it is quite a lot. Of
> > course, if you automatically insert CGJ after every mark, you are
> > sure that re-ordering will not take place, but you also lose any
> > benefit of normalisation.
> >
> > John Hudson
> >
> CGJ is likely to be needed:
>
> 1) whenever two vowels come together in non-canonical order:
> approximately 638 times in the WTS eBHS text of the Hebrew Bible (over 5
> MB of UTF-8), with little variation in other texts - all but two of
> these cases are in Yerushala(y)im;
>
> 2) according to my proposal, for every occurrence of right meteg:
> approximately 905 times in eBHS but with a potentially large variation
> between texts;
>
> 3) possibly also for every occurrence of medial meteg: approximately 78
> times in eBHS.
>
> Philippe made a good point that the ordering of combining characters
> relative to CGJ needs to be constrained, as a spelling convention
> because it cannot be by normalisation. But the ordering here should be
> related to the logic of the language.
>
> In the case of Yerushalayim, the second vowel is somehow auxiliary and
> relates to an omitted consonant, whereas the first vowel and the accent
> (often but not always present) go with the lamed which is written. So in
> this case the appropriate order is <base character, vowel1, accent, CGJ,
> vowel2>. In the odd case of two vowels and two accents on one base
> character in Exodus 20:4 (see
> http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html section
> 3.2), the most logical order is actually <base character, vowel1,
> accent1, CGJ, vowel2, accent2>, because the second accent (geresh) goes
> with the second vowel (patah).
>
> The situation is rather different for right meteg, if CGJ is used for
> this, as it is always written to the right of all other combining marks
> and the other marks are in their regular positions. So the most logical
> ordering would be <base character, meteg, CGJ, vowel, accent>.
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
>
>
>


