Re: Merging combining classes, was: New contribution N2676

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Oct 27 2003 - 10:45:21 CST


> Thank you for the interesting thoughts. As I understand your suggestion,
> and bearing in mind that dagesh (and the rare rafe) are also consonant
> modifiers, you are effectively suggesting an order (already normalised):
>
> consonant dagesh rafe shin/sin-dot CGJ right-meteg CGJ vowel accent CGJ
> vowel2 accent2
>
> with each element being optional, and CGJ being omitted when it is at
> the beginning or the end of the string of combining marks, or doubled.
>
> This would, I think, work, and at least come close to being rendered
> correctly with current fonts modified to ignore CGJ (which actually they
> should do anyway as CGJ is default ignorable). The down side is the

There are two very different cases that appear to be conflated by the above
example.

1. Current engines incorrectly rendering canonically equivalent text.

If a rendering engine renders X Y Z correctly, but doesn't render a
canonically-equivalent X Z Y correctly, then there is a problem in the engine.
[Note: this would be for sequences X Y Z that would actually occur in practice.]

Using CGJ for this would simply be a mechanism to get by current deficiencies in
the engines.
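
As a rough illustration of case 1, here is a small sketch using Python's standard unicodedata module. The sequence (lamed, patah, meteg) and the helper name canonically_equivalent are my own choices for the example, not something taken from the proposal. Two strings are canonically equivalent when their normalized forms match, so a conformant engine should render both orders the same way:

```python
import unicodedata

# Illustrative characters (an assumption for this sketch):
LAMED = "\u05DC"  # base consonant, combining class 0
PATAH = "\u05B7"  # vowel point, combining class 17
METEG = "\u05BD"  # meteg, combining class 22


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


xyz = LAMED + PATAH + METEG  # X Y Z
xzy = LAMED + METEG + PATAH  # X Z Y

print(xyz == xzy)                        # the code point sequences differ
print(canonically_equivalent(xyz, xzy))  # but they normalize identically
```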

2. Unicode not making a distinction between X Y Z and X Z Y.

Where there are cases where canonically-equivalent X Y Z and X Z Y should be
rendered differently, then CGJ could be used to preserve the distinction, as per
the UTC decision:

[96-C20] Consensus: Add text to Unicode 4.0.1 which points out that combining
grapheme joiner has the effect of preventing the canonical re-ordering of
combining marks during normalization. [L2/03-235, L2/03-236, L2/03-234]

[96-A72] Action Item for Ken Whistler: Draft language for consensus 96-C20 (on
the effect of combining grapheme joiner to prevent canonical re-ordering of
combining marks during normalization) for inclusion into Unicode 4.0.1 and
create a FAQ describing this effect as well. [L2/03-235, L2/03-236, L2/03-234]
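
The effect described in 96-C20 can be sketched in the same way, again assuming Python's unicodedata module and an arbitrarily chosen sequence (lamed, meteg, patah). CGJ (U+034F) has combining class 0, so the canonical ordering step does not move marks across it:

```python
import unicodedata

# Illustrative characters (an assumption for this sketch):
LAMED = "\u05DC"  # base consonant, combining class 0
PATAH = "\u05B7"  # vowel point, combining class 17
METEG = "\u05BD"  # meteg, combining class 22
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def nfd(s: str) -> str:
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


plain = LAMED + METEG + PATAH           # meteg (22) before patah (17)
with_cgj = LAMED + METEG + CGJ + PATAH  # CGJ separates the two marks

# Combining classes of each code point in the CGJ sequence.
print([unicodedata.combining(c) for c in with_cgj])

# Without CGJ, normalization reorders the marks by combining class.
print(nfd(plain) == LAMED + PATAH + METEG)

# With CGJ, the original order survives normalization.
print(nfd(with_cgj) == with_cgj)
```

Note that the two results are no longer canonically equivalent to each other, which is exactly what makes CGJ usable for preserving a distinction.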

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message -----
From: "Peter Kirk" <peterkirk@qaya.org>
To: "Philippe Verdy" <verdy_p@wanadoo.fr>
Cc: <unicode@unicode.org>; <hebrew@unicode.org>
Sent: Mon, 2003 Oct 27 07:49
Subject: Re: Merging combining classes, was: New contribution N2676

> On 27/10/2003 06:54, Philippe Verdy wrote:
>
> >Thanks a lot for these clarifications on Hebrew usages that need those
> >combining order overrides.
> >This demonstrates that this occurs relatively infrequently, and so
> >introducing an ignorable "combining order override" control makes sense,
> >without needing to add duplicate codepoints with corrected properties.
> >
> >What is important here is whether the lack of this override or separate
> >codepoint makes the text ambiguous. With your comments, I see that the
> >Hebrew logical order may not always need to be respected in the encoded
> >string, provided that the character identity (for example the sin letter) is
> >preserved, according to users' expectations (notably if a combined character
> >is mapped on the common keyboard).
> >
> >I would then say that the Hebrew language would need to represent grapheme
> >clusters as:
> >- a logical combining sequence for the initial consonant and its modifier
> >(like shin dot)
> >- then the logical combining sequences for each extra vowel sign with their
> >accentuation.
> >
> >The problem here is that consonant modifiers, vowels and accents in Hebrew
> >are all encoded as combining characters, but each subgroup belongs to
> >combining classes whose value ranges are overlapping. With the current
> >model, only 1 combining sequence can be encoded, without sub-hierarchy. If
> >only the Hebrew vowels had been encoded as separate base characters instead
> >of combining characters, we would not have this problem, as they would
> >initiate their own combining sequence.
> >
> >That's where a CCO (combining class override) control character (CGJ or
> >other) can help: it can be used to force a missing and separate base
> >character for vowels, notably for the second vowel group, but also for the
> >consonant modifier (shin dot) if it is followed by a vowel group.
> >
> >We won't change the combining classes. And we won't reform the normalization
> >rules as defined for NF* conformance. But we can add further normalization
> >steps for Hebrew that describe the correct use of the combining order
> >overrides and correctly reorder all the combining characters after
> >the initial consonant, to generate the correct logical order. And we can
> >make font renderers accept this new encoding, by letting them recognize the
> >CCO.
> >
> Thank you for the interesting thoughts. As I understand your suggestion,
> and bearing in mind that dagesh (and the rare rafe) are also consonant
> modifiers, you are effectively suggesting an order (already normalised):
>
> consonant dagesh rafe shin/sin-dot CGJ right-meteg CGJ vowel accent CGJ
> vowel2 accent2
>
> with each element being optional, and CGJ being omitted when it is at
> the beginning or the end of the string of combining marks, or doubled.
>
> This would, I think, work, and at least come close to being rendered
> correctly with current fonts modified to ignore CGJ (which actually they
> should do anyway as CGJ is default ignorable). The down side is the
> large number of CGJs required. Dagesh occurs 171701 times in the Hebrew
> Bible (eBHS), shin dot 46277 times, and sin dot 12128 times. As this
> proposal would require CGJ to be added after any group of one or more of
> these together, followed by a vowel (nearly always present) or an
> accent, the effect of this proposal is that CGJ would have to be used
> nearly 200,000 times in the Hebrew Bible, instead of just over 1000
> times. This is not in itself a reason to reject the idea, but it does
> undermine your initial argument in favour of CGJ.
>
> I am not sure what you mean by "further normalization steps for Hebrew".
> If this means that users will be expected to input Hebrew in this order,
> perhaps with a keyboard driver which inserts the necessary CGJs, this is
> good. But I don't think it is reasonable to expect software producers to
> add an extra layer to their software specifically for Hebrew, especially
> when now they are refusing to add such a layer with more general
> applicability when specifically required to do so in the standard.
>
> --
> Peter Kirk
> peter@qaya.org (personal)
> peterkirk@qaya.org (work)
> http://www.qaya.org/
