Re: CGJ - Combining Class Override

From: Philippe Verdy (
Date: Sat Oct 25 2003 - 09:58:27 CST

From: "Jony Rosenne" <>

> For the record, I repeat that I am not convinced that the CGJ is an
> appropriate solution for the problems associated with the right Meteg. I
> tend to think we need a separate character.

Yes, it's possible to devise another character whose explicit
purpose is to override the ordering imposed by combining classes.
But this still does not change the problem: text that is in an
NF* form under any past or present version of Unicode MUST remain
in that NF* form after further additions to the standard.

If one votes for a separate control character, it should come with
precise rules describing how such an override can or must be used,
so that we won't break existing implementations. This character
will necessarily have combining class 0, but will still have a
preceding context. Strict conformance for the new NF* forms must
still obey the precise ordering rules, and this character,
whatever its form, shall not be used when it is not needed, i.e.
when the existing NF* forms already produce the correct logical
order (that's why its use should be restricted to a list of known
combining characters that may need this override).
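
To make the reordering concrete, here is a minimal Python sketch
using the standard unicodedata module. The sample characters
(lamed, meteg, patah) are my own choice for illustration, and CGJ
stands in for any class-0 character placed between the two marks;
this only shows current normalization behaviour, not a proposed
design.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED, combining class 0
METEG = "\u05BD"   # HEBREW POINT METEG, combining class 22
PATAH = "\u05B7"   # HEBREW POINT PATAH, combining class 17
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0


def nfc(text):
    """Return the NFC form of text."""
    return unicodedata.normalize("NFC", text)


# Meteg typed before the vowel: canonical ordering moves it after,
# because CC(meteg) = 22 is greater than CC(patah) = 17.
reordered = nfc(LAMED + METEG + PATAH)

# A class-0 character between the two marks blocks the reordering.
protected = nfc(LAMED + METEG + CGJ + PATAH)

print([hex(ord(ch)) for ch in reordered])
print([hex(ord(ch)) for ch in protected])
```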

Call it <CCO> "Combining Class Override"? This does not change
the problem: this character should be used only between pairs
of combining characters, such that the encoded sequence:
    {c1, CCO, c2}
conforms to the rules:
    (1) CC(c1) > CC(c2) > 0,
    (2) c1 is known (listed by Unicode?) to require this override
    to keep the logical ordering needed for correct text semantics.
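
The two rules can be sketched as a small check in Python. Note the
assumptions: OVERRIDE_ALLOWED is a hypothetical list (Unicode
publishes no such list, and meteg is in it only as an example),
and cco_use_is_valid is a name I made up for this sketch.

```python
import unicodedata

# Hypothetical list of marks known to require the override (rule 2).
# No such list exists in Unicode; METEG is here only as an example.
OVERRIDE_ALLOWED = {"\u05BD"}


def cco_use_is_valid(c1, c2, allowed=OVERRIDE_ALLOWED):
    """Check whether {c1, CCO, c2} satisfies rules (1) and (2)."""
    cc1 = unicodedata.combining(c1)
    cc2 = unicodedata.combining(c2)
    rule1 = cc1 > cc2 > 0      # (1) CC(c1) > CC(c2) > 0
    rule2 = c1 in allowed      # (2) c1 is listed as needing the override
    return rule1 and rule2
```

For instance, cco_use_is_valid("\u05BD", "\u05B7") holds (22 > 17 > 0
and meteg is listed), while the reverse order fails rule (1), since
normalization would not have reordered that pair anyway.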

The second requirement is there to prevent abuse of this
character. But it is not enforceable if CGJ is kept for this function.

The CCO character should then be made "ignorable" for
collation or text breaks, so that collation keys will become:
    [ CK(c1), CK(c2) ] for {c1, CCO, c2}
    [ CK(c2), CK(c1) ] for {c2, c1} and {c1, c2} if normalized
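
A rough Python sketch of these collation keys follows. Everything
here is an assumption for illustration: CGJ is used as a stand-in
for the not-yet-existing CCO, ck() is a placeholder key (just the
code point, not a real collation weight), and meteg and patah are
sample values for c1 and c2.

```python
import unicodedata

CCO = "\u034F"   # stand-in: CGJ plays the role of the hypothetical CCO
C1 = "\u05BD"    # sample c1: HEBREW POINT METEG, combining class 22
C2 = "\u05B7"    # sample c2: HEBREW POINT PATAH, combining class 17


def ck(ch):
    """Placeholder collation key: the code point, not a real weight."""
    return ord(ch)


def collation_keys(marks):
    """Normalize first, then ignore CCO when building the key list."""
    normalized = unicodedata.normalize("NFD", marks)
    return [ck(ch) for ch in normalized if ch != CCO]


with_override = collation_keys(C1 + CCO + C2)   # order c1, c2 is kept
already_normal = collation_keys(C2 + C1)        # order c2, c1
unnormalized = collation_keys(C1 + C2)          # reordered to c2, c1
```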

Legacy applications will detect a separate combining sequence
starting at CCO, but newer applications will still know that both
sequences are describing a single grapheme cluster.

This knowledge should not be necessary except in grapheme
renderers, or in some input methods, which would handle:
    (1) keys <c2><c1> producing the normalized text {c2, c1}
         as before;
    (2) keys <c1><c2> producing the normalized text {c1, CCO, c2}
         instead of {c2, c1} as before;
    (3) optionally, a keystroke or selection system to swap
         combining characters.
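
Cases (1) and (2) could be sketched in Python as below. This is
only an illustration under assumptions: CGJ stands in for the
hypothetical CCO, OVERRIDE_ALLOWED is an invented list containing
meteg as an example, and type_mark is a made-up helper, not any
real input method API.

```python
import unicodedata

CCO = "\u034F"                 # stand-in: CGJ in the role of CCO
OVERRIDE_ALLOWED = {"\u05BD"}  # hypothetical list; meteg as an example


def type_mark(buffer, mark):
    """Append a typed combining mark and return the normalized text.

    CCO is inserted only when normalization would otherwise move
    the new mark in front of the previously typed one.
    """
    if buffer:
        prev = buffer[-1]
        cc_prev = unicodedata.combining(prev)
        cc_new = unicodedata.combining(mark)
        if cc_prev > cc_new > 0 and prev in OVERRIDE_ALLOWED:
            buffer += CCO
    return unicodedata.normalize("NFC", buffer + mark)


LAMED, C1, C2 = "\u05DC", "\u05BD", "\u05B7"
case1 = type_mark(type_mark(LAMED, C2), C1)   # keys <c2><c1>
case2 = type_mark(type_mark(LAMED, C1), C2)   # keys <c1><c2>
```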

If this is too complex, the only way to manage the situation is
to duplicate the existing combining characters that cause this
problem, and I think this may turn out even worse, as the
duplication may need to be combinatorial and require a lot of
new codepoint assignments.
