Re: Merging combining classes, was: New contribution N2676

From: Philippe Verdy
Date: Mon Oct 27 2003 - 12:16:59 CST

From: "Peter Kirk"
> I am not sure what you mean by "further normalization steps for Hebrew".

Of course I don't mean that NF* algorithms must be changed. See below.

> If this means that users will be expected to input Hebrew in this order,
> perhaps with a keyboard driver which inserts the necessary CGJs, this is
> good. But I don't think it is reasonable to expect software producers to
> add an extra layer to their software specifically for Hebrew, especially
> when now they are refusing to add such a layer with more general
> applicability when specifically required to do so in the standard.

What I mean is an optional, recommended way to encode the extra
CGJs if one wants to create text in logical order while still maintaining
an NF* normalized form.

The result of this extra step would of course not be canonically equivalent
to the original string, but the result would still be in canonical order,
and thus preserved later if an application just applies the NF* algorithms,
or also applies the optional steps.
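
As a minimal sketch of the point above, here is what happens with
Python's standard unicodedata module. It assumes CGJ (U+034F) is the
extra character; the choice of SHIN, SIN DOT and QAMATS is only an
illustration, not something taken from the proposal.

```python
import unicodedata

CGJ = "\u034F"      # COMBINING GRAPHEME JOINER, combining class 0
SHIN = "\u05E9"     # HEBREW LETTER SHIN, base character (class 0)
SIN_DOT = "\u05C2"  # HEBREW POINT SIN DOT, combining class 25
QAMATS = "\u05B8"   # HEBREW POINT QAMATS, combining class 18

logical = SHIN + SIN_DOT + QAMATS        # logical order, no CGJ
guarded = SHIN + SIN_DOT + CGJ + QAMATS  # same text with the extra CGJ


def nfd(s):
    return unicodedata.normalize("NFD", s)


# Without CGJ, canonical ordering moves QAMATS (18) before SIN DOT (25).
reordered = nfd(logical)

# With CGJ, the string is already in canonical order and is left alone.
preserved = nfd(guarded)

# The two strings are not canonically equivalent to each other.
equivalent = nfd(logical) == nfd(guarded)
```

Here `reordered` has the vowel first, `preserved` is identical to
`guarded`, and `equivalent` is False.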

I am not saying that CGJ is the way to go. I proposed a general encoding
scheme allowing existing scripts to better fit linguistic requirements, and
in which a canonical ordering override system could be used under strict
restrictions.

My opinion is that CGJ is currently not defined to require or
recommend these restrictions, and so either a new CCO control
or script-specific combining character holders used as bases could be
created (after all, this is what has long been done and standardized
in Brahmic scripts, which do not have this problem, except in Tibetan,
where such encoding problems still persist).

My feeling is that a per-script CCO control would be easier to handle
and understand, notably in BiDi contexts, or with scripts with complex
layouts like Tibetan, or some visual sign languages, and possibly
other still unencoded old Amerindian or Central Asian, Sumerian or
Phoenician scripts. Who knows?

The current model is based on a single encoding sequence to represent
a base character and a set of combining characters with ordered classes.
As we have seen, this model is probably too simplistic to cover all
languages and scripts, and maybe there's a need to describe combining
sequences in subgroups. With the current system, the only way to create
subgroups is to have them led by a base character with CC=0, even
if there's no such character in the script to represent.
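
The way a CC=0 character delimits subgroups can be sketched as follows.
This is only a simplified illustration of the canonical reordering step
(it does not decompose anything), and the function name
canonical_reorder is made up for this example.

```python
import unicodedata


def canonical_reorder(text):
    """Sort each run of non-zero-class marks by combining class.

    A character with combining class 0 closes the current run, so marks
    are never moved across it. sorted() is stable, so marks of equal
    class keep their relative order.
    """
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# SHIN, SIN DOT (class 25), QAMATS (class 18): one run, gets sorted.
one_group = canonical_reorder("\u05E9\u05C2\u05B8")

# Same marks split by CGJ (class 0): two runs, nothing moves.
two_groups = canonical_reorder("\u05E9\u05C2\u034F\u05B8")
```

For strings without decomposable characters, such as these, the result
should match what unicodedata.normalize("NFD", ...) returns.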

The notable exception to this model is the Hangul script, which was
standardized with this idea of encoding sequence subgroups (the LVT
model) to create more useful grapheme clusters made of several
combining sequences, each one belonging to a well-defined LVT type.
This model has worked, even though Korean also has a complex layout,
because some groups of double (SSANG) consonants have been assigned
distinct codepoints even though they are inherently made of multiple
jamos. This has multiplied the number of codepoints needed to represent
the Hangul script, but it was encoded this way in Unicode because of the
preexisting KSC standard which had made this simplification to limit
the complexity of layout renderers. As the script is not used for
any language other than modern Korean, this is not a problem, and so
Korean does not now need a system to override the combining
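
The LVT behaviour described above can be observed with Python's
unicodedata module. This is just an illustration; the syllable HAN
(U+D55C) and the jamo SSANGKIYEOK (U+1101) are arbitrary examples.

```python
import unicodedata

syllable = "\uD55C"  # HANGUL SYLLABLE HAN

# NFD splits the syllable into its L, V and T jamos.
jamos = unicodedata.normalize("NFD", syllable)

# NFC puts the same jamos back together into one syllable.
recomposed = unicodedata.normalize("NFC", jamos)

# Every jamo has combining class 0, so normalization never reorders them.
classes = [unicodedata.combining(j) for j in jamos]

# A double consonant has its own codepoint and no canonical
# decomposition into two single consonants.
ssang_kiyeok = "\u1101"  # HANGUL CHOSEONG SSANGKIYEOK
ssang_decomposition = unicodedata.decomposition(ssang_kiyeok)
```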

What I propose is not to reform the model: we keep the combining
sequences as they are, and the NF normalization. However, we give
the opportunity to group several combining sequences (normalized or
not) in an order that the NF normalizations will not alter (they will
only alter the individual sequences that make up the grapheme cluster).

Using a single CCO control character code (like CGJ) may cause
various implementation problems for renderers, notably in a text
with mixed scripts of varying directionality or layout. It may help
them if each script (or group of scripts with similar layout, like
Latin, Greek, Cyrillic) had its own CCO control with its own layout
rules. The only things these characters would have in common are
combining class 0 (they are used as base characters), no associated
glyph (they are controls), and the function of linking the combining
sequence in which they are used with the previous combining sequence
to create a single grapheme cluster (provided that their respective
scripts are compatible with this model).

Using a single CGJ for all scripts may cause problems in renderers
as it would require forward and backward parsing of the string to
determine if both combining sequences are compatible with each
other, in a way that fits the text layout engines. I don't say it's
impossible, but it may be harder to implement in these renderers.
This also applies to text searches, as CGJ carries absolutely no
script property by itself.

With this model, for example, it would be possible to create
Hebrew text in which logical order of letters is guaranteed.
(This may be a big benefit in applications like editors that
need to be able to work on individual vowel groups or
consonant groups.) It would facilitate the implementation
of input methods, and the resulting text could then be
handled internally with separate vowels, and later optimized
at save time, by removing unnecessary CCO controls. This gives
a lot more freedom to handle input/editing/rendering of text.

Anyway, the following Hebrew text:
    <base consonant><sin dot><CCO><vowel><accent>
can be parsed efficiently into these 2 combining sequences:
    <base consonant><sin dot>
    <CCO><vowel><accent>
and is easily entered by input methods that always generate a CCO
control before a vowel. The resulting (unoptimized) text is still
in NF form and can be sent to and retrieved from processes that
normalize strings on input, so the logical text is still preserved.
Then a simplification pass can remove "safely" the unnecessary
CCO controls (CGJ if that's what is used), provided that the
logical order is preserved (the CCO above will not be removed
as it would create a <sin dot><vowel> sequence that a
standard normalizer would reorder, possibly altering the meaning
of the entered text).
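
The input and simplification passes described above could be sketched
roughly like this. Everything here is an assumption made for the sake
of illustration: CGJ stands in for the CCO control, the vowel points
are taken to be U+05B0..U+05BB, and the names type_with_cco and
simplify are hypothetical.

```python
import unicodedata

CCO = "\u034F"  # assumption: CGJ stands in for the proposed CCO control
# Assumption: treat U+05B0..U+05BB as the Hebrew vowel points.
VOWELS = {chr(cp) for cp in range(0x05B0, 0x05BC)}


def nfd(s):
    return unicodedata.normalize("NFD", s)


def type_with_cco(keystrokes):
    """Hypothetical input method: emit a CCO before every vowel point."""
    return "".join(CCO + ch if ch in VOWELS else ch for ch in keystrokes)


def simplify(text):
    """Drop each CCO whose removal does not change the normalized order."""
    # Order of the real characters that normalization must keep.
    target = nfd(text).replace(CCO, "")
    i = 0
    while i < len(text):
        if text[i] == CCO:
            candidate = text[:i] + text[i + 1:]
            if nfd(candidate).replace(CCO, "") == target:
                text = candidate
                continue  # re-check whatever now sits at index i
        i += 1
    return text


# SHIN, SIN DOT, QAMATS: the CCO is needed, so simplify() keeps it.
needed = simplify(type_with_cco("\u05E9\u05C2\u05B8"))

# SHIN, QAMATS, SIN DOT: already in canonical order, the CCO is dropped.
dropped = simplify(type_with_cco("\u05E9\u05B8\u05C2"))
```

Note that CGJ has other uses (for example in collation), so a real
simplification pass would need more care than this sketch takes.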
