Re: CGJ for variant combining marks, in Hebrew as well as German?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jul 23 2004 - 07:49:10 CDT

Next message: Philipp Reichmuth: "Re: Umlaut and Tréma, was: Variation selectors and vowel marks"

Previous message: Alain LaBonté: "Re: Much better Latin-1 keyboard for Windows"

From: "Peter Kirk" <peterkirk@qaya.org>
> On 17/07/2004 14:33, Philippe Verdy wrote:
> >My opinion would be to place an invisible combining character after the
> >diaeresis or umlaut or precomposed letter, ignorable by default in UCA but
> >which could be tailored for German. However the UTC has a preference for
> >CGJ+diacritic.
> >
> >I am not sure that CGJ is made to allow such distinctions, and I would
> >prefer a Combining Variation Selector, with combining class 0 like CGJ,
> >encoded after the diacritic or precomposed letter that uses it (for
> >precomposed letters, its semantic would alter the last diacritic coded in
> >the normalized decomposed form, to preserve normalization).
>
> Philippe, I agree with you that such Combining Variation Selectors would
> be the best way to deal with umlaut/tréma and also with variants of
> HOLAM, METEG, QAMATS etc. Actually, the existing variation selectors
> could be used, given that they are already combining characters in
> combining class 0.

I know that, but the existing variation selectors are extremely restricted
in Unicode: they are usable only in predefined character pairs, where they
encode only glyph variations rather than variations in semantics. A semantic
variation is what would be needed for German umlauts (implied supplementary
vowel e for collation, treated at the primary level) or the German
diaeresis/tréma (the vowel isolator mark, rare in German but frequent in
French, to be treated at the secondary collation level, and for which a
combining variation would be needed), or in Hebrew for specific distinctions
(dagesh, vav, ...).

But this use of the existing VSn would cause complications with the UCA
algorithm, which is already tuned to ignore them ALL, and because they are
currently used only after base characters.

Using CGJ after a combining mark is really a hack, needed only to preserve
the order of diacritics in special cases where normalization would break
the sequence order (because the normalization algorithm forces the
reordering of combining marks with distinct non-zero combining classes). So
I see CGJ only as a way to bypass the default combining class and force a
diacritic to be treated as if it were of combining class 0, i.e. to mark
explicitly that reordering must not be applied during normalization. This
use of CGJ should NEVER carry any semantics, other than forcing the relative
ordering of combining marks in a combining sequence.
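
The reordering described above can be reproduced with Python's standard
unicodedata module. This is only a small sketch: the two marks (COMBINING
DIAERESIS and COMBINING DOT BELOW) are chosen for illustration and are not
taken from the German or Hebrew cases under discussion.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def classes(text):
    """Canonical combining class of every code point in text."""
    return [unicodedata.combining(ch) for ch in text]


# Two marks with distinct non-zero classes, typed in "wrong" canonical order.
plain = "a" + DIAERESIS + DOT_BELOW
# The same sequence with CGJ inserted between the two marks.
guarded = "a" + DIAERESIS + CGJ + DOT_BELOW

reordered = unicodedata.normalize("NFD", plain)
kept = unicodedata.normalize("NFD", guarded)

print(classes(plain), [hex(ord(c)) for c in reordered])
print(classes(guarded), [hex(ord(c)) for c in kept])
```

Without CGJ the lower-class mark is moved in front of the diaeresis; with
CGJ in between, the typed order survives normalization.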

That's why I really dislike the idea of allowing a semantic distinction
between diacritic and diacritic+CGJ: this distinction would not be usable in
all cases where a document must be encoded with a forced relative order of
diacritics. A good text preparer working before normalization should be able
to detect the cases where CGJ is needed and should be automatically added,
and the cases where a CGJ is superfluous and could be removed (because there
is no other diacritic with a lower non-zero combining class placed after the
current diacritic with a non-zero combining class).
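
A minimal sketch of such a text preparer, assuming Python's unicodedata
module. The function names add_needed_cgj, mark_run and
strip_superfluous_cgj are hypothetical, invented for this illustration, and
the rule for "superfluous" is one reading of the paragraph above, not an
established algorithm.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

ccc = unicodedata.combining


def add_needed_cgj(text):
    """Insert CGJ between adjacent marks that normalization would swap."""
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        nxt = ccc(text[i + 1]) if i + 1 < len(text) else 0
        # A swap happens only when a lower non-zero class follows a higher one.
        if nxt != 0 and ccc(ch) > nxt:
            out.append(CGJ)
    return "".join(out)


def mark_run(text, start, step):
    """Classes of the consecutive non-zero-class marks from start, moving by step."""
    run = []
    i = start
    while 0 <= i < len(text) and ccc(text[i]) != 0:
        run.append(ccc(text[i]))
        i += step
    return run


def strip_superfluous_cgj(text):
    """Drop a CGJ that follows a mark but protects nothing from reordering."""
    out = []
    for i, ch in enumerate(text):
        if ch == CGJ:
            before = mark_run(text, i - 1, -1)
            after = mark_run(text, i + 1, +1)
            needed = bool(before and after) and max(before) > min(after)
            # A CGJ that does not follow a non-zero-class mark is kept,
            # since it may be there for some other purpose.
            if before and not needed:
                continue
        out.append(ch)
    return "".join(out)
```

For example, add_needed_cgj("a\u0308\u0323") inserts a CGJ between the two
marks, while strip_superfluous_cgj("a\u0323\u034F\u0308") removes the CGJ
because the marks are already in canonical order.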

The only case where I would accept a distinction between diacritic and
diacritic+CGJ is the case of combining diacritics with combining class 0,
for which CGJ would have no function. But this is not the case of German
umlauts/trémas, and not the case of Hebrew combining marks.

> complication in normalisation.

Only against using VSn for something other than base characters. But why not
new CVSn characters that could be used freely, unlike the existing VSn, which
can't be used outside the restricted set of base+VSn pairs that encode glyph
variations of the same semantic character and are listed in the Unicode
standard?

> And, maybe for that reason, the UTC seems dead set
> against it.
> So we have to look for other solutions, according to what
> the UTC has already accepted into Unicode 4.0.1.

The only viable solution is to create separate CVSn characters, and forget
the proposed CGJ kludge...
When I say that they could be used freely, I mean by private agreement. But
unlike PUAs, they would be treated directly as combining characters and
would not break default grapheme clusters, they could be parsed together
with the combining character that they modify, and any implementation
that does not know how to treat the pair would be allowed to render it
as the single combining mark without the CVSn.

The pair would be unbreakable, and would have itself a combining class 0;
this means that CVSn+diacritic1 followed by another diacritic2 may need to
be separated by a CGJ to avoid reordering to CVSn+diacritic2+diacritic1
during normalization.
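
Since no CVSn character exists, the following sketch uses VARIATION
SELECTOR-1 (U+FE00) purely as a stand-in, only because it also has combining
class 0. That substitution, and the choice of the two marks, are assumptions
made for illustration; "a" followed by U+FE00 is not a standardized variation
sequence.

```python
import unicodedata

STAND_IN_CVS = "\uFE00"  # VARIATION SELECTOR-1, placeholder for a class-0 CVSn
CGJ = "\u034F"           # COMBINING GRAPHEME JOINER, combining class 0
DIAERESIS = "\u0308"     # diacritic1 in this sketch, combining class 230
DOT_BELOW = "\u0323"     # diacritic2 in this sketch, combining class 220

# Selector + diacritic1 + diacritic2, with nothing between the two marks.
unguarded = "a" + STAND_IN_CVS + DIAERESIS + DOT_BELOW
# The same sequence with CGJ separating diacritic1 from diacritic2.
guarded = "a" + STAND_IN_CVS + DIAERESIS + CGJ + DOT_BELOW

unguarded_nfd = unicodedata.normalize("NFD", unguarded)
guarded_nfd = unicodedata.normalize("NFD", guarded)

print([hex(ord(c)) for c in unguarded_nfd])
print([hex(ord(c)) for c in guarded_nfd])
```

In the unguarded sequence the two marks after the selector are swapped by
normalization; with CGJ between them the order is kept.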

This won't violate the existing rules for VSn characters.


This archive was generated by hypermail 2.1.5 : Fri Jul 23 2004 - 07:50:01 CDT