Re: CGJ for variant combining marks, in Hebrew as well as German?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jul 23 2004 - 07:49:10 CDT

    From: "Peter Kirk" <peterkirk@qaya.org>
    > On 17/07/2004 14:33, Philippe Verdy wrote:
    > >My opinion would be to place an invisible combining character after the
    > >diaeresis or umlaut or precomposed letter, ignorable by default in UCA but
    > >which could be tailored for German. However the UTC has a preference for
    > >CGJ+diacritic.
    > >
    > >I am not sure that CGJ is made to allow such distinctions, and I would
    > >prefer a Combining Variation Selector, with combining class 0 like CGJ,
    > >encoded after the diacritic or precomposed letter that uses it (for
    > >precomposed letters, its semantic would alter the last diacritic coded in
    > >the normalized decomposed form, to preserve normalization).
    >
    > Philippe, I agree with you that such Combining Variation Selectors would
    > be the best way to deal with umlaut/tréma and also with variants of
    > HOLAM, METEG, QAMATS etc. Actually, the existing variation selectors
    > could be used, given that they are already combining characters in
    > combining class 0.

    I know that, but the existing variation selectors are extremely restricted
    in Unicode: they are usable only in predefined character pairs, where they
    encode only glyph variations rather than variations in semantics. A semantic
    variation is what would be needed for German umlauts (an implied
    supplementary vowel e for collation, treated at the primary level) or for the
    German diaeresis/tréma (the vowel-isolating mark, rare in German but frequent
    in French, to be treated at the secondary collation level, and for which a
    combining variation would be needed), or in Hebrew for specific distinctions
    (dagesh, vav, ...).

    But this use of the existing VSn would cause complications with the UCA
    algorithm, which is already tuned to ignore them ALL, and because they are
    currently used only with base characters.

    Using CGJ after a combining mark is really a hack, needed only to preserve
    the order of diacritics in special cases where normalization would break
    the sequence order (because the normalization algorithm forces the
    reordering of combining marks with distinct non-zero combining classes). So
    I see CGJ only as a way to bypass the default combining class and force a
    diacritic to be treated as if it were of combining class 0, i.e. to mark
    explicitly that reordering must not be applied during normalization. This
    use of CGJ should NEVER carry any semantics, other than forcing the relative
    ordering of combining marks in a combining sequence.
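
    To illustrate the reordering described above, here is a minimal sketch
    using Python's standard unicodedata module. The choice of LAMED with
    PATAH (class 17) and HIRIQ (class 14) is only an illustrative assumption;
    any two marks with distinct non-zero combining classes would behave the
    same way.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED, combining class 0
PATAH = "\u05B7"   # HEBREW POINT PATAH, combining class 17
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ, combining class 14
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER, combining class 0

# Intended order: PATAH first, then HIRIQ.
plain = LAMED + PATAH + HIRIQ
# Same sequence, with CGJ placed between the two points.
guarded = LAMED + PATAH + CGJ + HIRIQ

nfd_plain = unicodedata.normalize("NFD", plain)
nfd_guarded = unicodedata.normalize("NFD", guarded)

def names(s):
    """Return the Unicode character names of a string, for readable output."""
    return [unicodedata.name(c) for c in s]

# Without CGJ the two points are swapped into ascending class order;
# with CGJ the original order survives normalization.
print(names(nfd_plain))
print(names(nfd_guarded))
```

    The class-0 CGJ acts as a barrier: canonical reordering only swaps
    adjacent marks with non-zero classes, so nothing is moved across it.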

    That's why I really dislike the idea of allowing a semantic distinction
    between diacritic and diacritic+CGJ: this distinction would not be usable in
    all cases where a document must be encoded with a forced relative order of
    diacritics. A good text preparer working before normalization should be able
    to detect the cases where CGJ is needed and should be automatically added,
    and the cases where a CGJ is superfluous and could be removed (because no
    other diacritic with a lower non-zero combining class is placed after the
    current diacritic with a non-zero combining class).
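
    Such a text preparer could be sketched roughly as follows. The function
    names add_needed_cgj and strip_superfluous_cgj are hypothetical, and the
    sketch assumes the input is already fully decomposed and that a CGJ
    following a combining mark carries no meaning of its own (which is the
    position argued here, not something the standard guarantees).

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

def add_needed_cgj(text):
    """Insert CGJ wherever canonical reordering would swap two marks.

    Two adjacent characters are swapped by normalization when the first has
    a higher combining class than the second and the second is non-zero.
    """
    out = []
    for ch in text:
        if out:
            prev = unicodedata.combining(out[-1])
            cur = unicodedata.combining(ch)
            if cur != 0 and prev > cur:
                out.append(CGJ)
        out.append(ch)
    return "".join(out)

def strip_superfluous_cgj(text):
    """Drop a CGJ that follows a combining mark but blocks no reordering.

    A CGJ after a class-0 character (e.g. between two base letters) is left
    alone, since it may serve other purposes there.
    """
    out = []
    for i, ch in enumerate(text):
        if ch == CGJ and out:
            prev = unicodedata.combining(out[-1])
            nxt = unicodedata.combining(text[i + 1]) if i + 1 < len(text) else 0
            if prev != 0 and not (0 < nxt < prev):
                continue  # nothing would be reordered across this CGJ
        out.append(ch)
    return "".join(out)

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"  # classes 0, 17, 14

protected = add_needed_cgj(LAMED + PATAH + HIRIQ)
cleaned = strip_superfluous_cgj(LAMED + HIRIQ + CGJ + PATAH)
print([f"U+{ord(c):04X}" for c in protected])
print([f"U+{ord(c):04X}" for c in cleaned])
```

    Note that if diacritic and diacritic+CGJ were given distinct meanings,
    neither function would be safe to run, which is the objection raised
    above.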

    The only case where I would accept a distinction between diacritic and
    diacritic+CGJ is the case of combining diacritics with combining class 0,
    for which CGJ would have no function. But this is not the case of German
    umlauts/trémas, and not the case of Hebrew combining marks.
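
    The combining classes involved can be checked directly. The following
    sketch queries Python's unicodedata module for the marks mentioned in
    this thread; the selection of characters is mine, for illustration only.

```python
import unicodedata

# Characters discussed in this thread, keyed by their Unicode names.
CHARS = [
    "\u0308",  # COMBINING DIAERESIS (umlaut / tréma)
    "\u05B8",  # HEBREW POINT QAMATS
    "\u05B9",  # HEBREW POINT HOLAM
    "\u05BC",  # HEBREW POINT DAGESH OR MAPIQ
    "\u05BD",  # HEBREW POINT METEG
    "\u034F",  # COMBINING GRAPHEME JOINER
    "\uFE00",  # VARIATION SELECTOR-1
]

# Map each character name to its canonical combining class.
classes = {unicodedata.name(ch): unicodedata.combining(ch) for ch in CHARS}

for name, ccc in classes.items():
    print(f"{name:32} ccc={ccc}")
```

    Only CGJ and the variation selector have class 0; the diaeresis and the
    Hebrew points all have non-zero classes, so they are subject to
    reordering.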

    > complication in normalisation.

    Only against using VSn for something other than base characters. But why not
    new CVSn characters that could be used freely, unlike the existing VSn, which
    can't be used outside the restricted set of base+VSn pairs that encode glyph
    variations of the same semantic character and are listed in the Unicode
    standard?

    > And, maybe for that reason, the UTC seems dead set
    > against it.
    > So we have to look for other solutions, according to what
    > the UTC has already accepted into Unicode 4.0.1.

    The only viable solution is to create separate CVSn characters, and forget
    the proposed CGJ kludge...
    When I say that they could be used freely, I mean by private agreement; but
    unlike PUA characters, they would be treated directly as combining characters
    and would not break default grapheme clusters. They could be parsed together
    with the combining character that they modify, and any implementation that
    does not know how to treat the pair could be allowed to render it as the
    single combining mark without the CVSn.

    The pair would be unbreakable, and would itself have combining class 0;
    this means that CVSn+diacritic1 followed by another diacritic2 may need to
    be separated by a CGJ to avoid reordering to CVSn+diacritic2+diacritic1
    during normalization.

    This won't violate the existing rules for VSn characters.
