From: Philippe Verdy (
Date: Wed Mar 12 2008 - 05:01:49 CST

    Karl Pentzlin wrote:
    > Following the description in p.542 of TUS 5.0, the CGJ (i.e.
    > U+034F COMBINING GRAPHEME JOINER) separates graphemes, e.g.
    > in Slovak, it prevents a "ch" to be interpreted as a grapheme.
    > Thus, the CGJ splits or separates, but does not "join" in any case.

    CGJ joins combining characters that would otherwise be part of separate
    combining sequences, because its combining class is zero. This
    zero-combining class is the interesting feature of CGJ because it allows the
    canonical reordering to preserve the relative order of combining accents. It
    is effectively used as a separator, but only for the purpose of delimiting
    reorderable sequences during normalization.
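
    As a rough illustration of this reordering behaviour, here is a minimal
    Python sketch using the standard unicodedata module. The choice of accents
    (acute above, dot below) and the variable names are only assumptions made
    for the example; any two marks with different non-zero combining classes
    should behave the same way.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Without CGJ, canonical reordering sorts the marks by combining class,
# so the dot below (220) moves in front of the acute (230).
plain = "a" + ACUTE + DOT_BELOW
reordered = nfd(plain)

# With CGJ in between, each mark sits in its own reorderable run,
# so the original order survives normalization.
guarded = "a" + ACUTE + CGJ + DOT_BELOW
preserved = nfd(guarded)

print([f"U+{ord(ch):04X}" for ch in reordered])
print([f"U+{ord(ch):04X}" for ch in preserved])
```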

    However it still has its own identity, and thus a base character followed by
    any number of combining characters or CGJ is not equivalent to the base
    character alone. So in Slovak or any other language, C + CGJ would be a
    default grapheme cluster, separated from the H that is encoded after it.
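
    To make the "own identity" point concrete, here is a small Python sketch
    (standard unicodedata module only). It checks that CGJ is kept by the four
    normalization forms, and that a naive substring search for "ch" no longer
    matches. The substring search is just a stand-in for whatever digraph
    detection a real process would use; that part is an assumption.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

marked = "c" + CGJ + "h"

# Normalize the marked sequence in all four standard forms.
forms = {
    form: unicodedata.normalize(form, marked)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

# CGJ has no decomposition, so it should survive every form.
survives = all(CGJ in value for value in forms.values())

# A naive digraph test (plain substring search) does not see "ch" any more.
looks_like_digraph = "ch" in marked

# C + CGJ alone is a different string from C, even after normalization.
same_as_bare_c = unicodedata.normalize("NFC", "c" + CGJ) == "c"

print(survives, looks_like_digraph, same_as_bare_c)
```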

    CGJ is not used there to "separate" the two sequences. In fact Slovak in
    your example considers that a C followed by an H is a single letter, but it
    does not "say" anything about C+CGJ, which is a grapheme cluster very
    distinct from C; it is only for that reason that it prevents the
    *semantic* interpretation of the sequence as a "CH" digraph.

    But even in this case, it does not prevent the possible formation of a
    ligature, or kerning, or any contextual forms in highly decorated font
    styles, or cursive linking. For this reason, I do think that preventing the
    interpretation of a digraph should really not use CGJ as a distinctive
    encoding of the first letter of a candidate digraph; I'd rather use a
    separate disjoiner between C and H, in order to preserve the semantics of
    the first C.

    Notably, your CGJ does not separate words, and it does not prevent
    hyphenation (unlike digraphs where hyphenation would be preferably avoided
    in the middle):

    I would encode <C,SHY,H> for example if hyphenation is suggested (for
    example when C and H are part of distinct syllables, something that could
    happen in many languages permitting compound words and/or agglutination or
    prefixes/suffixes), or <C,WJ,H> if this is a basic separation between the
    two preserved grapheme clusters <C> and <H> that does not introduce a word
    break.
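
    For reference, the sequences discussed here can be spelled out in Python as
    below. This is only a sketch for inspecting the code points; the helper
    name describe() is made up for the example. It also shows the general
    categories: SHY and WJ are format characters (Cf), whereas CGJ is a
    nonspacing mark (Mn) that attaches to the preceding C.

```python
import unicodedata

SHY = "\u00ad"  # SOFT HYPHEN
WJ = "\u2060"   # WORD JOINER
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def describe(text):
    """List each code point of text as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# <C,SHY,H>: hyphenation is suggested between the two letters.
hyphenation_hint = "c" + SHY + "h"

# <C,WJ,H>: the two letters stay separate clusters, with no break between them.
no_break = "c" + WJ + "h"

for sequence in (hyphenation_hint, no_break):
    print(describe(sequence))

# General categories of the three candidate separators.
categories = {name: unicodedata.category(ch)
              for name, ch in (("SHY", SHY), ("WJ", WJ), ("CGJ", CGJ))}
print(categories)
```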

    Be warned when handling texts in languages that treat pairs of letters as
    digraphs, as if they were a single letter; there are almost always many
    exceptions. It would be preferable to use an explicit digraph joiner to mark
    the letter pairs, but this is almost never encoded, because such digraphs
    occur so frequently in the languages where they are defined or viewed as
    single letters.

    But then tweaking the other exceptions by transforming the first letter of
    candidate digraphs and appending a CGJ to it looks like a severe tweak: it
    breaks the semantics if you do that on the final letter of a component
    agglutinated/compounded with a next element whose initial letter may create
    an undesired digraph opportunity.

    Can you give examples in Slovak where CGJ is really needed between C and H
    to avoid the interpretation as a digraph? I've seen many more examples where
    it was not CGJ but SHY (and not just in Slovak). It looks like this
    "interpretation problem" only happens in languages that sort digraphs
    differently in their tailored collation. In most cases, collation ordering
    is not specified or needed, and the encoding is left transparent, in order
    to preserve the orthography and semantics of encoded morphemes (including
    within compound words, or with prefixes, suffixes, infixes).
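
    To show where the digraph interpretation actually matters, here is a toy
    Python sort key in which "ch" is one collation unit placed after "h", as in
    the Slovak alphabet. It is a deliberately simplified sketch and not the
    Unicode Collation Algorithm: the alphabet table, the function name
    toy_key() and the sample words are all assumptions made for illustration,
    and diacritics are not handled at all.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Toy alphabet: basic Latin letters with the digraph "ch" placed after "h".
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: index for index, unit in enumerate(ALPHABET)}


def toy_key(word):
    """Build a sort key where an adjacent 'c' + 'h' counts as one unit."""
    text = word.lower()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            key.append(RANK["ch"])  # contraction: one unit for the digraph
            i += 2
        elif text[i] == CGJ:
            i += 1  # no weight of its own; it has already blocked the match
        else:
            key.append(RANK.get(text[i], len(ALPHABET)))  # unknowns sort last
            i += 1
    return key


words = ["chata", "cena", "hrad", "idea"]
tailored = sorted(words, key=toy_key)

# The same spelling with CGJ after the C sorts as plain C followed by H.
blocked = "c" + CGJ + "hata"
with_blocked = sorted(["hrad", "chata", blocked, "cena"], key=toy_key)

print(tailored)
print([w.replace(CGJ, "<CGJ>") for w in with_blocked])
```

    Outside of such a tailored ordering, nothing in this sketch would treat the
    two spellings differently, which is the point made above.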

    Due to the increasing use of borrowed words, many languages have abandoned
    the distinction of digraphs like CH, removed them from their "alphabet",
    and now recognize morphemes only lexically: if this creates a real
    ambiguity, an explicit hyphen may be written to make the distinction with
    the interpretation as a single digraph.
