Re: Function of CGJ (was:Use of Invisible Characters and Inscrutable Sequencing)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 04 2006 - 09:01:29 CST


    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    > After a long digression on its ingenious uses to fix problems with combining
    > classes, the text for TUS 4.1.0 then returns to the original purpose of
    > affecting collation:
    >
    > 'The CGJ can also be used to prevent the formation of contractions in the
    > Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a
    > single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will
    > sort as a 'c' followed by an 'h'...'

    I see NO contradiction between these two uses. What they have in common is that they influence the collation of the surrounding characters, while CGJ itself is ignorable after the normalization step in UCA collation. In both uses it does not affect the rendering of the surrounding code points, but it may influence their visible order: Unicode states that canonically equivalent strings should have identical rendering, and the absence of CGJ creates equivalences that may be undesirable for both rendering and collation. Only the insertion of CGJ can make such strings distinct.

    Note that CGJ can be inconvenient in some cases, although the resulting strings are still valid Unicode. For example, the following three strings will be rendered identically, but will all be canonically different, and only two of them (the first and the third, since CGJ is ignored once normalization has been applied) will collate identically:
    <LATIN LETTER E, COMBINING ACUTE ACCENT, COMBINING DOT BELOW>
    <LATIN LETTER E, COMBINING ACUTE ACCENT, CGJ, COMBINING DOT BELOW>
    <LATIN LETTER E, COMBINING DOT BELOW, CGJ, COMBINING ACUTE ACCENT>
    In such cases CGJ should not be used, even though this corresponds exactly to the case of multiple diacritics with distinct non-zero combining classes (here, however, the diacritics do not really interact the way Hebrew vowel points do).
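
The claim about the three strings can be checked with a short Python sketch using only the standard `unicodedata` module. The helper `rough_collation_input` is hypothetical and is only a crude stand-in for the UCA (normalize, then drop CGJ as an ignorable); it is not a real collation implementation and assumes an untailored collation in which CGJ is completely ignorable.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

# The three strings from the message, with "e" as the base letter.
s1 = "e" + ACUTE + DOT_BELOW
s2 = "e" + ACUTE + CGJ + DOT_BELOW
s3 = "e" + DOT_BELOW + CGJ + ACUTE


def nfd(s):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", s)


def rough_collation_input(s):
    """Hypothetical helper: normalize first, then drop the ignorable CGJ.

    This only approximates what reaches the weighting step of the UCA.
    """
    return nfd(s).replace(CGJ, "")


nfd_forms = [nfd(s) for s in (s1, s2, s3)]
keys = [rough_collation_input(s) for s in (s1, s2, s3)]

# All three strings stay canonically distinct, because CGJ (class 0)
# blocks the reordering of the marks on either side of it.
print(len(set(nfd_forms)) == 3)

# Without CGJ, the marks of s1 are reordered: dot below moves before acute.
print(nfd(s1) == "e" + DOT_BELOW + ACUTE)

# After normalization and removal of CGJ, s1 and s3 coincide; s2 does not.
print(keys[0] == keys[2], keys[0] == keys[1])
```

In this approximation the first and third strings yield the same sequence, while the second keeps the acute accent before the dot below and so would differ at the accent level.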


