Function of CGJ (was:Use of Invisible Characters and Inscrutable Sequencing)

From: Richard Wordingham
Date: Fri Feb 03 2006 - 15:13:26 CST

    Philippe Verdy wrote on Friday, February 03, 2006 7:59 AM:

    > 4) CGJ is used to disrupt sequences that might otherwise be treated as a
    > unit in sorting. (This may not be an entirely honest summary of its
    > function.)

    If you look at
    you will see that its original function was to join characters for the
    purpose of sorting and that this was then reversed, so that the introductory
    paragraph now reads:

    'U+034F COMBINING GRAPHEME JOINER is used to affect the collation of
    adjacent characters for purposes of language-sensitive collation and
    searching, and to distinguish sequences that would otherwise be canonically
    equivalent.'
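
    As a rough illustration of the second half of that sentence, here is a
    minimal Python sketch using only the standard unicodedata module. The
    choice of marks (acute accent and dot below on 'a') is mine, picked only
    because their combining classes differ; it is not an example taken from
    the standard. CGJ has combining class 0, so it stops the two marks from
    being reordered under normalization:

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER, canonical combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

# Illustrative sequences (an assumption for this sketch, not from the standard).
plain = "a" + ACUTE + DOT_BELOW
blocked = "a" + ACUTE + CGJ + DOT_BELOW

# Canonical ordering sorts adjacent marks with non-zero classes by class,
# so the dot below (220) moves in front of the acute (230).
nfd_plain = unicodedata.normalize("NFD", plain)

# A class-0 character between the marks ends the reorderable run, so the
# sequence containing CGJ comes through normalization unchanged.
nfd_blocked = unicodedata.normalize("NFD", blocked)

for label, text in (("plain", nfd_plain), ("blocked", nfd_blocked)):
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

    Without the CGJ, <a, acute, dot below> and <a, dot below, acute> are
    canonically equivalent and normalize to the same string; with it, the
    original order of the marks is preserved.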

    After a long digression on its ingenious uses to fix problems with combining
    classes, the text for TUS 4.1.0 then returns to the original purpose of
    affecting collation:

    'The CGJ can also be used to prevent the formation of contractions in the
    Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a
    single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will
    sort as a 'c' followed by an 'h'...'
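
    The effect can be sketched with a toy sort key in Python. This is not
    the Unicode Collation Algorithm, only a single-level imitation of it:
    the weight table, the sort_key function and the sample words are all
    made up for illustration, and only lowercase ASCII letters are handled.
    "ch" is given its own weight after "h", and CGJ is skipped as ignorable,
    but only after it has already kept 'c' and 'h' from matching as a pair:

```python
import string

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

# Hypothetical tailoring: the digraph "ch" is one collation element after "h".
_elements = list(string.ascii_lowercase)
_elements.insert(_elements.index("h") + 1, "ch")
WEIGHT = {element: rank for rank, element in enumerate(_elements, start=1)}


def sort_key(text):
    """Return a tuple of toy primary weights for a lowercase ASCII word."""
    key = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            # Ignorable: contributes no weight of its own. Its presence in
            # the string is what keeps "c" and "h" from being adjacent.
            i += 1
        elif text.startswith("ch", i):
            # Contraction: two characters, one weight.
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT[text[i]])
            i += 1
    return tuple(key)


words = ["chata", "c" + CGJ + "hata", "hora", "cesta", "dom"]
ordered = sorted(words, key=sort_key)
print([w.replace(CGJ, "<CGJ>") for w in ordered])
```

    With these toy weights, "chata" sorts after "hora" because its first
    element is the digraph, while <c, CGJ, h, a, t, a> sorts among the
    c-words, between "cesta" and "dom".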

    The original intent of CGJ was to mark such units, but in cases like this
    the sequence of visible characters will form a unit more often than not,
    so it was more economical to reverse its role and use it to mark the less
    frequent case where the sequence did not form a cluster. It's rather like
    the linguistic case of distinguishing stop plus fricative from affricate -
    if the affricate exists in a language, it is the sequence of stop plus
    fricative that is 'marked', not the affricate, though the natural impulse is
    to mark the affricate as unusual, e.g. by slurs in IPA.
