Function of CGJ (was:Use of Invisible Characters and Inscrutable Sequencing)

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Feb 03 2006 - 15:13:26 CST

Next message: Edward Trager: "ANNOUNCEMENT: "Unicode Font Guide For Free/Libre Open Source Operating Systems" HAS MOVED TO NEW UNIFONT.ORG WEBSITE"

Previous message: Rick McGowan: "Announcement: CLDR 1.4 Data Submission Period Now Starting"
Next in thread: Philippe Verdy: "Re: Function of CGJ (was:Use of Invisible Characters and Inscrutable Sequencing)"
Reply: Philippe Verdy: "Re: Function of CGJ (was:Use of Invisible Characters and Inscrutable Sequencing)"

Philippe Verdy wrote on Friday, February 03, 2006 7:59 AM:

> 4) CGJ is used to disrupt sequences that might otherwise be treated as a
> unit in sorting. (This may not be an entirely honest summary of its
> function.)

If you look at http://www.unicode.org/versions/Unicode4.1.0/#OtherChanges
you will see that its original function was to join characters for the
purpose of sorting and that this was then reversed, so that the introductory
paragraph now reads:

'U+034F COMBINING GRAPHEME JOINER is used to affect the collation of
adjacent characters for purposes of language-sensitive collation and
searching, and to distinguish sequences that would otherwise be canonically
equivalent.'
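
To make the second half of that sentence concrete, here is a minimal sketch using Python's standard unicodedata module. The choice of base letter and marks is mine, not taken from TUS: an acute (combining class 230) typed before a dot below (class 220) is reordered by canonical ordering, whereas a CGJ (class 0) placed between them should block the reordering.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER, combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

# Illustrative sequences (my own choice, not from the standard's text).
plain = "a" + ACUTE + DOT_BELOW
with_cgj = "a" + ACUTE + CGJ + DOT_BELOW

# Without CGJ, canonical ordering moves the dot below ahead of the acute.
reordered = unicodedata.normalize("NFD", plain)

# With CGJ between them, the two marks keep the order in which they were typed.
blocked = unicodedata.normalize("NFD", with_cgj)

print([f"U+{ord(c):04X}" for c in reordered])
print([f"U+{ord(c):04X}" for c in blocked])
```

So <a, acute, dot below> and <a, dot below, acute> normalize to the same string, while the sequence with CGJ stays distinct from both.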

After a long digression on its ingenious uses to fix problems with combining
classes, the text for TUS 4.1.0 then returns to the original purpose of
affecting collation:

'The CGJ can also be used to prevent the formation of contractions in the
Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a
single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will
sort as a 'c' followed by an 'h'...'
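
A toy illustration of that behaviour follows. It is not the Unicode Collation Algorithm, only a sketch under my own assumptions: a hypothetical lowercase alphabet in which "ch" is one letter ordered after "h", a hypothetical function toy_key that builds sort keys, and made-up test strings. CGJ contributes no weight of its own, but because it sits between 'c' and 'h' the contraction never matches.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Hypothetical alphabet: "ch" is a single letter sorted after "h", as in Slovak.
ALPHABET = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
WEIGHT = {letter: i for i, letter in enumerate(ALPHABET)}


def toy_key(word):
    """Return a list of weights for a lowercase ASCII word (toy model only)."""
    key = []
    i = 0
    while i < len(word):
        if word[i] == CGJ:
            # CGJ itself is ignored, but it has already kept "c" and "h" apart.
            i += 1
        elif word.startswith("ch", i):
            # Contraction: "ch" is weighted as one unit.
            key.append(WEIGHT["ch"])
            i += 2
        else:
            key.append(WEIGHT[word[i]])
            i += 1
    return key


words = ["hora", "chata", "c" + CGJ + "hata"]
ordered = sorted(words, key=toy_key)
print([w.replace(CGJ, "<CGJ>") for w in ordered])
```

In this toy ordering "chata" sorts after "hora" because "ch" follows "h", whereas the string with CGJ sorts before both, as plain 'c' followed by 'h'.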

The original intent of CGJ was to mark such units, but in cases like this
the sequence of visible characters will form a unit more often than not,
so it was more economical to reverse its role and use it to mark the less
frequent case where the sequence did not form a cluster. It's rather like
the linguistic case of distinguishing stop plus fricative from affricate -
if the affricate exists in a language, it is the sequence of stop plus
fricative that is 'marked', not the affricate, though the natural impulse is
to mark the affricate as unusual, e.g. by slurs in IPA.

This archive was generated by hypermail 2.1.5 : Fri Feb 03 2006 - 15:15:57 CST