From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Feb 03 2006 - 15:13:26 CST
Philippe Verdy wrote on Friday, February 03, 2006 7:59 AM:
> 4) CGJ is used to disrupt sequences that might otherwise be treated as a
> unit in sorting. (This may not be an entirely honest summary of its
> function.)
If you look at http://www.unicode.org/versions/Unicode4.1.0/#OtherChanges
you will see that its original function was to join characters for the
purpose of sorting and that this was then reversed, so that the introductory
paragraph now reads:
'U+034F COMBINING GRAPHEME JOINER is used to affect the collation of
adjacent characters for purposes of language-sensitive collation and
searching, and to distinguish sequences that would otherwise be canonically
equivalent.'
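To illustrate the last clause of that paragraph, here is a minimal Python sketch using only the standard unicodedata module. The choice of marks (U+0307 DOT ABOVE, class 230, and U+0323 DOT BELOW, class 220) and the helper name nfd are my own, not taken from TUS. Canonical ordering sorts adjacent marks by combining class, and CGJ, having class 0, stops the reordering from crossing it.

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER, canonical combining class 0


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Dot above (class 230) typed before dot below (class 220).
plain = "a\u0307\u0323"
# The same marks with CGJ between them.
with_cgj = "a\u0307" + CGJ + "\u0323"

# Without CGJ the two orders are canonically equivalent:
# normalization moves the lower-class mark first.
print(nfd(plain) == nfd("a\u0323\u0307"))

# With CGJ the marks are not reordered, so the typed order survives.
print(nfd(with_cgj) == with_cgj)
```

Both comparisons should report True on any Python whose Unicode data includes U+034F.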
After a long digression on its ingenious uses to fix problems with combining
classes, the text for TUS 4.1.0 then returns to the original purpose of
affecting collation:
'The CGJ can also be used to prevent the formation of contractions in the
Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a
single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will
sort as a 'c' followed by an 'h'...'
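The effect on contractions can be sketched with a toy collator in Python. This is not the Unicode Collation Algorithm: it has a single weight level, handles lowercase ASCII only, and the names WEIGHT, CONTRACTIONS, collation_elements and sort_key are hypothetical. It assumes a Slovak-like order in which "ch" is one unit sorting after "h", and treats CGJ as contributing no weight while still breaking up the match.

```python
import string

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Toy alphabet: a..h, then the contraction "ch", then i..z.
letters = list(string.ascii_lowercase)
letters.insert(letters.index("h") + 1, "ch")
WEIGHT = {unit: rank for rank, unit in enumerate(letters, start=1)}
CONTRACTIONS = ("ch",)  # all two characters long in this sketch


def collation_elements(text):
    """Split text into sorting units, longest match first."""
    units = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            i += 1  # no weight of its own, but it has split "c" from "h"
        elif text.startswith(CONTRACTIONS, i):
            units.append(text[i:i + 2])
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def sort_key(text):
    """Map each sorting unit to its weight in the toy alphabet."""
    return [WEIGHT[unit] for unit in collation_elements(text)]


words = ["chata", "hora", "c" + CGJ + "hata", "cesta"]
print([collation_elements(w) for w in words])
print(sorted(words, key=sort_key))
```

Under these assumptions "chata" sorts after "hora", because its first unit is "ch", while the CGJ variant sorts among the c-words.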
The original intent of CGJ was to mark such units, but in cases like this
the sequence of visible characters will form a unit more often than not,
so it was more economical to reverse its role and use it to mark the less
frequent case where the sequence does not form a unit. It's rather like
the linguistic case of distinguishing stop plus fricative from affricate -
if the affricate exists in a language, it is the sequence of stop plus
fricative that is 'marked', not the affricate, though the natural impulse is
to mark the affricate as unusual, e.g. by slurs in IPA.