L2/04-304

Source: Mark Davis
Subject: CGJ in German
Date: July 23, 2004

The recommended use of CGJ in German has problems. For context, let's start with a bit of history.

The original proposal, long ago, was for a pair of characters, a CGJ and a CGNJ. In that proposal, the characters were to have been used to control both visual units and processing units, especially collation. For visual units, the joiner is the more useful and the non-joiner less so. For collation units, it is the reverse: the non-joiner is the more useful, and the joiner less so.

During the incorporation and refinement of this proposal, we ended up dropping the non-joiner character. We also concluded that controlling visual units was too problematic, and stripped that away. So, ironically, we ended up adopting the character of the pair most useful for visual units, but then stripping the semantics down to only covering collation, where it is not as useful as the character we dropped!

In collation, the non-joiner is the more useful, because it would be used to *prevent* two characters from being considered as a unit. That allows people to mark the exceptional cases, where two adjacent characters would normally be processed as a unit in a given language -- but in this case shouldn't be. It is the more useful function because nobody is going to mark all the cases in text for a given language that *normally* are treated as a unit. On the other hand, in collating Slovak one may want to know that in some foreign word an adjacent c and h are to be treated as separate, and mark that case. The volume of exceptions is a fraction of the normal cases -- otherwise they wouldn't be exceptions! So using the non-joiner is much more useful.

And this works really well with collation. The mere presence of a character that has no weights, but that blocks the formation of a contraction that would otherwise occur, is something that can be generally implemented.
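
To make the blocking behavior concrete, here is a minimal sketch in Python. It is a toy model, not the Unicode Collation Algorithm: the weights are made up, the function name sort_key and the SLOVAK table are hypothetical, and U+034F merely stands in for whichever character plays the non-joiner role. It assumes lowercase ASCII input.

```python
# Toy model only: made-up weights, not the real UCA/DUCET tables.
BLOCKER = "\u034F"  # assumption: stands in for the "non-joiner" role

# Hypothetical Slovak-style tailoring: "ch" is one unit sorting just after "h".
SLOVAK = {"ch": ord("h") * 10 + 5}


def sort_key(text, contractions):
    """Build a one-level sort key, honoring contractions unless blocked."""
    key = []
    i = 0
    while i < len(text):
        for seq, weight in contractions.items():
            # A contraction only forms when its characters are adjacent.
            if text.startswith(seq, i):
                key.append(weight)
                i += len(seq)
                break
        else:
            ch = text[i]
            # The blocker contributes no weight; its mere presence between
            # "c" and "h" is what stopped the contraction from matching.
            if ch != BLOCKER:
                key.append(ord(ch) * 10)
            i += 1
    return tuple(key)


words = ["chata", "hora", "cesta", "c" + BLOCKER + "hata"]
ordered = sorted(words, key=lambda w: sort_key(w, SLOVAK))
```

In this sketch the unmarked "chata" sorts after "hora" because "ch" is a unit, while the marked form sorts between "cesta" and "hora" as plain c followed by h. Note that the blocker needs no tailoring of its own: the same code works for any contraction table.
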
Each instance of a joiner, by contrast, has to be specific; in collation you need to know not only that it is a unit, but also how that unit behaves. Knowing that 'ch' is a unit doesn't help you: a Slovak ch and a Spanish ch will sort very differently. You have to tailor that whole unit as a contraction.

And prevention is what the CGJ is actually being used for in the recommendation for German; it is being used to *prevent* the collation contraction between O and dieresis that would normally occur in German phonebook sorting. It ends up having the opposite meaning to how we define it on p. 392.

As a collation non-joiner we currently have the overloaded SHY, which is a hack, since 'allowed positions for hyphenation' may or may not coincide with 'places where collation units should be broken'. It happens to work ok in Danish, but by chance; and SHY won't work within a combining sequence, which is why CGJ was pressed into service for German.

I see three courses:

- Continue as is, with a growing set of special cases like German, and no principled application.

- Bite the bullet and, for consistency with this application of CGJ in German and with the most probable future applications of CGJ, revise the semantics of the CGJ to be really that of a non-joiner, not a joiner. After all, the "CJ" part isn't accurate any more anyway ;-)

- Encode the CGNJ -- perhaps under the better name Collation Non-Joiner (CNJ) -- and use that for what ends up being the important usage, including this German case.

Mark