From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Mar 22 2004 - 09:50:44 EST
From: "John Cowan" <cowan@ccil.org>
> First of all, this is an extended joke.
>
> The point of the joke is that Czech sorts "ch" as a single letter after
> "h", so using a COMBINING C BEFORE would make this happen automatically,
> provided the combining character sorted after all letters.
>
> Spanish also sorts "ch" as a single letter, but after "c", so here we
> want a COMBINING H AFTER.
What would Bretons like to see, then, for the "c'h" trigraph?
A COMBINING APOSTROPHE AFTER followed by a COMBINING H AFTER, both of them
sharing the same canonical combining class, or with the COMBINING APOSTROPHE
AFTER having a lower combining class than your joke-proposed COMBINING H AFTER?
Why not, then, a COMBINING APOSTROPHE H AFTER?
> Of course, this is really not the way to do language-sensitive collation.
It's true that Czech and Spanish do not need such a combining character.
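For illustration, here is a minimal Python sketch of how a collation tailoring
can treat "ch" as a single letter without any new character. The alphabets are
deliberately simplified (no accented letters except the Spanish "ñ"), and the
names CZECH, SPANISH_TRADITIONAL and make_key are hypothetical, not taken from
any real library.

```python
from typing import Callable, List, Tuple

# Simplified alphabets, listed as ordered collation elements.
# Assumption: accented letters are left out to keep the sketch short.
CZECH: List[str] = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
SPANISH_TRADITIONAL: List[str] = (
    list("abc") + ["ch"] + list("defghijkl") + ["ll"]
    + list("mn") + ["ñ"] + list("opqrstuvwxyz")
)


def make_key(alphabet: List[str]) -> Callable[[str], Tuple[int, ...]]:
    """Build a sort key that ranks words by the given alphabet."""
    rank = {element: i for i, element in enumerate(alphabet)}
    # Try the longest elements first so that "ch" wins over "c" + "h".
    by_length = sorted(alphabet, key=len, reverse=True)

    def key(word: str) -> Tuple[int, ...]:
        w = word.lower()
        out = []
        i = 0
        while i < len(w):
            for element in by_length:
                if w.startswith(element, i):
                    out.append(rank[element])
                    i += len(element)
                    break
            else:
                # Unknown character: sort it after the whole alphabet.
                out.append(len(alphabet) + ord(w[i]))
                i += 1
        return tuple(out)

    return key


czech_sorted = sorted(["chata", "ikona", "hrad", "cibule"], key=make_key(CZECH))
spanish_sorted = sorted(["chico", "dama", "cuna", "cama"],
                        key=make_key(SPANISH_TRADITIONAL))
```

With these word lists, the Czech key places "chata" between "hrad" and "ikona",
and the Spanish key places "chico" after "cuna". A trigraph such as "c'h" would
simply be one more entry in the alphabet list.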
The question of apostrophes is more difficult, because some languages interpret
the apostrophe either as a punctuation mark or as a combining diacritic that is
part of a digraph or trigraph. One example is the APOSTROPHE-N that can occur
at the beginning of a word (in Czech too? I can't remember that case). It
causes some headaches when one wants to produce a title-cased word starting
with that "sequence" (which really is a digraph, whose title-case folding <'n>
is identical to the lowercase folding <'n>), because the same language may also
use the apostrophe as a quotation mark before a word that should be titlecased
independently.
One could resolve the ambiguity by adding a combining apostrophe before, to
allow recognizing the digraph <'n> encoded with <LATIN SMALL LETTER N, COMBINING
APOSTROPHE BEFORE>, but then this causes problems too when folding a word to
titlecase: if the language or this specific digraph is not known or recognized,
folding to titlecase may simply look at the first letter of the encoded
sequence, so that the first LATIN SMALL LETTER N would be uppercased.
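A small Python sketch of that failure, under loud assumptions: no COMBINING
APOSTROPHE BEFORE exists, so a private-use code point (U+E000) stands in for
it, and both function names are hypothetical.

```python
# Assumption: U+E000 (private use) plays the role of the hypothetical
# COMBINING APOSTROPHE BEFORE; it has no case mappings of its own.
COMBINING_APOSTROPHE_BEFORE = "\uE000"
DIGRAPH = "n" + COMBINING_APOSTROPHE_BEFORE


def naive_titlecase(word: str) -> str:
    """Uppercase the first character, knowing nothing about digraphs."""
    return word[:1].upper() + word[1:].lower()


def digraph_aware_titlecase(word: str) -> str:
    """Leave a leading <n, COMBINING APOSTROPHE BEFORE> in lowercase,
    since its titlecase form is identical to its lowercase form."""
    if word.startswith(DIGRAPH):
        return word.lower()
    return naive_titlecase(word)


naive_result = naive_titlecase(DIGRAPH)          # the N gets uppercased
aware_result = digraph_aware_titlecase(DIGRAPH)  # the digraph is left alone
```

The naive version is what a process that does not recognize the digraph would
do; the second version works only because it has been told about the digraph.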
Another solution is then to encode a separate apostrophe for use in isolated
combining sequences, so that it can be recognized as a plain letter. But then we
have to wonder how to do collation, if the apostrophe should be collated with
the letter that follows it in the word...
So the remaining simple solution is to encode <'n> and <'N> separately as
unbreakable digraph characters. If so, why not also encode the Breton <c'h> and
<C'H>? These are trigraphs only if we encode them with the classic Latin
alphabet, but not if you look at the definition of the Breton alphabet, where
they are unbreakable letters. The case is very similar to the <ae>, <AE>, <oe>,
and <OE> ligatures, which are considered plain letters in some languages and in
Unicode, but not in French.