Re: Irish dotless I (was: Languages with letters that always take diacriticals

From: Philippe Verdy
Date: Mon Mar 22 2004 - 09:50:44 EST


    From: "John Cowan"
    > First of all, this is an extended joke.
    > The point of the joke is that Czech sorts "ch" as a single letter after
    > "h", so using a COMBINING C BEFORE would make this happen automatically,
    > provided the combining character sorted after all letters.
    > Spanish also sorts "ch" as a single letter, but after "c", so here we
    > want a COMBINING H AFTER.

    What would Bretons like to see, then, for the "c'h" trigraph?
    A COMBINING APOSTROPHE AFTER followed by a COMBINING H AFTER, both
    sharing the same canonical combining class? Or a COMBINING APOSTROPHE
    AFTER with a lower combining class than your joke-proposed COMBINING H AFTER?

    > Of course, this is really not the way to do language-sensitive collation.

    It's true that Czech and Spanish do not need such a combining character.
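
    To illustrate why no combining character is needed, here is a minimal
    sketch of language-sensitive collation in which "ch" is treated as a single
    collation element. The helper name make_key, the plain a-z alphabet, and
    the sample words are assumptions made for illustration; real Czech and
    Spanish tailorings (diacritics, multiple comparison levels) are omitted.

```python
# Illustrative alphabet only: diacritics and multi-level comparison are omitted.
BASE = "abcdefghijklmnopqrstuvwxyz"


def make_key(digraph_after: str):
    """Return a sort key treating "ch" as one letter placed right after `digraph_after`."""
    order = []
    for letter in BASE:
        order.append(letter)
        if letter == digraph_after:
            order.append("ch")
    rank = {element: i for i, element in enumerate(order)}

    def key(word: str):
        word = word.lower()
        elements, i = [], 0
        while i < len(word):
            if word.startswith("ch", i):
                # The digraph contributes a single collation element.
                elements.append(rank["ch"])
                i += 2
            else:
                # Raises KeyError for anything outside a-z (sketch limitation).
                elements.append(rank[word[i]])
                i += 1
        return elements

    return key


czech_key = make_key("h")    # "ch" sorts after "h"
spanish_key = make_key("c")  # "ch" sorts after "c"

words = ["cihla", "chata", "hora", "dub"]
czech_order = sorted(words, key=czech_key)
spanish_order = sorted(words, key=spanish_key)
codepoint_order = sorted(words)
```

    The same encoded text sorts differently under each key, so the ordering
    lives in the tailoring and not in the characters themselves.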

    The question of apostrophes is more difficult, because some languages
    interpret the apostrophe either as a punctuation mark or as a combining
    diacritic that forms part of a digraph or trigraph. One example is the
    APOSTROPHE-N that can occur at the beginning of a word (in Czech too? I can't
    remember that case). It causes some headaches when one wants to produce a
    title-cased word starting with that "sequence", which really is a digraph
    whose titlecase form <'n> is identical to its lowercase form <'n>. Worse, the
    same language may also use the apostrophe as a quotation mark before a word
    that should be titlecased.

    One could resolve the ambiguity by adding a combining apostrophe before, to
    allow recognizing the digraph <'n> encoded with <LATIN SMALL LETTER N, COMBINING
    APOSTROPHE BEFORE>, but then this causes problems too when folding a word to
    titlecase: if the language or this specific digraph is not known or recognized,
    folding to titlecase may simply look at the first letter of the encoded
    sequence, so that the first LATIN SMALL LETTER N would be uppercased.
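
    The failure mode described above can be sketched as follows. The proposed
    COMBINING APOSTROPHE BEFORE does not exist in Unicode, so a private-use
    code point stands in for it here; the function names and the list of
    protected digraphs are likewise hypothetical.

```python
# Hypothetical stand-in for the proposed COMBINING APOSTROPHE BEFORE, which does
# not exist in Unicode; a private-use code point is used purely for illustration.
COMBINING_APOSTROPHE_BEFORE = "\uE000"

# Digraphs whose titlecase form equals their lowercase form (assumed list).
UNCHANGED_IN_TITLECASE = ("n" + COMBINING_APOSTROPHE_BEFORE, "'n")


def naive_titlecase(word: str) -> str:
    """Uppercase the first cased character, as a language-blind folder would."""
    for i, ch in enumerate(word):
        if ch.upper() != ch or ch.lower() != ch:
            return word[:i] + ch.upper() + word[i + 1:]
    return word


def aware_titlecase(word: str) -> str:
    """Leave known digraphs untouched; otherwise fall back to the naive rule."""
    for digraph in UNCHANGED_IN_TITLECASE:
        if word.startswith(digraph):
            return word
    return naive_titlecase(word)


encoded = "n" + COMBINING_APOSTROPHE_BEFORE
naive_result = naive_titlecase(encoded)   # the N is uppercased
aware_result = aware_titlecase(encoded)   # the digraph is left alone
```

    Note that the aware version only works when the digraph is recognized, and
    it still cannot tell a digraph <'n> from a quotation mark followed by "n".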

    Another solution is then to encode a separate apostrophe for use in isolated
    combining sequences, so that it can be recognized as a plain letter. But then we
    have to wonder how to do collation, if the apostrophe should be collated with
    the letter that follows it in the word...

    So the remaining simple solution is to encode <'n> and <'N> separately as
    unbreakable digraph characters. If so, why not also encode the Breton <c'h>
    and <C'H>? They are trigraphs only if we encode them with the classic Latin
    alphabet, not if you look at the definition of the Breton alphabet, where
    they are unbreakable letters. The case is very similar to the <ae>, <AE>,
    <oe>, and <OE> ligatures, which are considered plain letters in some
    languages, and in Unicode, but not in French.
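
    As a point of comparison, Unicode already encodes a lowercase character of
    this kind, U+0149, and the snippet below inspects it next to the ae/oe
    letters using Python's standard unicodedata module. This is only a sketch
    for looking at existing character properties, not an argument for or
    against encoding further digraphs; the variable names are arbitrary.

```python
import unicodedata

N_APOSTROPHE = "\u0149"

# Character name as recorded in the Unicode Character Database.
name = unicodedata.name(N_APOSTROPHE)

# Compatibility decomposition: a modifier apostrophe followed by "n".
compat = unicodedata.normalize("NFKD", N_APOSTROPHE)

# There is no single uppercase code point, so uppercasing yields two characters.
upper = N_APOSTROPHE.upper()

# The ae/oe characters are encoded as atomic letters without decompositions.
ligature_names = [unicodedata.name(ch) for ch in "\u00e6\u00c6\u0153\u0152"]
ligatures_atomic = all(
    unicodedata.normalize("NFKD", ch) == ch for ch in "\u00e6\u00c6\u0153\u0152"
)
```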
