Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham
Date: Fri Jun 09 2006 - 10:12:47 CDT

    Theodore H. Smith wrote on Monday, June 05, 2006 at 5:43 PM
    and I replied the same day, but the reply seems to have vanished, so I'm

    > 3) Each unique glyph has one and only one sequence of codepoints in NFD.
    > This is a very good thing! Because it makes processing Unicode start to
    > resemble sanity :) To reorder the combiners whose order doesn't matter, we
    > just use their combining class number!

    Not quite true, alas, but it's mostly true. Most of the exceptions within a
    script are where different characters have the same glyph, such as the
    letter C and the Roman numeral for 100. There are a few cases in Indic
    scripts where normalisation stability prevents the solution of canonical
    equivalence being applied, and there are some irremediable cases.
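
    As a rough illustration of both halves of that answer, here is a minimal
    sketch using Python's standard unicodedata module. The helper name
    nfd_codepoints and the sample strings are my own choices for illustration,
    not anything from the original exchange.

```python
import unicodedata

def nfd_codepoints(text):
    """Return the NFD form of text as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]

# Two typings of "a" with an acute above (combining class 230)
# and a dot below (combining class 220), in opposite orders.
acute_first = "a\u0301\u0323"
dot_first = "a\u0323\u0301"

# Canonical ordering sorts the marks by combining class, so both
# typings end up as the same NFD sequence.
same_nfd = nfd_codepoints(acute_first) == nfd_codepoints(dot_first)

# One of the exceptions: the letter C and U+216D ROMAN NUMERAL ONE HUNDRED
# usually share a glyph but are only compatibility equivalent.
letter_c = "C"
roman_hundred = "\u216d"
distinct_in_nfd = unicodedata.normalize("NFD", roman_hundred) != letter_c
folded_by_nfkd = unicodedata.normalize("NFKD", roman_hundred) == letter_c

print(nfd_codepoints(acute_first), same_nfd, distinct_in_nfd, folded_by_nfkd)
```

    The marks with different combining classes are reordered into one sequence,
    while the two "C" characters stay distinct under NFD and only merge under
    the compatibility form NFKD.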

    > I should have read the entire SpecialCasing.txt file manually to see what
    > it says before hoping my code will generate the right results from using
    > it :)

    Have you read TUS discussion of casing? It starts at Section 3.13. It's a
    bit uneven - the standard has clearly developed.

    > I'll fix my code to handle that funny iota-subscript character, probably
    > by using some kind of NFD code.

    > Your uppercasing and underlining example makes me think. Is it true that
    > this "combiner uppercasing to a non-combiner", character, the iota
    > subscript, can cause many problems all over Unicode, by its very unusual
    > behaviour?

    I'm not aware of any problems apart from casing. However, I think you've
    just spotted another casing problem with it! See below.

    > You mentioned that indic vowels will also uppercase into non-combiners.

    I don't think I did - Indic scripts don't have case. The point with Indic
    vowels is that some decompose into two combining class 0 components, so not
    all decompositions are into a combining class 0 character followed by one or
    zero non-zero combining class character. There are also two Tibetan
    combining class zero vowels that decompose into two non-zero combining class
    components.
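
    The combining classes involved can be inspected with Python's unicodedata
    module. This is only a sketch: the two sample characters (a Tamil and a
    Tibetan vowel sign) are my own picks as representatives, and the helper
    name decomposition_classes is hypothetical.

```python
import unicodedata

def decomposition_classes(ch):
    """Pair each code point of NFD(ch) with its canonical combining class."""
    return [
        (f"U+{ord(c):04X}", unicodedata.combining(c))
        for c in unicodedata.normalize("NFD", ch)
    ]

# U+0BCA TAMIL VOWEL SIGN O: splits into two combining class 0 pieces.
tamil_o = "\u0bca"
tamil_parts = decomposition_classes(tamil_o)

# U+0F73 TIBETAN VOWEL SIGN II: itself class 0, but both pieces are non-zero.
tibetan_ii = "\u0f73"
tibetan_parts = decomposition_classes(tibetan_ii)
tibetan_own_class = unicodedata.combining(tibetan_ii)

print(tamil_parts)
print(tibetan_own_class, tibetan_parts)
```

    So code that assumes "one class 0 character, then at most one non-zero
    mark" for every decomposition will trip over both of these shapes.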

    I give some examples of Greek text below, but be warned that they may not
    render properly. I've seen quite a variety of renderings as I've prepared
    this posting.

    > By the way, does: Α̽Ι (U+0391, U+033D, U+0399), lowercase to α̽ι
    > (U+03B1, U+033D, U+03B9)? Or to ᾳ̽ (U+03B1, U+033D, U+0345)?

    Casing operations are not reversible. U+FB00 LATIN SMALL LIGATURE FF upper
    cases to <U+0046, U+0046>, which lower cases to <U+0066, U+0066>.
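
    The ligature case is easy to check with Python's built-in string methods,
    which apply the full (multi-character) case mappings. A minimal sketch; the
    variable names are just for illustration.

```python
# U+FB00 LATIN SMALL LIGATURE FF
ligature_ff = "\ufb00"

# Uppercasing expands the ligature into two capital letters.
upper = ligature_ff.upper()

# Lowercasing the result gives two plain letters, not the ligature.
round_trip = upper.lower()

print([f"U+{ord(c):04X}" for c in upper])
print([f"U+{ord(c):04X}" for c in round_trip])
print(round_trip == ligature_ff)
```

    The round trip ends at <U+0066, U+0066>, so the original code point is lost.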

    By the rules, Α̽Ι lower cases to <U+03B1, U+033D, U+03B9>, which is not
    unreasonable. But your question raises a real issue. Greek for Hades is
    ᾍδης <U+0391, U+0314, U+0301, U+0345, U+03B4, U+03B7, U+03C2> or ᾅδης <U+03B1,
    U+0314, U+0301, U+0345, U+03B4, U+03B7, U+03C2>. This uppercases to ἍΙΔΗΣ
    <U+0391, U+0314, U+0301, U+0399, U+0394, U+0397, U+03A3>, which in turn
    lower cases by the rules to ἅιδης <U+03B1, U+0314, U+0301, U+03B9, U+03B4,
    U+03B7, U+03C2>. Note the special rule to give the correct form of small
    sigma! However, the placement of the breathing and initial accent is
    grammatically incorrect! The only possible spellings with the accents
    before the delta are ᾅδης and αἵδης <U+03B1, U+03B9, U+0314, U+0301,
    U+03B4, U+03B7, U+03C2>. They represent different pronunciations. (There's
    a third, attested possibility if you introduce a diaeresis.) Note that
    αἵδης would uppercase to ΑἽΔΗΣ <U+0391, U+0399, U+0314, U+0301, U+0394,
    U+0397, U+03A3> - or at least, it does by Unicode rules. I believe it also
    does in Liddell and Scott, but when a capital vowel follows another vowel,
    the accents appear to the latter's right in that dictionary. (This
    rendering behaviour is not mentioned in TUS Section 7.2. It even happens
    with a diaeresis, as in ἈΪ́Ω <U+0391, U+0313, U+0399, U+0308, U+0301,
    U+03A9>, in which the diaeresis and acute appear between the iota and the
    omega.) Would any Grecians care to comment?
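
    For anyone who wants to reproduce the round trip, here is a sketch using
    Python's str.upper() and str.lower(), which as far as I know follow the
    default Unicode full case mappings (including the final-sigma rule) with no
    locale tailoring. The helper codepoints and the variable names are my own.

```python
def codepoints(text):
    """Format a string as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Lowercase "Hades", fully decomposed, with the iota subscript U+0345.
hades_lower = "\u03b1\u0314\u0301\u0345\u03b4\u03b7\u03c2"

# Uppercasing turns the combining iota subscript into a capital iota.
hades_upper = hades_lower.upper()

# Lowercasing again yields a spacing small iota, not the subscript,
# leaving the breathing and accent on the alpha.
hades_round_trip = hades_upper.lower()

# The question about capital alpha + U+033D + capital iota.
alpha_x_iota = "\u0391\u033d\u0399".lower()

print(codepoints(hades_upper))
print(codepoints(hades_round_trip))
print(hades_round_trip == hades_lower)
print(codepoints(alpha_x_iota))
```

    The final sigma does come back correctly, but U+0345 does not: the result
    has U+03B9 after the accented alpha, which is the problematic spelling
    described above.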

    It looks as though the lowercasing rules ought to be changed! However,
    there are stability issues, so it may have to be restricted by locale, e.g.
    limited to all known locales rather than being independent of locale.

