Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jun 09 2006 - 10:12:47 CDT

Next message: Adam Twardoch: "Re: Glyphs for German quotation marks"

Previous message: Richard Wordingham: "Re: Case folding"
In reply to: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Philippe Verdy: "Re: UTF-8 can be used for more than it is given credit"
Reply: Philippe Verdy: "Re: UTF-8 can be used for more than it is given credit"

Theodore H. Smith wrote on Monday, June 05, 2006 at 5:43 PM
and I replied the same day, but the reply seems to have vanished, so I'm
reposting.

> 3) Each unique glyph, has one and only sequence of codepoints in NFD. This
> is a very good thing! Because it makes processing Unicode start to
> resemble sanity :) To reorder the combiners whose order doesn't mater, we
> just use their combining class number!

Not quite true, alas, but it's mostly true. Most of the exceptions within a
script are where different characters have the same glyph, such as the
letter C and the Roman numeral for 100. There are a few cases in Indic
scripts where normalisation stability prevents canonical equivalence from
being introduced as the fix, and there are some irremediable cases.
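
To illustrate both points, here is a minimal sketch using Python's standard
unicodedata module. The language and the sample strings are my own choice,
not something taken from the thread. It shows canonical reordering by
combining class, and that a look-alike such as U+216D stays distinct under
NFD because it is only compatibility-equivalent to C.

```python
import unicodedata

# Two spellings of "a" with acute (class 230) and dot below (class 220).
s1 = "a\u0301\u0323"
s2 = "a\u0323\u0301"

nfd1 = unicodedata.normalize("NFD", s1)
nfd2 = unicodedata.normalize("NFD", s2)
same_after_nfd = nfd1 == nfd2  # marks are reordered by combining class

# U+216D ROMAN NUMERAL ONE HUNDRED looks like C but has only a
# compatibility mapping to it, so NFD keeps the two apart.
roman_c = "\u216d"
distinct_under_nfd = unicodedata.normalize("NFD", roman_c) != "C"
folded_under_nfkd = unicodedata.normalize("NFKD", roman_c) == "C"

print(same_after_nfd, distinct_under_nfd, folded_under_nfkd)
```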

> I should have read the entire SpecialCasing.txt file manually to see what
> it says before hoping my code will generate the right results from using
> it :)

Have you read the TUS discussion of casing? It starts at Section 3.13. It's
a bit uneven - the standard has clearly developed over time.

> I'll fix my code to handle that funny iota-subscript character, probably
> by using some kind of NFD code.

> Your uppercasing and underlining example makes me think. Is it true that
> this "combiner uppercasing to a non-combiner", character, the iota
> subscript, can cause many problems all over Unicode, by it's very unusual
> behaviour?

I'm not aware of any problems apart from casing. However, I think you've
just spotted another casing problem with it! See below.

> You mentioned that indic vowels will also uppercase into non-combiners.

I don't think I did - Indic scripts don't have case. The point with Indic
vowels is that some decompose into two combining class 0 components, so not
every decomposition is a combining class 0 character followed by at most one
character of non-zero combining class. There are also two Tibetan
combining class zero vowels that decompose into two non-zero combining class
characters.
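
A rough way to inspect such decompositions, again sketched with Python's
unicodedata module. The helper name decomposition_classes and the two sample
characters (U+0BCA and U+0F73) are picked by me for illustration; they are
not named in the thread.

```python
import unicodedata

def decomposition_classes(ch):
    """Return (code point, combining class) pairs for ch's canonical decomposition."""
    fields = unicodedata.decomposition(ch).split()
    # Compatibility mappings start with a tag such as "<compat>"; skip those.
    if not fields or fields[0].startswith("<"):
        return []
    return [(f"U+{cp}", unicodedata.combining(chr(int(cp, 16)))) for cp in fields]

tamil_o = "\u0bca"     # TAMIL VOWEL SIGN O (illustrative choice)
tibetan_ii = "\u0f73"  # TIBETAN VOWEL SIGN II (illustrative choice)

# Print each character's own combining class, then its components' classes.
print(unicodedata.combining(tamil_o), decomposition_classes(tamil_o))
print(unicodedata.combining(tibetan_ii), decomposition_classes(tibetan_ii))
```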

I give some examples of Greek text below, but be warned that they may not
render properly. I've seen quite a variety of renderings as I've prepared
this posting.

> By the way, does: Α̽Ι (U+0391, U+033D, U+0399), lowercase to α̽ι
> (U+03B1, U+033D, U+03B9)? Or to ᾳ̽ (U+03B1, U+033D, U+0345)?

Casing operations are not reversible. U+FB00 LATIN SMALL LIGATURE FF upper
cases to <U+0046, U+0046>, which lower cases to <U+0066, U+0066>.
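
The round trip can be checked with a short sketch. It assumes Python 3,
whose str.upper() and str.lower() use the full, locale-independent case
mappings; the variable names are mine.

```python
ligature = "\ufb00"        # LATIN SMALL LIGATURE FF
upper = ligature.upper()   # full case mapping expands it to two letters
lower_again = upper.lower()

print([f"U+{ord(c):04X}" for c in upper])
print([f"U+{ord(c):04X}" for c in lower_again])
print(lower_again == ligature)  # the ligature is not recovered
```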

By the rules, Α̽Ι lower cases to <U+03B1, U+033D, U+03B9>, which is not
unreasonable. But your question raises a real issue. Greek for Hades is ᾍδης
<U+0391, U+0314, U+0301, U+0345, U+03B4, U+03B7, U+03C2> or ᾅδης <U+03B1,
U+0314, U+0301, U+0345, U+03B4, U+03B7, U+03C2>. This uppercases to ἍΙΔΗΣ
<U+0391, U+0314, U+0301, U+0399, U+0394, U+0397, U+03A3>, which in turn
lower cases by the rules to ἅιδης <U+03B1, U+0314, U+0301, U+03B9, U+03B4,
U+03B7, U+03C2>. Note the special rule to give the correct form of small
sigma! However, the placement of the breathing and initial accent is
grammatically incorrect! The only possible spellings with the accents
before the delta are ᾅδης and αἵδης <U+03B1, U+03B9, U+0314, U+0301,
U+03B4, U+03B7, U+03C2>. They represent different pronunciations. (There's
a third, attested possibility if you introduce a diaeresis.) Note that
αἵδης would uppercase to ΑἽΔΗΣ <U+0391, U+0399, U+0314, U+0301, U+0394,
U+0397, U+03A3> - or at least, it does by Unicode rules. I believe it also
does in Liddell and Scott, but when a capital vowel follows another vowel,
the accents appear to the latter's right in that dictionary. (This
rendering behaviour is not mentioned in TUS Section 7.2. It even happens
with a diaeresis, as in ἈΪ́Ω <U+0391, U+0313, U+0399, U+0308, U+0301,
U+03A9>, in which the diaeresis and acute appear between the iota and the
omega.) Would any Grecians care to comment?
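
For anyone who wants to reproduce the sequences above, here is a hedged
sketch. It assumes that Python 3's built-in str.upper() and str.lower()
follow the default Unicode casing rules, including the final-sigma
condition; the helper code_points and the variable names are mine.

```python
def code_points(s):
    """Format a string as a list of U+XXXX code points."""
    return [f"U+{ord(c):04X}" for c in s]

# Lowercase Hades with rough breathing, acute and iota subscript (U+0345).
hades_lower = "\u03b1\u0314\u0301\u0345\u03b4\u03b7\u03c2"

hades_upper = hades_lower.upper()  # U+0345 becomes capital iota U+0399
round_trip = hades_upper.lower()   # the iota comes back as U+03B9, not U+0345

print(code_points(hades_upper))
print(code_points(round_trip))
print(round_trip == hades_lower)

# The quoted example: capital alpha, U+033D, capital iota.
quoted = "\u0391\u033d\u0399"
print(code_points(quoted.lower()))
```

The round trip ends with the accents on the alpha and a full-size iota after
them, which is the spelling questioned above.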

It looks as though the lowercasing rules ought to be changed! However,
there are stability issues, so it may have to be restricted by locale, e.g.
limited to all known locales rather than being independent of locale.

Richard.

