Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 04 2006 - 10:59:43 CDT

Next message: Doug Ewell: "Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)"

Previous message: Philippe Verdy: "Re: Vietnamese (Re: Unicode, SMS, PDA/cellphones)"
In reply to: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Reply: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Theodore H. Smith wrote on Sunday, June 04, 2006 at 12:38 PM

>> How do you, Theodore Smith, go about converting <U+0369, U+0345, U+0313,
>> U+0342> to upper case (and not title case)?

Correction: ᾦ <U+03C9, U+0345, U+0313, U+0342>, which should display the
same as ᾦ and ᾦ. The correct capital form is ὮΙ.

It seems that you would get the incorrect <U+03A9, U+0399, U+0313, U+0342>.

>> The correct upper case form (see
>> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt ) has three
>> canonically equivalent encodings:
>> <U+1F6E GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI, U+0399
>> GREEK CAPITAL LETTER IOTA>
>> <U+1F68, U+0342, U+0399>
>> <U+03A9, U+0313, U+0342, U+0399>
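
The equivalence of the three encodings above can be checked mechanically. The following is a minimal sketch, assuming Python's standard unicodedata module as the source of normalisation data; the helper name nfd is made up for this illustration. It compares the Normal Form D of each sequence and also tests the incorrect sequence <U+03A9, U+0399, U+0313, U+0342>.

```python
import unicodedata

def nfd(code_points):
    """Build a string from code points and return its Normal Form D."""
    text = "".join(chr(cp) for cp in code_points)
    return unicodedata.normalize("NFD", text)

# The three encodings listed above for the correct upper case form.
correct_forms = [
    [0x1F6E, 0x0399],
    [0x1F68, 0x0342, 0x0399],
    [0x03A9, 0x0313, 0x0342, 0x0399],
]

# The sequence a per-character uppercaser would produce.
incorrect_form = [0x03A9, 0x0399, 0x0313, 0x0342]

# Canonically equivalent sequences share a single Normal Form D.
decomposed = {nfd(seq) for seq in correct_forms}
all_equivalent = len(decomposed) == 1
incorrect_is_equivalent = nfd(incorrect_form) in decomposed

print(all_equivalent, incorrect_is_equivalent)
```

In the incorrect sequence the capital iota is a base letter, so the psili and perispomeni attach to it rather than to the omega, and no reordering can bring them back.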

>> Aside: What is the correct upper case form of <U+03B1, U+033D, U+0345>

> Mine gives: Α ̽ Ι

>> and <U+03B1, U+0345, U+033D>?

> Mine gives this: Α Ι ̽

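So your process is not Unicode-compliant. To use the standard citation form
for Unicode code points, your outputs <U+0391, U+033D, U+0399> and
<U+0391, U+0399, U+033D> are not canonically equivalent, whereas the inputs,
<U+03B1, U+033D, U+0345> and <U+03B1, U+0345, U+033D>, are. The two combining
marks in the inputs may be reordered freely, but U+0399 is a base letter, and
a combining mark cannot be reordered across it.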
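
The point can be reproduced with a short sketch. It assumes Python's unicodedata module, and it uses a hypothetical per-character uppercaser (upper_per_char) as a stand-in for a normalisation-blind process; it is not the code under discussion. Normalising to Normal Form D before casing is shown as one possible way to get the same result from both inputs, not as the only compliant approach.

```python
import unicodedata

def upper_per_char(text):
    """Hypothetical normalisation-blind uppercaser: maps each character alone."""
    return "".join(ch.upper() for ch in text)

def upper_after_nfd(text):
    """Assumed remedy: put the text into Normal Form D before casing."""
    return upper_per_char(unicodedata.normalize("NFD", text))

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

in1 = "\u03b1\u033d\u0345"  # alpha, combining x above, ypogegrammeni
in2 = "\u03b1\u0345\u033d"  # alpha, ypogegrammeni, combining x above

out1 = upper_per_char(in1)
out2 = upper_per_char(in2)

print(canonically_equivalent(in1, in2))    # the inputs
print(canonically_equivalent(out1, out2))  # the blind outputs
print(upper_after_nfd(in1) == upper_after_nfd(in2))
```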

> If you could explain Normalisation to me in 2 paragraphs, maybe I'll
> understand you better :)

Tricky if all you say is, 'I don't understand'. I had a go on Monday 29
May, but it took 4 paragraphs. Do you understand Normal Form D? That's the
simplest normalisation.
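
For reference, Normal Form D amounts to two steps: replace every character by its full canonical decomposition, then sort each run of combining marks by canonical combining class, keeping the original order of marks that share a class. The following is a rough sketch of those two steps, assuming Python's unicodedata module for the character data. The function names are invented for this illustration, and Hangul syllables, which decompose algorithmically, are deliberately left out.

```python
import unicodedata

def canonical_decompose(ch):
    """Recursively expand the canonical decomposition of one character."""
    mapping = unicodedata.decomposition(ch)
    # An empty mapping means no decomposition; a leading "<tag>" marks a
    # compatibility mapping, which Normal Form D does not apply.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(canonical_decompose(chr(int(cp, 16))) for cp in mapping.split())

def nfd_sketch(text):
    """Decompose, then reorder combining marks by combining class (no Hangul)."""
    out = []
    for ch in "".join(canonical_decompose(c) for c in text):
        ccc = unicodedata.combining(ch)
        i = len(out)
        if ccc:
            # Move a mark left past marks of strictly higher class only, so
            # starters (class 0) and marks of equal class stay in place.
            while i > 0 and unicodedata.combining(out[i - 1]) > ccc:
                i -= 1
        out.insert(i, ch)
    return "".join(out)

sample = "\u03c9\u0345\u0313\u0342"  # the lower case sequence discussed above
print([f"U+{ord(c):04X}" for c in nfd_sketch(sample)])
```

Comparing nfd_sketch against unicodedata.normalize("NFD", ...) on the Greek sequences in this thread is a quick sanity check of the sketch.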

> So far my UTF-8 uppercaser/lowercaser is doing quite well eh? And the
> best thing is, it's Unicode blind. It's only byte aware.

Vanilla uppercasing and lowercasing is mostly simple. The exceptions are
Greek (all locales) and the Lithuanian, Turkish and Azerbaijani locales.
These exceptions are where slight knowledge of the semantics comes in.
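
As an illustration of one such locale exception, here is a small sketch of Turkish and Azerbaijani uppercasing in Python, whose built-in str.upper is locale-independent. The function upper_tr is a hypothetical helper written for this example, not a library call, and it handles only the dotted and dotless i; the Greek and Lithuanian cases need further rules that are not shown.

```python
def upper_tr(text):
    """Sketch of Turkish/Azerbaijani uppercasing (dotted and dotless i only)."""
    # In these locales, i upper-cases to U+0130 (capital I with dot above)
    # and U+0131 (dotless i) upper-cases to plain I.
    text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()

word = "diyarbak\u0131r"
default_upper = word.upper()   # locale-independent mapping
turkish_upper = upper_tr(word)

print(default_upper, turkish_upper)
```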

Richard.


This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 11:15:13 CDT