Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (delete@elfdata.com)
Date: Mon Jun 05 2006 - 11:43:25 CDT

Next message: Mike Ayers: "Re: Vietnamese (Re: Unicode, SMS, PDA/cellphones)"

Previous message: Erkki Kolehmainen: "Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)"
In reply to: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit"
Reply: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit"
Reply: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"

Hi Richard,

I've looked at your email, and I'm going to try to rephrase what you
said in my own words, to see if I got it.

1) Some characters can be composed or decomposed (this I knew
already, no problem.)

2) Combining characters have "combining classes", which basically
say whether the order of two or more combining characters matters.
For example, a dot below and a dot above a letter can never be in the
same place, so whether you write the dot below before or after the
dot above, it's still the same letter: the two sequences are
"canonically equivalent". If two combining characters have the same
combining class, then their order does matter; the combiners might
stack or be arranged horizontally somehow.
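
Here is a small sketch of point 2, assuming Python's standard
unicodedata module as a stand-in for whatever character database the
real code uses. The helper name canonically_equivalent is my own, and
the dots and accents are just example marks.

```python
import unicodedata

DOT_BELOW = "\u0323"  # COMBINING DOT BELOW
DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
GRAVE = "\u0300"      # COMBINING GRAVE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two dots have different combining classes, so their order is free.
dot_classes = (unicodedata.combining(DOT_BELOW), unicodedata.combining(DOT_ABOVE))
dots_equivalent = canonically_equivalent("a" + DOT_BELOW + DOT_ABOVE,
                                         "a" + DOT_ABOVE + DOT_BELOW)

# The two accents share a combining class, so swapping them changes the text.
accent_classes = (unicodedata.combining(ACUTE), unicodedata.combining(GRAVE))
accents_equivalent = canonically_equivalent("a" + ACUTE + GRAVE,
                                            "a" + GRAVE + ACUTE)

print(dot_classes, dots_equivalent)
print(accent_classes, accents_equivalent)
```

The two dots should report different classes and compare as
equivalent, while the two accents share a class and should not.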

3) Each unique glyph has one and only one sequence of code points in
NFD. This is a very good thing! Because it makes processing Unicode
start to resemble sanity :) To reorder the combiners whose order
doesn't matter, we just sort them by their combining class number!
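
The reordering in point 3 can be sketched roughly like this. It is
only an illustration, not a full NFD implementation: it assumes the
input is already fully decomposed, the function name
canonical_reorder is made up, and Python's unicodedata module supplies
the combining classes. The sort has to be stable so that marks with
the same class keep their relative order.

```python
import unicodedata


def canonical_reorder(decomposed: str) -> str:
    """Sort each run of combining marks by combining class (stable sort).

    Assumes the input contains no precomposed characters; a real NFD
    routine would decompose first and then do this step.
    """
    out = []
    run = []  # pending combining marks that follow the last starter
    for ch in decomposed:
        if unicodedata.combining(ch) == 0:
            # A starter (class 0) ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# Dot above (class 230) written before dot below (class 220).
sample = "a\u0307\u0323"
reordered = canonical_reorder(sample)
matches_library = reordered == unicodedata.normalize("NFD", sample)
print([f"U+{ord(c):04X}" for c in reordered], matches_library)
```

Python's sorted() is documented as stable, which is what lets the
sketch get away with a plain sort on the class number.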

OK, I think I get the problem with my uppercaser of the Omega. It's
uppercasing <U+03C9, U+0345, U+0313, U+0342> to
<U+03A9, U+0399, U+0313, U+0342>, when it should result in
<U+03A9, U+0313, U+0342, U+0399>. The U+0399 is in the wrong place,
basically.
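
To make the Omega case concrete, here is a rough reproduction using
Python's built-in str.upper() and unicodedata.normalize() in place of
my own uppercaser. That substitution is an assumption: I expect
str.upper() to map each code point without reordering, which is the
same mistake my code makes, and the helper code_points is only there
for printing.

```python
import unicodedata

# The sequence from above: omega, iota subscript, psili, perispomeni.
original = "\u03c9\u0345\u0313\u0342"

# Uppercasing code point by code point leaves the iota right after the omega.
naive = original.upper()

# Normalizing to NFD first moves U+0345 (class 240) behind the two
# class-230 marks, so its uppercase form ends up last.
fixed = unicodedata.normalize("NFD", original).upper()


def code_points(s: str) -> str:
    """Format a string as a list of U+XXXX code points for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


print("naive:", code_points(naive))
print("fixed:", code_points(fixed))
```

The design choice here is simply "normalize, then case-map"; whether
the result should be normalized again afterwards is a separate
question that this sketch does not address.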

I had a look into this, "why did my code do the wrong thing". The
result of my investigation? I should have read the entire
SpecialCasing.txt file manually to see what it says before hoping my
code would generate the right results from using it :) That was my
mistake: just writing some code that works most of the time without
making sure it works all the time.

I'll fix my code to handle that funny iota-subscript character,
probably by using some kind of NFD code.

Your uppercasing and underlining example makes me think. Is it true
that this "combiner uppercasing to a non-combiner" character, the
iota subscript, can cause many problems all over Unicode because of
its very unusual behaviour? You mentioned that Indic vowels will also
uppercase into non-combiners. But does that need special treatment
beyond NFD-ing the text first? I don't see any mention of Indic
within SpecialCasing.txt.

By the way, does: Α̽Ι (U+0391, U+033D, U+0399), lowercase to
α̽ι (U+03B1, U+033D, U+03B9)? Or to ᾳ̽ (U+03B1, U+033D, U+0345)?
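
One quick way to probe this is to ask an existing implementation. The
sketch below uses Python's str.lower() and str.upper(), on the
assumption that they follow the default, untailored Unicode case
mappings; a Greek-aware tailoring could behave differently.

```python
# Capital alpha, combining x above, capital iota.
upper_form = "\u0391\u033d\u0399"
lowered = upper_form.lower()

# Start from the iota-subscript spelling and go up and back down again.
subscript_form = "\u03b1\u033d\u0345"
round_trip = subscript_form.upper().lower()

print([f"U+{ord(c):04X}" for c in lowered])
print([f"U+{ord(c):04X}" for c in round_trip])
print("round trip preserved:", round_trip == subscript_form)
```

If the default mappings are what I think they are, the capital iota
comes back as an ordinary small iota, so uppercasing followed by
lowercasing does not restore the iota subscript.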

Richard you've done me a great service already by spending what looks
like a huge amount of your expert time answering my questions for
free, and I know an expert's time can usually command a high price :)

Apologies for all the questions. I'll make it worth it, however, by
adding some NFD code and fixing all the bugs you've made me aware of.

This archive was generated by hypermail 2.1.5 : Mon Jun 05 2006 - 11:58:54 CDT