Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (delete@elfdata.com)
Date: Mon Jun 05 2006 - 11:43:25 CDT


    Hi Richard,

    I've looked at your email, and I'm going to try to rephrase what you
    said, with my words, to see if I got it.

    1) Some characters can be composed or decomposed (this I knew
    already, no problem.)

    2) Combining characters have "combining classes", which basically
    tell you whether the order of two or more combining characters
    matters. If the classes differ, the order doesn't matter. For
    example, a dot below and a dot above a letter can never be in the
    same place, so whether you write the below dot before or after the
    above dot, it's still the same letter: the two sequences are
    "canonically equivalent". If two combining characters have the same
    combining class, then their order does matter, because the combiners
    might stack or be arranged horizontally somehow.
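
    A minimal sketch of point 2, assuming Python's standard unicodedata
    module (this is not the code discussed in this thread; the helper
    name same_after_nfd is made up for illustration):

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230


def same_after_nfd(a, b):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Different combining classes: either order is canonically equivalent.
different_classes_equivalent = same_after_nfd(
    "a" + DOT_ABOVE + DOT_BELOW,
    "a" + DOT_BELOW + DOT_ABOVE,
)

# Same combining class: the order is significant, so NFD keeps both as they are.
same_class_equivalent = same_after_nfd(
    "a" + ACUTE + DOT_ABOVE,
    "a" + DOT_ABOVE + ACUTE,
)

print(unicodedata.combining(DOT_ABOVE), unicodedata.combining(DOT_BELOW))
print(different_classes_equivalent, same_class_equivalent)
```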

    3) Each unique glyph has one and only one sequence of codepoints in
    NFD. This is a very good thing! Because it makes processing Unicode
    start to resemble sanity :) To reorder the combiners whose order
    doesn't matter, we just use their combining class number!
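
    The reordering step in point 3 can be sketched roughly as follows,
    again assuming Python's unicodedata module. The function name
    canonical_reorder is hypothetical, and this covers only the
    reordering of already-decomposed text; real NFD also decomposes
    precomposed characters first.

```python
import unicodedata


def canonical_reorder(text):
    """Stable-sort each run of combining marks by combining class.

    Characters with combining class 0 act as boundaries and never move.
    """
    out = []
    run = []  # pending combining marks (class != 0)
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # sorted() is stable, so marks of equal class keep their order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)


# omega + iota subscript (240) + comma above (230) + perispomeni (230)
sample = "\u03c9\u0345\u0313\u0342"
reordered = canonical_reorder(sample)
print([f"U+{ord(c):04X}" for c in reordered])
```

    The stable sort matters: marks that share a class must stay in their
    original relative order, since that order is significant.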

    OK, I think I get the problem with my uppercaser of the Omega. It's
    uppercasing <U+03C9, U+0345, U+0313, U+0342> to <U+03A9, U+0399,
    U+0313, U+0342>, when it should result in <U+03A9, U+0313, U+0342,
    U+0399>. The U+0399 is in the wrong place, basically.
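
    For illustration only, here is a rough sketch of the "normalize to
    NFD, then uppercase" idea using Python's built-in str.upper() and
    unicodedata. It is not the uppercaser discussed here, and the name
    upper_via_nfd is made up.

```python
import unicodedata


def upper_via_nfd(text):
    """Hypothetical helper: NFD first, uppercase, then NFD again.

    The second pass is there because case mapping can produce text that
    is no longer in normalized form.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.upper())


def codepoints(text):
    return [f"U+{ord(c):04X}" for c in text]


src = "\u03c9\u0345\u0313\u0342"

naive = src.upper()         # maps codepoint by codepoint, in the given order
fixed = upper_via_nfd(src)  # the iota subscript is moved last before mapping

print(codepoints(naive))
print(codepoints(fixed))
```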

    I had a look into why my code did the wrong thing. The result of my
    investigation? I should have read the entire SpecialCasing.txt file
    manually to see what it says before hoping my code would generate
    the right results from using it :) That was my mistake: I just wrote
    some code that works most of the time without making sure it works
    all the time.

    I'll fix my code to handle that funny iota-subscript character,
    probably by using some kind of NFD code.

    Your uppercasing and underlining example makes me think. Is it true
    that this "combiner uppercasing to a non-combiner" character, the
    iota subscript, can cause many problems all over Unicode because of
    its very unusual behaviour? You mentioned that Indic vowels will also
    uppercase into non-combiners. But does that need special treatment
    beyond NFD-ing the text first? I don't see any mention of Indic
    within SpecialCasing.txt.

    By the way, does: Α̽Ι (U+0391, U+033D, U+0399), lowercase to
    α̽ι (U+03B1, U+033D, U+03B9)? Or to ᾳ̽ (U+03B1, U+033D, U+0345)?
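
    As a data point only, this is what one implementation's default case
    mapping does (Python's str.lower() and str.upper()). It does not
    settle what the right answer should be; it just shows that both
    lowercase candidates uppercase back to the same string.

```python
def codepoints(text):
    return [f"U+{ord(c):04X}" for c in text]


upper_form = "\u0391\u033d\u0399"      # capital alpha, x above, capital iota
lowered = upper_form.lower()

with_iota_letter = "\u03b1\u033d\u03b9"     # small alpha, x above, small iota
with_iota_subscript = "\u03b1\u033d\u0345"  # small alpha, x above, iota subscript

print(codepoints(lowered))
print(codepoints(with_iota_letter.upper()))
print(codepoints(with_iota_subscript.upper()))
```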

    Richard, you've done me a great service already by spending what looks
    like a huge amount of your expert time answering my questions for
    free, and I know an expert's time can usually command a high price :)

    Apologies for all the questions. I'll make it worth it however by
    adding some NFD code, and fixing all the bugs you've made me aware of.


