Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Tue Jun 13 2006 - 17:18:59 CDT


    Theodore H. Smith wrote on Tuesday, June 13, 2006 at 2:39 PM

    > On 4 Jun 2006, at 22:19, Richard Wordingham wrote:

    >> But http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt states that
    >> the upper case form of U+1FA6 is <U+1F6E, U+0399>. But
    >> <U+1F6E, U+0399> ~ <U+03A9, U+0313, U+0342, U+0399>, which is not
    >> canonically equivalent to <U+03A9, U+0399, U+0313, U+0342>. That
    >> is what is wrong.
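
The mismatch quoted above can be reproduced with a short sketch. It assumes Python's standard `unicodedata` module and that `str.upper()` applies the unconditional SpecialCasing.txt mappings; the helper names `nfd`, `canonically_equivalent` and `show` are made up for illustration and are not from either poster's code.

```python
import unicodedata

def nfd(s: str) -> str:
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return nfd(a) == nfd(b)

def show(s: str) -> str:
    """Format a string as <U+XXXX, ...> for display."""
    return "<" + ", ".join(f"U+{ord(c):04X}" for c in s) + ">"

# Two canonically equivalent spellings of the same lower-case letter.
precomposed = "\u1FA6"                  # omega + psili + perispomeni + ypogegrammeni
reordered = "\u03C9\u0345\u0313\u0342"  # ypogegrammeni written before the accents

# Upper-casing each spelling without normalising first.
upper_precomposed = precomposed.upper()
upper_reordered = reordered.upper()

if __name__ == "__main__":
    print("inputs equivalent: ", canonically_equivalent(precomposed, reordered))
    print("upper(precomposed):", show(upper_precomposed), "NFD", show(nfd(upper_precomposed)))
    print("upper(reordered):  ", show(upper_reordered), "NFD", show(nfd(upper_reordered)))
    print("outputs equivalent:", canonically_equivalent(upper_precomposed, upper_reordered))
```

The two inputs differ only in the order of their combining marks, but U+0345 (combining class 240) upper-cases to U+0399, a starter, so the marks that follow it can no longer be reordered across it.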

    > For what it's worth, even my "NFD that fails NormalizationTests.txt" code
    > (that I wrote over the weekend), now can handle this :)

    > <U+03C9, U+0345, U+0313, U+0342> (ᾦ) now will uppercase to
    > <U+03A9, U+0313, U+0342, U+0399> (ὮΙ), using my new UTF-8
    > uppercaser :)

    > Actually, now that I understand a little more of what's going on, I
    > can see that you did throw me a bit of a screw-ball here ;)

    What do you get for <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308>?

    And for the googly - <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308,
    U+0F73>?
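
For reference, here is a sketch of what a conformant NFD produces for these two test strings, assuming Python's `unicodedata` as the oracle. It says nothing about what the implementation under discussion returns; the names `classes`, `first` and `googly` are illustrative only. The catch in the second string is that U+0F73 has combining class 0 yet decomposes into two non-starters, which then have to sort ahead of the marks already seen.

```python
import unicodedata

def classes(s: str):
    """List (code point, canonical combining class) pairs for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in s]

first = "\u03C9\u0345\u0301\u0302\u0307\u0308"
googly = first + "\u0F73"

nfd_first = unicodedata.normalize("NFD", first)
nfd_googly = unicodedata.normalize("NFD", googly)

if __name__ == "__main__":
    # U+0F73 is a starter until it is decomposed.
    print("ccc(U+0F73)  :", unicodedata.combining("\u0F73"))
    print("decomposition:", unicodedata.decomposition("\u0F73"))
    print("NFD(first)   :", classes(nfd_first))
    print("NFD(googly)  :", classes(nfd_googly))
```

A decomposer that reorders marks before expanding U+0F73, or that treats every class-0 input character as the end of a combining sequence, would leave U+0F71 and U+0F72 stranded after U+0345.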

    > You were entirely correct: my code did not uppercase properly unless
    > it could handle denormalised characters, due to funny characters
    > which change from combiners to non-combiners during uppercasing.

    > My code basically works like this:
    <Snip>
    > 2) Unicode-blind stage, this does the uppercasing/lowercasing/NFD stuff.
    > It's all byte-aware! Well, more specifically, it is "variable length
    > string unit aware". But the "string units" are composed of bytes, not
    > shorts or longs.
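
One way to picture "variable length string unit aware" processing is the sketch below. It is a guess at the general shape, not the quoted author's code: the functions `utf8_unit_length`, `utf8_units` and `map_units` and the tiny `UPPER` table are all hypothetical, and the input is assumed to be well-formed UTF-8 (continuation bytes are not validated).

```python
def utf8_unit_length(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with this lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"invalid UTF-8 lead byte 0x{lead:02X}")

def utf8_units(data: bytes):
    """Yield each character as a bytes slice, never decoding to a code point."""
    i = 0
    while i < len(data):
        n = utf8_unit_length(data[i])
        unit = data[i:i + n]
        if len(unit) != n:
            raise ValueError("truncated UTF-8 sequence")
        yield unit
        i += n

# Hypothetical two-entry mapping table keyed directly by UTF-8 byte sequences.
UPPER = {
    "\u03C9".encode("utf-8"): "\u03A9".encode("utf-8"),  # omega -> OMEGA
    "\u0345".encode("utf-8"): "\u0399".encode("utf-8"),  # ypogegrammeni -> IOTA
}

def map_units(data: bytes, table) -> bytes:
    """Replace each unit found in the table; copy all other units unchanged."""
    return b"".join(table.get(unit, unit) for unit in utf8_units(data))
```

For example, `map_units("\u03C9\u0345".encode("utf-8"), UPPER)` yields the UTF-8 bytes of <U+03A9, U+0399>. The lookup key is the byte sequence itself, so no code point value is ever computed.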

    Is this single pass, or multi-pass? I think it has to be multi-pass. And,
    to transform to NFD, it needs, for Unicode 4.1.0, 55,903 codepoint swaps to
    be stored in the data table.

    > Does this prove that you can correctly process UTF-8 natively, on a
    > per-character basis, without intermediate conversion to codepoints or
    > UTF-32?

    The YPOGEGRAMMENI issue was not as bad as I first thought. And I owe you an
    apology, for it appears that your implementation actually was correct!
    Sorry. What you have now is merely linguistically better, rather than more
    correct. :-(

    I never thought it couldn't be done. However, I believe you are having to
    resort to multiple passes because you don't store canonical combining class.
    (Obviously, you could store that using a UTF-8 based trie. My code, written
    for understanding rather than speed, effectively uses a trie with letters
    from different alphabets - a 17 character alphabet (i.e. plane), a 512
    character alphabet (half-block within plane), and a 128 character alphabet
    (byte within the block).)
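
The three-alphabet trie described above might look roughly like the following sketch. It is not the code referred to in the message: the class `CccTrie`, the functions `split_codepoint`, `build_ccc_trie` and `canonical_reorder`, and the use of Python's `unicodedata.combining` as the data source are all assumptions made for illustration. The key is split as 17 x 512 x 128 = 0x110000, and storing the canonical combining class lets the reordering step run over each run of non-starters as soon as it ends.

```python
import unicodedata

def split_codepoint(cp: int):
    """Split a code point into (plane, half-block within plane, offset)."""
    return cp >> 16, (cp >> 7) & 0x1FF, cp & 0x7F

class CccTrie:
    """Three-level trie; absent branches mean combining class 0."""

    def __init__(self):
        self.planes = [None] * 17

    def set(self, cp: int, ccc: int) -> None:
        if ccc == 0:
            return  # class 0 is the default, so nothing is stored
        plane, half, offset = split_codepoint(cp)
        if self.planes[plane] is None:
            self.planes[plane] = [None] * 512
        if self.planes[plane][half] is None:
            self.planes[plane][half] = [0] * 128
        self.planes[plane][half][offset] = ccc

    def get(self, cp: int) -> int:
        plane, half, offset = split_codepoint(cp)
        level2 = self.planes[plane]
        if level2 is None or level2[half] is None:
            return 0
        return level2[half][offset]

def build_ccc_trie() -> CccTrie:
    """Fill the trie from Python's Unicode database (assumed data source)."""
    trie = CccTrie()
    for cp in range(0x110000):
        trie.set(cp, unicodedata.combining(chr(cp)))
    return trie

def canonical_reorder(cps, trie: CccTrie):
    """Stably sort each run of non-starters by combining class."""
    out = list(cps)
    start = 0
    while start < len(out):
        if trie.get(out[start]) == 0:
            start += 1
            continue
        end = start
        while end < len(out) and trie.get(out[end]) != 0:
            end += 1
        out[start:end] = sorted(out[start:end], key=trie.get)  # sorted() is stable
        start = end
    return out
```

With `trie = build_ccc_trie()`, calling `canonical_reorder([0x03C9, 0x0345, 0x0313, 0x0342], trie)` moves U+0345 (class 240) behind the two class-230 accents. Only half-blocks that contain at least one non-starter allocate a leaf, which keeps the table small.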

    Richard.



    This archive was generated by hypermail 2.1.5 : Tue Jun 13 2006 - 18:29:32 CDT