Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (
Date: Wed Jun 14 2006 - 05:27:22 CDT

    <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308> ( ῴ̂̇̈ ) will
    uppercase to: U+03A9, U+0301, U+0302, U+0307, U+0308, U+0399,
    ( Ώ̂̇̈Ι ), using my uppercaser, now.

    <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308, U+0F73>
    ( ῴཱི̂̇̈ ) will uppercase to: U+03A9, U+0F71, U+0F72, U+0301,
    U+0302, U+0307, U+0308, U+0399 ( Ώཱི̂̇̈Ι ) now
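
    For comparison, here is a minimal sketch of the same two cases using
    only Python's standard unicodedata module and str.upper(). It is not
    the uppercaser described above, and the names codepoints, sample_a,
    sample_b and upper_via_nfd are made up for this illustration. It
    assumes the interpreter's Unicode tables give U+0F73 the canonical
    decomposition <U+0F71, U+0F72> and map U+0345 to U+0399 on uppercasing.

```python
import unicodedata

def codepoints(s):
    """Format a string as a list of U+XXXX labels."""
    return ["U+%04X" % ord(ch) for ch in s]

# The two input sequences quoted in the message.
sample_a = "\u03C9\u0345\u0301\u0302\u0307\u0308"
sample_b = sample_a + "\u0F73"

def upper_via_nfd(s):
    # normalize("NFD") performs decomposition and canonical reordering
    # in one library call; str.upper() then applies Python's built-in
    # case mapping to the normalized string.
    return unicodedata.normalize("NFD", s).upper()

upper_a = upper_via_nfd(sample_a)
upper_b = upper_via_nfd(sample_b)
print(codepoints(upper_a))
print(codepoints(upper_b))
```

    If the library's tables agree with the assumptions above, the printed
    sequences should match the ones listed in the message.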

    Funnily enough... when I do an NFD on ( ῴཱི̂̇̈ ), it ends up
    looking like a different character: ( ῴཱི̂̇̈ ). I'm not
    sure why it should look different. Either a bug in my code, or
    perhaps my OS is using an older version of Unicode?

    "Is this single pass, or multi-pass? I think it has to be multi-pass.
    To transform to NFD, it needs, for Unicode 4.1.0, 55,903 codepoint
    swaps to be stored in the data table."

    While my uppercaser does do stuff using a single-pass replacement...
    my combining reorder does not use a "parallel replace all".

    It does a "parallel search", and then runs single-pass code, specific
    to combiner reordering, on the items found in the search. In theory
    it can also reorder byte-wise combining characters, given a correct
    data set, but I don't know whether a character set even exists whose
    characters take up one byte and that has combining characters!

    "However, I believe you are having to resort to multiple passes
    because you don't store canonical combining class."

    I am storing the class, in a 1-byte-long string :) Well, doing NFD
    takes two passes, because there is the decomposition pass, and then
    the reordering pass.
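
    The two-pass shape (decompose, then reorder by canonical combining
    class) can be sketched as follows. This is only an illustration built
    on Python's unicodedata tables, not the search-based code described
    in this message; the names decompose, reorder and nfd_two_pass are
    hypothetical, and the sketch leaves out the algorithmic Hangul
    decomposition, so it is not a complete NFD.

```python
import unicodedata

def decompose(s):
    """Pass 1: recursively apply canonical decomposition mappings."""
    out = []
    for ch in s:
        mapping = unicodedata.decomposition(ch)
        if mapping and not mapping.startswith("<"):
            # Canonical mapping: expand it, then decompose the parts too.
            parts = "".join(chr(int(cp, 16)) for cp in mapping.split())
            out.extend(decompose(parts))
        else:
            # No mapping, or a compatibility mapping tagged "<...>".
            out.append(ch)
    return out

def reorder(chars):
    """Pass 2: stable-sort each run of non-starters by combining class."""
    out, run = [], []
    for ch in chars:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def nfd_two_pass(s):
    return reorder(decompose(s))

sample = "\u03C9\u0345\u0301\u0302\u0307\u0308\u0F73"
print(nfd_two_pass(sample) == unicodedata.normalize("NFD", sample))
```

    Python's sorted() is stable, which matters here: marks with the same
    combining class must keep their relative order.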

    The reordering pass does not use my "multiple-replaceall" algorithm.
    It does use the canonical combining classes. A multi-pass approach,
    while possible, is something I wouldn't do; it would take too long.

    To do uppercase is then three passes: decomp, reorder, uppercase.

    I hope I don't have to do an NFD after the uppercase?
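
    One way to probe that question empirically is to uppercase an
    NFD string and test whether normalizing again changes it. The sketch
    below does this for the two sequences from this message only, using
    Python's unicodedata and str.upper() rather than the uppercaser
    discussed here; upper_stays_nfd and samples are illustrative names.
    A passing check on two inputs says nothing about every string.

```python
import unicodedata

# The two input sequences quoted earlier in the message.
samples = [
    "\u03C9\u0345\u0301\u0302\u0307\u0308",
    "\u03C9\u0345\u0301\u0302\u0307\u0308\u0F73",
]

def upper_stays_nfd(s):
    """True if uppercasing the NFD form of s gives a string still in NFD."""
    upper = unicodedata.normalize("NFD", s).upper()
    # A string is in NFD exactly when normalizing it leaves it unchanged.
    return unicodedata.normalize("NFD", upper) == upper

results = [upper_stays_nfd(s) for s in samples]
print(results)
```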

