Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Jun 04 2006 - 06:38:04 CDT

    > Unnecessary. Just sketch the solutions.
    >
    >> Would that prove to you that you can do uppercasing and
    >> lowercasing on UTF-8 without worrying about the codepoints?
    >
    > Here's a test case -
    > U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND
    > YPOGEGRAMMENI
    >
    > U+1FA6 decomposes to <U+03C9, U+0313, U+0342, U+0345> (combining
    > classes 0, 230, 230 and 240 respectively).

    My UTF-8 decomposer gives that result :)

    Although it expressed the decomp like this: &#x03C9; &#x0313;
    &#x0342; &#x0345;

    It uppercased the UTF-8 form to a UTF-8 sequence equivalent to
    this: &#x03A9; &#x0313; &#x0342; &#x0399;
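
    For readers who want to check the two results above independently,
    here is a minimal sketch. It uses Python's standard unicodedata
    module and str.upper() as a reference; both are stand-ins chosen
    for this illustration and are not the byte-level toolkit described
    in this message.

```python
import unicodedata

# U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
omega = "\u1fa6"

# Full canonical decomposition (NFD) and the combining class of each code point.
decomposed = unicodedata.normalize("NFD", omega)
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
classes = [unicodedata.combining(ch) for ch in decomposed]

# str.upper() applies Python's built-in full case mappings, one character at a time.
upper_of_decomposed = [f"U+{ord(ch):04X}" for ch in decomposed.upper()]

print(code_points)
print(classes)
print(upper_of_decomposed)
```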

    > How do you, Theodore Smith, go about converting <U+03C9, U+0345,
    > U+0313, U+0342> to upper case (and not title case)?
    >
    > The correct upper case form (see
    > http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt ) has three
    > canonically equivalent encodings:
    > <U+1F6E GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI,
    > U+0399 GREEK CAPITAL LETTER IOTA>
    > <U+1F68, U+0342, U+0399>
    > <U+03A9, U+0313, U+0342, U+0399>
    >
    > Aside: What is the correct upper case form of <U+03B1, U+033D,
    > U+0345>

    Mine gives: &#x0391; &#x033D; &#x0399;

    > and <U+03B1, U+0345, U+033D>?

    Mine gives this: &#x0391; &#x0399; &#x033D;
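
    The two inputs in this aside are canonically equivalent, but the
    two outputs are not, because U+0399 is a base character (combining
    class 0) and marks cannot be reordered across it. A small sketch of
    that check, again using Python's unicodedata and str.upper() as an
    assumed reference rather than the byte-level uppercaser:

```python
import unicodedata

def nfd(text):
    """Return the canonical decomposition (Normal Form D) of text."""
    return unicodedata.normalize("NFD", text)

a = "\u03b1\u033d\u0345"  # alpha, COMBINING X ABOVE (class 230), ypogegrammeni (class 240)
b = "\u03b1\u0345\u033d"  # the same marks in the other order

# The inputs differ only in the order of marks with different combining classes.
inputs_equivalent = nfd(a) == nfd(b)

# Uppercasing each input as given turns U+0345 into the starter U+0399,
# which freezes the position of U+033D relative to it.
outputs_equivalent = nfd(a.upper()) == nfd(b.upper())

# Normalising before uppercasing makes both inputs produce the same result.
normalised_first = nfd(a).upper() == nfd(b).upper()

print(inputs_equivalent, outputs_equivalent, normalised_first)
```

    Under these assumptions the inputs compare equal after NFD, while
    the uppercased outputs only agree if the input is normalised first.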

    > Is it truly <U+0391, U+033D, U+0399>? I suspect it depends on the
    > semantics being applied to U+033D COMBINING X ABOVE.
    >
    > Conversion to normal form D sounds rather brute force. By my
    > calculation, for Unicode 4.1 you have 55,903 pairs of characters to
    > swap round, composed from the 384 characters not of combining class 0.

    Yes... I don't do Normalisation yet on UTF-8, because I still don't
    understand Normalisation properly :)
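
    For context, the reordering step of Normal Form D is usually done
    on code points with a per-character combining class and a stable
    sort, rather than with a table of character pairs. The sketch below
    shows only that step (no decomposition); canonical_reorder is a
    name made up for this example, and it works on decoded strings, so
    it is not a byte-only method.

```python
import unicodedata

def canonical_reorder(text):
    """Stable-sort each run of combining marks by canonical combining class.

    Characters with combining class 0 (starters) never move and act as
    boundaries between runs.
    """
    out = []
    run = []  # current run of characters with a non-zero combining class
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # sorted() is stable, so marks with equal classes keep their order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Omega followed by marks of class 240, 230, 230.
scrambled = "\u03c9\u0345\u0313\u0342"
reordered = canonical_reorder(scrambled)
print([f"U+{ord(ch):04X}" for ch in reordered])
```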

    > Normal Form C is even worse for brute force. Just to compose
    > U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI you have to have
    > 384-8 = 376 3-element substitutions, such as <U+03B1, U+033D,
    > U+0345> to <U+1FB3, U+033D>, 376 * 376 = 141,376 4-element
    > substitutions,... (It has been suggested that it is unreasonable
    > to ask for sequences of more than 30 combining characters to be
    > processed properly.)

    If you could explain Normalisation to me in two paragraphs, maybe
    I'll understand you better :)

    So far my UTF-8 uppercaser/lowercaser is doing quite well eh? And the
    best thing is, it's Unicode blind. It's only byte aware.

    I really should put this into a web available form because that
    statement seems to put people's minds into a loop.

    As for "Just sketch the solutions"... I did that already, in previous
    emails. It requires a string-based dictionary to work at all. That is
    not too hard, as even the STL's hash_map can do this on a char*.
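
    A minimal sketch of the dictionary idea, assuming a table that maps
    the UTF-8 bytes of each character to the UTF-8 bytes of its upper
    case form. Here the table is generated from Python's own
    str.upper() for the Basic Multilingual Plane, which is an
    assumption made for illustration; the names build_upper_table and
    upper_utf8 are hypothetical. Once the table exists, the conversion
    itself only ever looks at bytes.

```python
def build_upper_table():
    """Map UTF-8 bytes of each BMP character to the bytes of its upper case form."""
    table = {}
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates cannot be encoded as UTF-8
        ch = chr(cp)
        up = ch.upper()
        if up != ch:
            table[ch.encode("utf-8")] = up.encode("utf-8")
    return table

UPPER = build_upper_table()
MAX_KEY = max(len(key) for key in UPPER)

def upper_utf8(data):
    """Uppercase valid UTF-8 bytes using byte-string lookups only."""
    out = bytearray()
    i = 0
    while i < len(data):
        # Try the longest candidate key first, then shorter ones.
        for n in range(MAX_KEY, 0, -1):
            key = data[i:i + n]
            replacement = UPPER.get(key)
            if replacement is not None:
                out += replacement
                i += len(key)
                break
        else:
            out.append(data[i])  # no key starts here: copy the byte through
            i += 1
    return bytes(out)

sample = "\u03c9\u0313\u0342\u0345".encode("utf-8")
print(upper_utf8(sample))
```

    Because every key starts with a UTF-8 lead byte, a lookup that
    begins on a continuation byte can never match, so the scan stays
    aligned on valid input without decoding anything.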

    And to do it efficiently, it requires a trie-based string dictionary
    that can detect the longest key at a given position within the
    string.
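
    A rough sketch of such a longest-match lookup, written for this
    note: ByteTrie and replace_all are hypothetical names, and the four
    table entries are illustrative only. It is not the toolkit
    mentioned in this message, just one way a byte-keyed trie can find
    the longest key starting at a given position.

```python
class ByteTrie:
    """A trie keyed on individual byte values, storing a bytes value per key."""

    def __init__(self):
        self.children = {}
        self.value = None

    def insert(self, key, value):
        node = self
        for byte in key:
            node = node.children.setdefault(byte, ByteTrie())
        node.value = value

    def longest_match(self, data, pos):
        """Return (key_length, value) for the longest key at data[pos:], or None."""
        node = self
        best = None
        i = pos
        while i < len(data):
            node = node.children.get(data[i])
            if node is None:
                break
            i += 1
            if node.value is not None:
                best = (i - pos, node.value)  # remember the longest hit so far
        return best

def replace_all(trie, data):
    """Scan data once, replacing the longest matching key at each position."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        match = trie.longest_match(data, pos)
        if match is None:
            out.append(data[pos])
            pos += 1
        else:
            length, value = match
            out += value
            pos += length
    return bytes(out)

# Illustrative entries only; the third key extends the first one.
trie = ByteTrie()
for src, dst in [("\u03c9", "\u03a9"), ("\u0345", "\u0399"),
                 ("\u03c9\u0345", "\u03a9\u0399"), ("\u1fa6", "\u1f6e\u0399")]:
    trie.insert(src.encode("utf-8"), dst.encode("utf-8"))

print(replace_all(trie, "\u03c9\u0313\u0342\u0345".encode("utf-8")))
```

    Each position costs at most one walk down the trie, bounded by the
    longest key, instead of one hash lookup per candidate key length.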

    Do you have a trie based string dictionary that works on unsigned
    chars? Do you have one which has a complete and powerful API for
    processing strings? If not, I can imagine why you haven't thought it
    was possible yet. Such tools aren't common.

    The toolkit I'm using, I wrote myself, and I've not seen anything
    like it yet.



    This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 06:55:00 CDT