Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith
Date: Sun Jun 04 2006 - 13:05:54 CDT

    On 4 Jun 2006, at 16:59, Richard Wordingham wrote:

    > Theodore H. Smith wrote on Sunday, June 04, 2006 at 12:38 PM
    >>> How do you, Theodore Smith, go about converting <U+0369, U+0345,
    >>> U+0313, U+0342> to upper case (and not title case)?
    > Correction: ᾦ <U+03C9, U+0345, U+0313, U+0342>, which should
    > display the same as ᾦ and ᾦ. The correct capital form is
    > ὮΙ.
    > It seems that you would get the incorrect <U+03A9, U+0399, U+0313,
    > U+0342>.

    I do indeed get &#x03A9; &#x0399; &#x0313; &#x0342; when trying to
    uppercase <U+03C9, U+0345, U+0313, U+0342>. Why? What's wrong with
    the result?

    If there is something wrong with the result, perhaps smarter input
    data for my UTF-8 conversion tables would give me the correct result.
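
    One way to see what differs between the two results is a minimal
    Python sketch. It assumes the standard library's unicodedata module
    and str.upper() as stand-ins for the table-driven UTF-8 code discussed
    in this thread; the helper names upper_nfd and show are made up for
    the example.

```python
import unicodedata

def upper_nfd(text: str) -> str:
    """Hypothetical helper: put marks in canonical order, then uppercase."""
    return unicodedata.normalize("NFD", text).upper()

def show(s: str) -> str:
    """Format a string in the <U+XXXX ...> citation style used in the thread."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# omega, ypogegrammeni (iota subscript), psili, perispomeni
src = "\u03c9\u0345\u0313\u0342"

naive = src.upper()       # each code point mapped where it stands
fixed = upper_nfd(src)    # marks reordered first, so the iota ends up last

print(show(naive))        # U+03A9 U+0399 U+0313 U+0342
print(show(fixed))        # U+03A9 U+0313 U+0342 U+0399
print(show(unicodedata.normalize("NFC", fixed)))  # U+1F6E U+0399
```

    In the first result the capital iota sits between the omega and the
    two accents, so the accents attach to the iota. In the second they
    stay on the omega, which matches the capital form quoted above.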

    >> Mine gives: &#x0391; &#x033D; &#x0399;
    >>> and <U+03B1, U+0345, U+033D>?
    >> Mine gives this: &#x0391; &#x0399; &#x033D;
    > So your process is not Unicode-compliant, for, to use the standard
    > citation form for Unicode codepoints, <U+0391, U+033D, U+0399> and
    > <U+0391, U+0399, U+033D> are not canonically equivalent, whereas
    > the inputs, <U+03B1, U+033D, U+0345> and <U+03B1, U+0345, U+033D>,
    > are.

    I'm sorry? I don't get you. Are you saying that if you take in two
    equivalent inputs, uppercase them, then the outputs aren't equivalent?

    That is to say that for f(x)=y, you can get different values of y for
    the same value of x?
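
    The claim about equivalent inputs and non-equivalent outputs can be
    checked with a short Python sketch. It assumes that comparing NFD
    forms via the standard unicodedata module is an acceptable test for
    canonical equivalence; canonically_equivalent is a name invented for
    this example.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two inputs from the thread: alpha with U+033D and U+0345 in either order.
in1 = "\u03b1\u033d\u0345"
in2 = "\u03b1\u0345\u033d"

# Per-character uppercasing, with no reordering of the marks.
out1 = in1.upper()   # <U+0391, U+033D, U+0399>
out2 = in2.upper()   # <U+0391, U+0399, U+033D>

print(canonically_equivalent(in1, in2))    # True
print(canonically_equivalent(out1, out2))  # False
```

    The inputs differ only in the order of two combining marks, which
    normalisation irons out. After uppercasing, U+0345 has become the
    ordinary letter U+0399, and normalisation never moves a mark across
    a letter, so the two outputs stay different.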

    You obviously know a bit more about Unicode than me. I'm not familiar
    with those glyphs or what the output should be, or even why my
    conversion table creation code didn't create the correct table to
    handle this case.

    Can you explain maybe in another way what was wrong with the result
    of my code?

    >> If you could explain Normalisation to me in 2 paragraphs, maybe
    >> I'll understand you better :)
    > Tricky if all you say is, 'I don't understand'. I had a go on
    > Monday 29 May, but it took 4 paragraphs. Do you understand Normal
    > Form D? That's the simplest normalisation.

    Nope. I don't understand NFD yet either :) All I know is that
    combining characters may cause problems for search functions, and
    that some reordering is necessary to fix this. But what kind of
    reordering exactly I do not know.
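
    As a rough illustration of the reordering in question, here is a
    Python sketch of the canonical ordering step on its own. It assumes
    the input is already fully decomposed and uses unicodedata.combining()
    from the standard library to look up combining classes; the function
    name canonical_reorder is made up for this example, and full NFD also
    decomposes precomposed characters first.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Sort each run of combining marks by combining class (stable sort)."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter (class 0) ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

src = "\u03c9\u0345\u0313\u0342"   # omega + U+0345 (class 240) + two class-230 marks
ordered = canonical_reorder(src)

print(" ".join(f"U+{ord(c):04X}" for c in ordered))
# U+03C9 U+0313 U+0342 U+0345
print(ordered == unicodedata.normalize("NFD", src))  # True
```

    Marks with the same class keep their relative order, because their
    order can change how the text looks. Only marks of different classes
    are swapped, so two strings that differ merely in the order of such
    marks end up byte-for-byte identical and a search can match them.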

    >> So far my UTF-8 uppercaser/lowercaser is doing quite well eh? And
    >> the best thing is, it's Unicode blind. It's only byte aware.

    > Vanilla uppercasing and lowercasing is mostly simple. The
    > exceptions are Greek (all locales) and the Lithuanian, Turkish and
    > Azerbaijani locales. These exceptions are where slight knowledge of
    > the semantics comes in.

    Ahhhh, locales? Well I'm not using any locale right now, I'm just
    processing UnicodeData.txt and a few other files, to create a
    conversion table. This is a one-off preprocessing step that results
    in an 8K or so file.

    This 8K file (the conversion table) is fed into my parallel string
    replacement code.
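
    The general shape of such a pipeline might look like the Python
    sketch below. It is not the code described in this message: it
    assumes Python's built-in case mappings in place of a parsed
    UnicodeData.txt, a plain dictionary in place of the 8K table file,
    and a simple longest-match loop in place of the parallel string
    replacement. The names build_upper_table and upper_utf8 are invented
    for the example.

```python
import sys

def build_upper_table() -> dict:
    """One-off step: map UTF-8 bytes of a code point to bytes of its uppercase."""
    table = {}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:   # surrogates cannot be encoded as UTF-8
            continue
        ch = chr(cp)
        up = ch.upper()
        if up != ch:
            table[ch.encode("utf-8")] = up.encode("utf-8")
    return table

def upper_utf8(data: bytes, table: dict) -> bytes:
    """Replace byte sequences found in the table; copy everything else."""
    out = bytearray()
    i = 0
    while i < len(data):
        # A UTF-8 sequence is 1 to 4 bytes long; try the longest key first.
        for n in (4, 3, 2, 1):
            repl = table.get(data[i:i + n])
            if repl is not None:
                out += repl
                i += n
                break
        else:
            out.append(data[i])      # no entry: copy one byte unchanged
            i += 1
    return bytes(out)

table = build_upper_table()
print(upper_utf8("straße".encode("utf-8"), table).decode("utf-8"))  # STRASSE
```

    The replacement loop never decodes the input, so it only sees bytes.
    It also maps each sequence where it stands, so it does not reorder
    combining marks.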
