Re: UTF-8 can be used for more than it is given credit

From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Jun 04 2006 - 13:05:54 CDT

Next message: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit"

Previous message: Doug Ewell: "Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)"
In reply to: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit"
Reply: Doug Ewell: "Re: UTF-8 can be used for more than it is given credit"
Reply: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"

On 4 Jun 2006, at 16:59, Richard Wordingham wrote:

> Theodore H. Smith wrote on Sunday, June 04, 2006 at 12:38 PM
>
>>> How do you, Theodore Smith, go about converting <U+0369, U+0345,
>>> U+0313, U+0342> to upper case (and not title case)?
>
> Correction: ᾦ <U+03C9, U+0345, U+0313, U+0342>, which should
> display the same as ᾦ and ᾦ. The correct capital form is
> ὮΙ.
>
> It seems that you would get the incorrect <U+03A9, U+0399, U+0313,
> U+0342>.

I do indeed get Ω Ι ̓ ͂ when trying to
uppercase <U+03C9, U+0345, U+0313, U+0342>. Why? What's wrong with
the result?

If there is something wrong with the result, then perhaps with
smarter UTF-8 conversion tables I can get the correct result.
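
One way to see the difference being pointed at is to compare the two outputs under NFD. The sketch below uses Python's standard `unicodedata` module and assumes that the capital form quoted above as correct is the sequence <U+1F6E, U+0399>. The helper `cps` and the variable names are illustrative only.

```python
import unicodedata

def cps(s):
    """Format a string in the <U+XXXX, ...> citation style used in this thread."""
    return "<" + ", ".join(f"U+{ord(c):04X}" for c in s) + ">"

# Result reported above: every code point mapped on its own, order unchanged.
got = "\u03A9\u0399\u0313\u0342"
# Form cited as correct (assumed here to be <U+1F6E, U+0399>).
expected = "\u1F6E\u0399"

nfd_got = unicodedata.normalize("NFD", got)
nfd_expected = unicodedata.normalize("NFD", expected)

print(cps(nfd_got))
print(cps(nfd_expected))
print(nfd_got == nfd_expected)
```

In the reported result the breathing and the circumflex come after the capital iota, so they combine with the iota instead of the omega. In the decomposed form of the cited result they sit directly after the omega.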

>> Mine gives: Α ̽ Ι
>
>>> and <U+03B1, U+0345, U+033D>?
>
>> Mine gives this: Α Ι ̽
>
> So your process is not Unicode-compliant, for, to use the standard
> citation form for Unicode codepoints, <U+0391, U+033D, U+0399> and
> <U+0391, U+0399, U+033D> are not canonically equivalent, whereas
> the inputs, <U+03B1, U+033D, U+0345> and <U+03B1, U+0345, U+033D>,
> are.

I'm sorry? I don't get you. Are you saying that if you take in two
equivalent inputs, uppercase them, then the outputs aren't equivalent?

That is to say that for f(x)=y, you can get different values of y for
the same value of x?
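
A hedged illustration of the point at issue: the two inputs are different code point sequences, so f is still a function, but Unicode treats them as the same text, and a per-code-point mapping does not preserve that. The sketch below uses Python's `unicodedata` module. `SIMPLE_UPPER` is a hypothetical two-entry mapping standing in for a full conversion table, and normalising to NFD first is shown only as one possible way to make equivalent inputs produce identical output.

```python
import unicodedata

# Illustrative per-code-point mapping, limited to the characters used here.
SIMPLE_UPPER = {"\u03B1": "\u0391", "\u0345": "\u0399"}

def upper_per_code_point(s):
    """Map each code point independently, leaving the order untouched."""
    return "".join(SIMPLE_UPPER.get(c, c) for c in s)

def nfd(s):
    return unicodedata.normalize("NFD", s)

x1 = "\u03B1\u033D\u0345"
x2 = "\u03B1\u0345\u033D"

# The inputs are canonically equivalent: they share one NFD form.
inputs_equivalent = nfd(x1) == nfd(x2)

# The mapped outputs are not: U+0399 is a base letter, so it is never reordered.
y1 = upper_per_code_point(x1)
y2 = upper_per_code_point(x2)
outputs_equivalent = nfd(y1) == nfd(y2)

# Normalising before mapping gives one output for both inputs.
z1 = upper_per_code_point(nfd(x1))
z2 = upper_per_code_point(nfd(x2))

print(inputs_equivalent, outputs_equivalent, z1 == z2)
```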

You obviously know a bit more about Unicode than I do. I'm not
familiar with those glyphs or what the output should be, or even why
my conversion table creation code didn't create the correct table to
handle this case.

Can you explain maybe in another way what was wrong with the result
of my code?

>> If you could explain Normalisation to me in 2 paragraphs, maybe
>> I'll understand you better :)
>
> Tricky if all you say is, 'I don't understand'. I had a go on
> Monday 29 May, but it took 4 paragraphs. Do you understand Normal
> Form D? That's the simplest normalisation.

Nope. I don't understand NFD yet either :) All I know is that
combining characters may cause problems for search functions, and
that some reordering is necessary to fix this. But exactly what kind
of reordering I do not know.
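
For reference, the reordering in question is canonical ordering: within each run of combining marks, the marks are stably sorted by their canonical combining class, and characters of class 0 are never moved. The sketch below covers only that step, for input that is already decomposed. It uses Python's `unicodedata.combining` to look up the class, and `canonical_order` is an illustrative name, not a library function.

```python
import unicodedata

def canonical_order(s):
    """Stably sort each run of combining marks by canonical combining class.

    Covers only the reordering step of NFD; it does not decompose
    precomposed characters.
    """
    out = []
    run = []  # combining marks seen since the last class-0 character
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

text = "\u03C9\u0345\u0313\u0342"
classes = [unicodedata.combining(c) for c in text]
reordered = canonical_order(text)

print(classes)
print([f"U+{ord(c):04X}" for c in reordered])
```

Here U+0345 has class 240 while U+0313 and U+0342 both have class 230, so the iota subscript moves to the end and the other two marks keep their relative order.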

>> So far my UTF-8 uppercaser/lowercaser is doing quite well eh? And
>> the best thing is, it's Unicode blind. It's only byte aware.

> Vanilla uppercasing and lowercasing is mostly simple. The
> exceptions are Greek (all locales) and the Lithuanian, Turkish and
> Azerbaijani locales. These exceptions are where slight knowledge of
> the semantics comes in.

Ahhhh, locales? Well I'm not using any locale right now, I'm just
processing UnicodeData.txt and a few other files, to create a
conversion table. This is a one-off preprocessing step that results
in an 8K or so file.
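
A minimal sketch of such a preprocessing step, under stated assumptions: it reads lines in the semicolon-separated format of UnicodeData.txt, takes the simple uppercase mapping from field 12 (counting from 0), and stores each mapping as UTF-8 bytes. The in-memory dictionary, the name `build_upper_table`, and the four sample lines are illustrative and say nothing about the actual layout of the 8K file.

```python
# A few sample lines in the format of UnicodeData.txt.
SAMPLE_LINES = """\
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0345;COMBINING GREEK YPOGEGRAMMENI;Mn;240;NSM;;;;;N;GREEK NON-SPACING IOTA BELOW;;0399;;0399
03B1;GREEK SMALL LETTER ALPHA;Ll;0;L;;;;;N;;;0391;;0391
03C9;GREEK SMALL LETTER OMEGA;Ll;0;L;;;;;N;;;03A9;;03A9
"""

def build_upper_table(lines):
    """Return {utf8_bytes_of_source: utf8_bytes_of_simple_uppercase}."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        # Skip malformed lines and characters with no simple uppercase mapping.
        if len(fields) < 15 or not fields[12]:
            continue
        src = chr(int(fields[0], 16)).encode("utf-8")
        dst = chr(int(fields[12], 16)).encode("utf-8")
        table[src] = dst
    return table

upper_table = build_upper_table(SAMPLE_LINES.splitlines())
for src, dst in upper_table.items():
    print(src.hex(), "->", dst.hex())
```

A table built this way knows nothing about combining classes or multi-character special cases, which is why it maps U+0345 to U+0399 in place.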

This 8K file (the conversion table) is fed into my parallel string
replacement code.
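
The replacement code itself is not shown in this message, so the following is only a sketch of the general idea: a single left-to-right pass over the bytes that tries the longest table key at each position and copies unmatched bytes through. The function name `replace_bytes` and the two-entry table are hypothetical.

```python
def replace_bytes(data, table):
    """Replace every occurrence of a table key in one pass, longest key first."""
    max_len = max((len(k) for k in table), default=0)
    out = bytearray()
    i = 0
    while i < len(data):
        # Try the longest candidate first so longer keys win over their prefixes.
        for n in range(min(max_len, len(data) - i), 0, -1):
            repl = table.get(data[i:i + n])
            if repl is not None:
                out += repl
                i += n
                break
        else:
            # No key starts here: copy the byte unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)

# Illustrative table holding only the two mappings needed for the example.
table = {
    "\u03C9".encode("utf-8"): "\u03A9".encode("utf-8"),
    "\u0345".encode("utf-8"): "\u0399".encode("utf-8"),
}

source = "\u03C9\u0345\u0313\u0342".encode("utf-8")
result = replace_bytes(source, table)
print([f"U+{ord(c):04X}" for c in result.decode("utf-8")])
```

Because every key is a complete UTF-8 sequence beginning with a lead byte, a match cannot start in the middle of another character. On the input from this thread the sketch yields <U+03A9, U+0399, U+0313, U+0342>, the same sequence reported earlier in this message.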


This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 13:14:37 CDT