Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Wed Jun 14 2006 - 12:39:08 CDT

Next message: Mark Davis: "Re: Tentative Definition of Casefolding"

Previous message: Kenneth Whistler: "Re: are Unicode codes somehow specified in official national linguistic literature ? (worldwide)"
In reply to: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Reply: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Theodore H. Smith wrote on Wednesday, June 14, 2006 at 11:27 AM

> <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308, U+0F73> ( ῴཱི̂̇̈ ) will
> uppercase to: U+03A9, U+0F71, U+0F72, U+0301, U+0302, U+0307, U+0308,
> U+0399 ( Ώཱི̂̇̈Ι ) now

Well done!
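
The sequence quoted above can be reproduced with a short sketch. It assumes Python's standard unicodedata module and the built-in str.upper(), neither of which is the code under discussion, and it normalizes to NFD before uppercasing so that the iota ends up after the other marks. The helper name codepoints is made up for illustration.

```python
import unicodedata

def codepoints(s):
    """Format a string as a list of U+XXXX labels (illustrative helper)."""
    return ["U+%04X" % ord(c) for c in s]

# The lowercase input from the quoted message.
src = "\u03c9\u0345\u0301\u0302\u0307\u0308\u0f73"

# Normalize first, then apply Python's untailored uppercasing.
nfd = unicodedata.normalize("NFD", src)
upper = nfd.upper()

print(codepoints(upper))
```

Uppercasing without the NFD step gives a different order, because U+0399 is a starter and would sit directly after U+03A9.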

> Funnily enough... when I do an NFD on ( ῴཱི̂̇̈ ), it ends up looking
> like a different character: ( ῴཱི̂̇̈ ). I'm not sure why it should look
> different. Either a bug in my code, or perhaps my OS is using an older
> version of Unicode?

No, it's just a rendering failure. In the first case the fonts used do not
have any data for U+0F73; after NFD they do not have any data for the two
characters to which it decomposes, U+0F71 and U+0F72.
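
As a quick check of the decomposition mentioned above, the following sketch queries Python's unicodedata module (an assumption; any Unicode character database would do) for the mapping of U+0F73 and the canonical combining classes involved.

```python
import unicodedata

# Canonical decomposition mapping of U+0F73, as a string of hex code points.
mapping = unicodedata.decomposition("\u0f73")

# U+0F73 itself has combining class 0, but its two pieces do not,
# so they can be reordered against neighbouring marks after decomposition.
classes = {
    "U+0F73": unicodedata.combining("\u0f73"),
    "U+0F71": unicodedata.combining("\u0f71"),
    "U+0F72": unicodedata.combining("\u0f72"),
}

print(mapping)
print(classes)
```

Since both forms are canonically equivalent, any visible difference between them comes from the font or renderer, not from the text.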

"Is this single pass, or multi-pass? I think it has to be multi-pass. And,
to transform to NFD, it needs, for Unicode 4.1.0, 55,903 codepoint swaps to
be stored in the data table."

> While my uppercaser does do stuff using a single-pass replacement...
> my combining reorder does not use a "parallel replace all".

> It does a "parallel search", and then uses single-pass code, specific to
> combiner reordering, on the items found in the search. It can in
> theory also reorder byte-wise combining characters, given a correct
> data-set, but I don't know if a character set even exists whose
> characters take up one byte and which has combining characters!!

"However, I believe you are having to resort to multiple passes
because you don't store canonical combining class."

> I am storing the class, in a 1 byte long string :) Well, to do NFD is two
> pass, because there is the decomposition pass, and then the reordering
> pass.

> The re-ordering pass, is not using my "multiple-replaceall" algorithm. It
> does use the canonical combining classes. A multi-pass approach, while
> possible... I wouldn't do, it would take too long.

And this was the basis of the claim that you couldn't just treat characters
as 'bags of bytes'!
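
The two passes described above (decompose, then reorder by canonical combining class) can be sketched as follows. This is a minimal illustration built on Python's unicodedata module, not the implementation discussed in this thread; the function names are made up, and Hangul syllables, which decompose algorithmically, are deliberately left out.

```python
import unicodedata

def decompose(text):
    """Pass 1: recursively apply canonical decomposition mappings."""
    out = []
    for ch in text:
        mapping = unicodedata.decomposition(ch)
        # Mappings starting with "<" are compatibility mappings; NFD skips them.
        if mapping and not mapping.startswith("<"):
            parts = "".join(chr(int(cp, 16)) for cp in mapping.split())
            out.extend(decompose(parts))
        else:
            out.append(ch)
    return out

def reorder(chars):
    """Pass 2: stable-sort each run of non-starters by combining class."""
    result = []
    run = []
    for ch in chars:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            result.extend(sorted(run, key=unicodedata.combining))
            run = []
            result.append(ch)
        else:
            run.append(ch)
    result.extend(sorted(run, key=unicodedata.combining))
    return "".join(result)

def nfd_two_pass(text):
    """Hypothetical helper: NFD for non-Hangul text in two passes."""
    return reorder(decompose(text))

sample = "\u03c9\u0345\u0301\u0302\u0307\u0308\u0f73"
print(nfd_two_pass(sample) == unicodedata.normalize("NFD", sample))
```

The reorder pass needs the combining class of each character, which is why it has to work on decoded code points rather than on raw bytes.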

> I hope I don't have to do a NFD after the uppercase?

I think you don't, but you would after a Turkic uppercasing (<U+0069,
U+0331> to <U+0049, U+0331, U+0307> - not normal text) or a Lithuanian
lowercasing (<U+0049, U+0328, U+0301> to <U+0069, U+0328, U+0307, U+0301> -
very plain text!) if you wanted the output to be in NFD.
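
The Lithuanian case can be checked with a small sketch. Python's str.lower() is not locale-sensitive, so the tailored result is written out by hand here, on the assumption that the tailoring replaces U+0049 with <U+0069, U+0307> in place; only the normalization step is computed, using the standard unicodedata module.

```python
import unicodedata

# Assumed raw output of the Lithuanian-tailored lowercasing of
# <U+0049, U+0328, U+0301>: the dot above is inserted right after the i.
lowered = "\u0069\u0307\u0328\u0301"

# NFD moves U+0328 (class 202) ahead of U+0307 and U+0301 (both class 230).
renormalized = unicodedata.normalize("NFD", lowered)

print(["U+%04X" % ord(c) for c in renormalized])
print(lowered == renormalized)  # False: the raw output was not in NFD
```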

Richard.


This archive was generated by hypermail 2.1.5 : Wed Jun 14 2006 - 13:59:46 CDT