Re: Fwd: Wired 4.09 p. 130: Lost in Translation

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Wed Aug 28 1996 - 10:30:18 EDT


Michael Everson wrote:

>"Decomposed" Latin characters are an example of variable-length encoding.
>In Implementation Level 1 of 10646, restricting oneself to the BMP, all
>characters used in text are 16-bit characters, so A and A WITH ACUTE are
>the same. When combining characters are used in Implementation Level 3,
>characters' identities cannot be trusted, because A might be A or it might
>be A WITH ACUTE, or it might be A WITH ACUTE AND DOT BELOW, or it might be
>A WITH ACUTE AND DIAERESIS AND DOT BELOW.... is there a limit?
>
>Unicode and 10646 are not the same in this regard, in that Unicode assumes
>Level 3 all the time. But it seems that it makes software more complex,
>precisely because you don't know when A is A and when it is something else,
>unless your software keeps checking ahead, and ahead, and ahead until it
>finds something that's not combining.

Please be careful. To know whether an A is just an A, you only have
to check the next position. If that next position is not a combining
character, you know it is an A; if it is a combining character, you
know it is "something else".
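
As a rough illustration of that one-position check, here is a minimal
sketch in Python. The function name is_bare_letter is made up for this
example, and treating "combining character" as "general category starts
with M" is an assumption of the sketch, not a definition taken from
10646 or Unicode.

```python
import unicodedata

def is_bare_letter(text, i):
    """Return True if text[i] is not followed by a combining character.

    Assumption: a "combining character" is approximated here as any
    code point whose general category is a mark (Mn, Mc, or Me).
    """
    nxt = text[i + 1:i + 2]  # empty string when text[i] is the last position
    if not nxt:
        return True  # nothing follows, so the letter stands alone
    return not unicodedata.category(nxt).startswith("M")
```

For "A" followed by U+0301 COMBINING ACUTE ACCENT the function returns
False after looking at only one position. Further look-ahead is needed
only to find out which "something else" it is, not whether it is one.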

Regards, Martin.


