Re: Fwd: Wired 4.09 p. 130: Lost in Translation

From: Michael Everson (everson@indigo.ie)
Date: Wed Aug 28 1996 - 06:25:08 EDT


At 11:39 1996-08-27, David Beroff wrote:

>Subject: Wired 4.09 p. 130: Lost in Translation
>From: David Beroff, d4b@inet.bis.adp.com
>
>Interesting 16-bit vs. 32-bit issue for
>characters. (I guess nobody seriously
>considered 24-bit characters?)
>
>Anyway, I have an even more radical idea.
>Could Unicode support variable-length
>characters, so that one or more Unicode
>values would mean "shift"? This would
>allow quite a number of Chinese (etc.)
>characters to be represented in the
>second Unicode byte-pair.
>
>Or am I being way too whimsical?
>
>-- David Beroff <d4b@bis.adp.com>

"Decomposed" Latin characters are an example of variable-length encoding.
In Implementation Level 1 of 10646, restricting oneself to the BMP, all
characters used in text are 16-bit characters, so A and A WITH ACUTE are
the same length: one 16-bit value each. When combining characters are used
in Implementation Level 3, characters' identities cannot be trusted, because
A might be A or it might be A WITH ACUTE, or it might be A WITH ACUTE AND
DOT BELOW, or it might be A WITH ACUTE AND DIAERESIS AND DOT BELOW... Is
there a limit?
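
To make the length difference concrete, here is a small sketch in Python.
It assumes the standard unicodedata module and its NFD/NFC normalization,
which are a modern convenience and not something this message relies on;
the variable names are only illustrative.

```python
import unicodedata

# Precomposed form: a single code point, U+00C1 (A WITH ACUTE).
precomposed = "\u00C1"

# Decomposed form: base letter A followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize("NFD", precomposed)

# A base letter carrying three combining marks:
# acute (U+0301), diaeresis (U+0308), dot below (U+0323).
stacked = "A\u0301\u0308\u0323"

print(len(precomposed))  # number of code points in the precomposed form
print(len(decomposed))   # number of code points in the decomposed form
print(len(stacked))      # one base letter plus three combining marks
```

The same visible letter occupies one value in the first form and two in the
second, and nothing in the encoding itself caps how many marks may follow.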

Unicode and 10646 are not the same in this regard, in that Unicode assumes
Level 3 all the time. But Level 3 seems to make software more complex,
precisely because you don't know when A is A and when it is something else,
unless your software keeps checking ahead, and ahead, and ahead until it
finds something that's not combining.

Which it would have to do even for English and kiSwahili!
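
The look-ahead described above might be sketched like this in Python. The
function name split_combining_sequences is hypothetical, and the test for
"combining" is an assumption: it uses unicodedata.combining(), which reports
the canonical combining class, so marks whose class is 0 are not caught.

```python
import unicodedata

def split_combining_sequences(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper for illustration only: a character counts as
    'combining' here when its canonical combining class is nonzero.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Still inside the current sequence: attach the mark to its base.
            clusters[-1] += ch
        else:
            # Found something that's not combining: start a new sequence.
            clusters.append(ch)
    return clusters

# Plain English text has no combining marks, yet every character is checked.
print(split_combining_sequences("cat"))
# A base letter with acute and dot below, followed by a plain letter.
print(split_combining_sequences("A\u0301\u0323b"))
```

Even for the all-ASCII input, the loop has to inspect each following
character before it can decide that the previous one is complete.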

--
Michael Everson, Everson Gunn Teoranta
15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire (Ireland)
Gutháin:  +353 1 478-2597, +353 1 283-9396
http://www.indigo.ie/egt
27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire


