Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Tue Feb 01 2000 - 03:14:28 EST


>Dan Oscarsson wrote:
>
>> Yes, UTF-16 was done right. Unfortunately UTF-8 was done wrongly. UTF-8
>> should just like UTF-16 is compatible with code in the 16-bit space,
>> been compatible with the first characters of 8 bits.
>
>How? An 8-bit code compatible with UTF-16 in its first 8 bits has
>no space left to represent the other 109744 codepoints. Unlike the
>16-bit codespace from 0 to FFFF, the 8-bit codespace from 0 to FF is
>densely packed with characters.
>

My text was perhaps unclear. UTF-8 should represent the characters
of UCS in the code range 0-255 as themselves, just as UTF-16 does
for UCS in the 16-bit range.
There are two ranges of control codes in the first 256 code points
(C0 and C1), and one of them is hardly used, so its codes could have
been used to make this work.
But it is too late to fix that now.
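
To make the complaint concrete, here is a small Python sketch using only
the standard codecs. It does not implement any alternative encoding; it
only shows how UTF-8 as it stands treats a character in the range 128-255,
compared with ISO 8859-1 and UTF-16. The choice of U+00E9 as the sample
character is an arbitrary assumption for illustration.

```python
# Sample character from the 128-255 range: U+00E9 (e with acute accent).
ch = "\u00e9"

latin1 = ch.encode("latin-1")    # one byte, equal to the code point
utf8 = ch.encode("utf-8")        # two bytes, neither equals the code point
utf16 = ch.encode("utf-16-be")   # one 16-bit unit, equal to the code point

print(latin1.hex(), utf8.hex(), utf16.hex())

# The C1 control range mentioned above is 0x80-0x9F (32 code points).
c1_controls = range(0x80, 0xA0)
```

So UTF-16 keeps every code point below 0x10000 (outside the surrogate
range) as a single unit with the same value, while UTF-8 only does so
for 0-127.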

But I can see no reason to restrict UTF-8 to less than the full 31 bits
of ISO 10646 just because of Unicode's beloved UTF-16. I have
no need for UTF-16. UTF-8, UCS-1, UCS-2 and UCS-4 will do fine.
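
For reference, a minimal sketch of the full 31-bit UTF-8 scheme (up to
six bytes per character, as described in RFC 2279), set against the
highest code point UTF-16 can reach with surrogate pairs. The function
name utf8_encode_31bit is hypothetical and only for illustration; it does
not reject surrogates or otherwise validate its input beyond the range.

```python
UTF16_MAX = 0x10FFFF       # highest code point reachable via surrogate pairs
UCS4_MAX = 0x7FFFFFFF      # highest code point in the 31-bit ISO 10646 space

# (maximum payload bits, total length in bytes, lead-byte marker)
_FORMS = ((11, 2, 0xC0), (16, 3, 0xE0), (21, 4, 0xF0),
          (26, 5, 0xF8), (31, 6, 0xFC))


def utf8_encode_31bit(cp: int) -> bytes:
    """Encode one code point using the original 1-6 byte UTF-8 layout."""
    if not 0 <= cp <= UCS4_MAX:
        raise ValueError("code point outside the 31-bit range")
    if cp < 0x80:
        return bytes([cp])             # 0-127 are stored as themselves
    for bits, length, lead in _FORMS:
        if cp < (1 << bits):
            tail = []
            for _ in range(length - 1):
                tail.append(0x80 | (cp & 0x3F))   # 6 payload bits per byte
                cp >>= 6
            return bytes([lead | cp] + tail[::-1])
    raise AssertionError("unreachable")


print(utf8_encode_31bit(UTF16_MAX).hex())   # four bytes
print(utf8_encode_31bit(UCS4_MAX).hex())    # six bytes
```

Anything above 0x10FFFF needs the five- and six-byte forms, which are
exactly the ones lost if UTF-8 is cut down to what UTF-16 can address.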

   Dan


