Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Dan Oscarsson (
Date: Mon Jan 31 2000 - 03:24:14 EST

>The reason Unicode had to grow was that there turn out to be more than 2^16
>characters to encode. By adding 15 additional 16-bit planes, there is more
>than enough space to encode everything that wouldn't fit into the BMP.. and
>room left for some fantasy scripts to fill our idle hours [Cirth, anyone?].
>ISO 10646 has agreed, I thought, to follow Unicode's restriction and
>promised, I thought, not to encode anything "out of bounds".
>The reason for the restriction was the expansion mechanism chosen for
>traditional 16-bit Unicode (UCS-2), which is UTF-16.
>The alternative was shift states and the re-creation of the whole multibyte
>world. Yuck.

Yes, UTF-16 was done right. Unfortunately, UTF-8 was done wrongly: just as
UTF-16 is compatible with the codes in the 16-bit space, UTF-8 should have
been compatible with the first characters of the 8-bit space.
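The incompatibility being complained about is easy to demonstrate (this example is an editorial addition, not from the original message): in Latin-1, every character up to 0xFF is a single byte, but UTF-8 re-encodes everything above 0x7F as two bytes, so existing 8-bit text is not valid UTF-8 as-is:

```python
# U+00E9 (e with acute) as a single Latin-1 byte vs. two UTF-8 bytes
print('\u00e9'.encode('latin-1').hex())  # e9
print('\u00e9'.encode('utf-8').hex())    # c3a9
```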

>So: since Unicode has adopted an expansion mechanism that allows only 10FFFF
>characters and since there will never, ever, be any data encoded outside
>that range (we have all been assured), it is IMHO a good idea to reflect
>that fact in your UTF-8 implementation. It is too late to levitate out of
>the corner we are painted into. Building systems that prevent improper usage
>is a good data-quality check.
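The data-quality check suggested in the quote means a UTF-8 decoder that rejects sequences encoding code points beyond 0x10FFFF. As an illustration (an editorial addition; Python's built-in codec happens to enforce this restriction already):

```python
# Four bytes that would decode to U+110000, one past the UTF-16 limit
data = b'\xf4\x90\x80\x80'
try:
    data.decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e.reason)
```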

Just because Unicode has decided to use UTF-16 for its 16-bit mode
does not guarantee that the range will never be expanded (though then it might
not be called Unicode); ISO 10646 need not keep this restriction forever.
So plan for a possible future.
Since I, as a programmer, will handle my characters in 8-, 16- or 32-bit
words, I can see no reason to place such a restriction on UTF-8.
If I use 16-bit words I will only use that range. I will never use UTF-16
inside a program.
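The unrestricted UTF-8 being argued for here is the original ISO 10646 form, which could encode any 31-bit value using up to 6 bytes. A minimal sketch of that scheme (an editorial addition, not the author's code), showing that restricting the range is a policy choice rather than a limit of the encoding itself:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a code point with original (pre-restriction) UTF-8, up to 31 bits."""
    if cp < 0x80:
        return bytes([cp])            # ASCII passes through as one byte
    # upper limits and lead-byte markers for the 2- to 6-byte forms
    forms = [(2, 0x800, 0xC0), (3, 0x10000, 0xE0), (4, 0x200000, 0xF0),
             (5, 0x4000000, 0xF8), (6, 0x80000000, 0xFC)]
    for nbytes, limit, lead in forms:
        if cp < limit:
            out = []
            for _ in range(nbytes - 1):
                out.append(0x80 | (cp & 0x3F))   # continuation byte: 6 bits
                cp >>= 6
            out.append(lead | cp)                # lead byte carries the rest
            return bytes(reversed(out))
    raise ValueError("beyond 31 bits")

print(utf8_encode(0x7FFFFFFF).hex())  # fdbfbfbfbfbf
```

For code points up to 0x10FFFF this produces the same bytes as restricted UTF-8; the 5- and 6-byte forms are the part that the restriction forbids.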


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:58 EDT