Fw: 8-bit text which is supposed to be UTF-8 but isn't

From: Addison Phillips [GSC] (addison@globalsight.com)
Date: Sun Jan 30 2000 - 14:16:47 EST

Next message: Roozbeh Pournader: "[omega] Annoucement of Omega 1.10 (fwd)"
Previous message: Doug Ewell: "Re: 8-bit text which is supposed to be UTF-8 but isn't"
Maybe in reply to: Erland Sommarskog: "8-bit text which is supposed to be UTF-8 but isn't"
Next in thread: John Cowan: "Re: 8-bit text which is supposed to be UTF-8 but isn't"

> ISO 10646 is 31 bits. All possible values should be allowed.
> I do not know why Unicode have decided to grow their bits to
> more than 16 bits, but not to all 31 bits of ISO 10646.
> But that is no reason to not allow full 31 bits in UTF-8 encoded
> text.

The reason Unicode had to grow was that there turned out to be more than 2^16
characters to encode. By adding 16 additional planes of 2^16 code points each,
there is more than enough space to encode everything that wouldn't fit into
the BMP, and room left for some fantasy scripts to fill our idle hours
[Cirth, anyone?].

ISO 10646 has agreed, I thought, to follow Unicode's restriction and
promised, I thought, not to encode anything "out of bounds".

The reason for the restriction was the expansion mechanism chosen for
traditional 16-bit Unicode, which is surrogate pairs: special 16-bit values
in the BMP that, taken two at a time, represent characters in the upper
planes. Unlike many "stateful" multi-byte character sets from the past,
Unicode did programmers everywhere a huge favor. There is a restricted range
of lead-characters (character in the Unicode sense of a two-octet, 16-bit
character) and a separate restricted range of trailing characters in a
surrogate pair. A lead-character can never be anything BUT a lead-character.
A trail-character can never be anything but a trail-character. This preserves
the extremely critical Unicode premise that if you see a character value
then that *is* the character. It may be combined with other characters, but
it is never, ever, anything else.

The alternative was shift states and the re-creation of the whole multibyte
world. Yuck.
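
To make the lead/trail ranges concrete, here is a minimal sketch in Python.
The function names (to_surrogates, from_surrogates, classify) are my own,
made up for illustration; the ranges D800..DBFF and DC00..DFFF and the
arithmetic are the standard surrogate scheme. Note that 1024 leads times
1024 trails gives exactly 16 planes of 2^16, which is where 10FFFF comes
from.

```python
LEAD_START, LEAD_END = 0xD800, 0xDBFF    # lead (high) surrogate range
TRAIL_START, TRAIL_END = 0xDC00, 0xDFFF  # trail (low) surrogate range


def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into a (lead, trail) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not representable as a surrogate pair: %#x" % cp)
    v = cp - 0x10000  # 20-bit offset above the BMP
    return LEAD_START + (v >> 10), TRAIL_START + (v & 0x3FF)


def from_surrogates(lead, trail):
    """Combine a (lead, trail) pair back into a single code point."""
    if not LEAD_START <= lead <= LEAD_END:
        raise ValueError("not a lead surrogate: %#x" % lead)
    if not TRAIL_START <= trail <= TRAIL_END:
        raise ValueError("not a trail surrogate: %#x" % trail)
    return 0x10000 + ((lead - LEAD_START) << 10) + (trail - TRAIL_START)


def classify(unit):
    """Classify one 16-bit unit by value alone; no shift state is needed."""
    if LEAD_START <= unit <= LEAD_END:
        return "lead"
    if TRAIL_START <= unit <= TRAIL_END:
        return "trail"
    return "bmp"
```

Because the two ranges do not overlap each other or any ordinary BMP
character, classify() can be called on any unit in the middle of a buffer
and still give the right answer, which is the whole point.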

So: since Unicode has adopted an expansion mechanism that reaches only as far
as 10FFFF and since there will never, ever, be any data encoded outside
that range (we have all been assured), it is IMHO a good idea to reflect
that fact in your UTF-8 implementation. It is too late to levitate out of
the corner we are painted into. A system that sees data outside the legal
range may be dealing with a different encoding or with binary trash and
should do "something intelligent" (other than reporting that this is valid
UTF-8).
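
As a rough sketch of what "reflecting that fact" might look like, here is a
small Python decoder that refuses anything beyond 10FFFF. The name
decode_utf8_strict is hypothetical, and raising ValueError is only one
possible "something intelligent"; a real system might prefer to substitute
a replacement character or try another encoding. The extra checks for
overlong forms and encoded surrogates are my assumption about what a strict
decoder would also want, not something argued above.

```python
def decode_utf8_strict(data):
    """Decode bytes into a list of code points, rejecting values > U+10FFFF."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        # Per lead byte: payload bits, number of trail bytes, and the
        # smallest code point that legitimately needs that length.
        if b0 < 0x80:
            cp, need, minimum = b0, 0, 0
        elif 0xC0 <= b0 <= 0xDF:
            cp, need, minimum = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, minimum = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            cp, need, minimum = b0 & 0x07, 3, 0x10000
        else:
            # 0x80..0xBF is a stray trail byte; 0xF8..0xFF would start the
            # 5- and 6-byte forms of the 31-bit scheme, always out of range.
            raise ValueError("invalid lead byte %#x at offset %d" % (b0, i))
        tail = data[i + 1:i + 1 + need]
        if len(tail) != need:
            raise ValueError("truncated sequence at offset %d" % i)
        for t in tail:
            if t & 0xC0 != 0x80:
                raise ValueError("bad trail byte %#x after offset %d" % (t, i))
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:
            raise ValueError("overlong encoding at offset %d" % i)
        if cp > 0x10FFFF:
            raise ValueError("code point %#x is beyond 10FFFF" % cp)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("encoded surrogate %#x at offset %d" % (cp, i))
        out.append(cp)
        i += 1 + need
    return out
```

For example, the bytes F4 8F BF BF decode to 10FFFF, while F4 90 80 80
(which would be 110000) and anything starting with F8 are rejected rather
than reported as valid UTF-8.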

thanks,

Addison

Addison Phillips
Sr. Globalization Consultant
GlobalSight Corporation


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:58 EDT