Re: 5 & 6 byte UTF-8 encodings?

From: Mark Davis (mark@macchiato.com)
Date: Wed Aug 18 1999 - 10:16:39 EDT


5- and 6-byte forms of UTF-8 are not valid representations of Unicode. They are
valid representations of ISO/IEC 10646; however:

- the only assigned characters using those forms in 10646 are private use
characters, but those should never be used since they will not interoperate
with Unicode implementations. There is a large collection (128K) of other
private use characters that are in the 1-4 byte forms of UTF-8 that should be
used instead.

- the Unicode Consortium and SC2/WG2 do not ever expect to assign other
characters that need the 5 and 6 byte forms.
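
To make the distinction concrete, here is a rough Python sketch, not
anything taken from either standard's text. It classifies a lead byte by
the original 1-6 byte bit patterns and treats anything longer than 4 bytes
as outside Unicode. The function names are made up for illustration, and
the sketch deliberately ignores overlong forms and other well-formedness
checks. It also shows that Python's built-in decoder rejects a 5-byte
sequence, while private use characters in planes 15 and 16 fit in 4 bytes.

```python
def lead_byte_length(lead: int) -> int:
    """Sequence length implied by a lead byte in the 1-6 byte scheme."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx
    if 0xC0 <= lead <= 0xDF:
        return 2                      # 110xxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3                      # 1110xxxx
    if 0xF0 <= lead <= 0xF7:
        return 4                      # 11110xxx
    if 0xF8 <= lead <= 0xFB:
        return 5                      # 111110xx (10646 form only)
    if 0xFC <= lead <= 0xFD:
        return 6                      # 1111110x (10646 form only)
    # 0x80-0xBF are continuation bytes; 0xFE and 0xFF never appear.
    raise ValueError("not a lead byte: 0x%02X" % lead)


def is_unicode_length(lead: int) -> bool:
    """True if the lead byte starts a sequence of at most 4 bytes."""
    return lead_byte_length(lead) <= 4


def python_accepts(data: bytes) -> bool:
    """True if Python's own UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Smallest value that needs the 5-byte form: 0x200000.
FIVE_BYTE = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# First and last private use code points in planes 15 and 16.
PRIVATE_USE_SAMPLES = [0xF0000, 0xFFFFD, 0x100000, 0x10FFFD]

if __name__ == "__main__":
    print(is_unicode_length(FIVE_BYTE[0]))      # lead byte 0xF8
    print(python_accepts(FIVE_BYTE))
    for cp in PRIVATE_USE_SAMPLES:
        print(hex(cp), len(chr(cp).encode("utf-8")))
```

A real validator would also have to reject overlong encodings, surrogate
code points, and values above 10FFFF; the length check above is only the
part relevant to this question.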

While this situation would be cleaner if this limit were formalized in 10646
and the private use characters requiring 5 and 6 byte forms were deprecated,
that is politically infeasible.

Does this help?

Mark

"O'Leary, Sean (NJ)" wrote:

> OK, I'm confused. My reading of the UTF-8 spec leads me to believe that
> UTF-8 characters are encoded in a maximum of 4 bytes. Characters
> from planes 0x1 through 0xF should always be handled as surrogates.
>
> Yet, I've seen UTF-8 explanations that show planes 0x1 through 0xF encoded
> as 5 & 6 byte sequences.
>
> Are these 5 & 6 byte encodings valid UTF-8? ...or... do they fall under
> the category of "Be generous in what you accept."?
>
> Sean O'Leary
> oleary@awii.com
> Automated Wagering International
> 973-594-5077


