The 5 and 6 byte forms of UTF-8 are not valid representations of Unicode. They are
valid representations of ISO/IEC 10646; however:
- the only assigned characters using those forms in 10646 are private use
characters, but those should never be used since they will not interoperate
with Unicode implementations. There is a large collection (128K) of other
private use characters in the 1-4 byte forms of UTF-8 that should be
used instead.
- the Unicode Consortium and SC2/WG2 do not expect ever to assign other
characters that need the 5 and 6 byte forms.
While this situation would be cleaner if this limit were formalized in 10646
and the private use characters requiring 5 and 6 byte forms were deprecated,
that is politically infeasible.
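
To make the byte-length boundaries concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a conformance checker: the names MAX_BY_LENGTH, utf8_length and fits_unicode are hypothetical, the per-length limits are those of the original six-byte UTF-8 definition, and 0x10FFFF is taken as the top of the Unicode range. Surrogates and overlong forms are deliberately not considered.

```python
# Illustrative sketch only; all names here are hypothetical.

# Largest code point representable in 1..6 bytes under the original
# (ISO/IEC 10646) six-byte definition of UTF-8.
MAX_BY_LENGTH = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]

UNICODE_MAX = 0x10FFFF  # highest Unicode code point


def utf8_length(code_point: int) -> int:
    """Return how many bytes the original six-byte UTF-8 scheme needs."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for length, limit in enumerate(MAX_BY_LENGTH, start=1):
        if code_point <= limit:
            return length
    raise ValueError("code point exceeds the 31-bit range of ISO/IEC 10646")


def fits_unicode(code_point: int) -> bool:
    """True if the code point lies in the Unicode range 0..0x10FFFF.

    This is a range check only; it does not exclude surrogates.
    """
    return 0 <= code_point <= UNICODE_MAX


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0xF0000, 0x10FFFF, 0x200000, 0x7FFFFFFF):
        print(f"0x{cp:X}: {utf8_length(cp)} byte(s), "
              f"within Unicode: {fits_unicode(cp)}")
```

Every code point up to 0x10FFFF comes out at 4 bytes or fewer, so the 5 and 6 byte forms are only reached by values outside the Unicode range.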
Does this help?
"O'Leary, Sean (NJ)" wrote:
> OK, I'm confused. My reading of the UTF-8 spec leads me to believe that
> UTF-8 characters are encoded in a maximum of 4 bytes. Characters
> from planes 0x1 through 0xF should always be handled as surrogates.
> Yet, I've seen UTF-8 explanations that show planes 0x1 through 0xF encoded
> as 5 & 6 byte sequences.
> Are these 5 & 6 bytes encodings valid UTF-8? ...or... do they fall under
> the category of "Be generous in what you accept."?
> Sean O'Leary
> Automated Wagering International
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT