Re: 5 & 6 byte UTF-8 encodings?

From: Mark Davis (mark@macchiato.com)
Date: Wed Aug 18 1999 - 10:16:39 EDT

Next message: John Cowan: "Re: Last Call: UTF-16"
Previous message: Mark Davis: "Re: Normalization Form KC for Linux"
Maybe in reply to: O'Leary, Sean (NJ): "5 & 6 byte UTF-8 encodings?"
Next in thread: Markus Kuhn: "Re: 5 & 6 byte UTF-8 encodings?"

5 and 6 byte forms of UTF-8 are not valid representations of Unicode. They are
valid representations of ISO/IEC 10646; however:

- the only assigned characters using those forms in 10646 are private use
characters, but those should never be used since they will not interoperate
with Unicode implementations. There is a large collection (128K) of other
private use characters that are in the 1-4 byte forms of UTF-8 that should be
used instead.

- the Unicode Consortium and SC2/WG2 do not ever expect to assign other
characters that need the 5 and 6 byte forms.
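
To make the byte-length boundaries above concrete, here is a minimal Python
sketch. It is an illustration only, not part of the original message: the
function names are hypothetical, the length table follows the original
31-bit UTF-8 definition used with ISO/IEC 10646, and the Unicode ceiling of
0x10FFFF (the highest value reachable through surrogates) is assumed. The
range check ignores the surrogate code points themselves.

```python
# Highest code point reachable via UTF-16 surrogates (assumed Unicode limit).
UNICODE_MAX = 0x10FFFF

# Upper bound of each UTF-8 form in the original 31-bit scheme.
_FORM_LIMITS = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_point):
    """Return how many bytes the 31-bit UTF-8 scheme needs for code_point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for upper, length in _FORM_LIMITS:
        if code_point <= upper:
            return length
    raise ValueError("code point exceeds 31 bits")


def in_unicode_range(code_point):
    """True if code_point does not exceed the assumed Unicode ceiling."""
    return 0 <= code_point <= UNICODE_MAX


def needs_long_form(code_point):
    """True if code_point can only be written with a 5 or 6 byte form."""
    return utf8_length(code_point) >= 5
```

Under these assumptions every value up to 0x10FFFF fits in at most 4 bytes,
so a decoder that targets Unicode can simply reject any sequence whose lead
byte is in the range 0xF8 through 0xFD.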

While this situation would be cleaner if this limit were formalized in 10646
and the private use characters requiring 5 and 6 byte forms were deprecated,
that is politically infeasible.

Does this help?

Mark

"O'Leary, Sean (NJ)" wrote:

> OK, I'm confused. My reading of the UTF-8 spec leads me to believe that
> UTF-8 characters are encoded in a maximum of 4 bytes. Characters
> from planes 0x1 through 0xF should always be handled as surrogates.
>
> Yet, I've seen UTF-8 explanations that show planes 0x1 through 0xF encoded
> as 5 & 6 byte sequences.
>
> Are these 5 & 6 bytes encodings valid UTF-8? ...or... do they fall under
> the category of "Be generous in what you accept."?
>
> Sean O'Leary
> oleary@awii.com
> Automated Wagering International
> 973-594-5077


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:51 EDT