Re: 5 & 6 byte UTF-8 encodings?

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Wed Aug 18 1999 - 10:16:32 EDT


"O'Leary, Sean (NJ)" wrote on 1999-08-18 13:40 UTC:
> OK, I'm confused. My reading of the UTF-8 spec leads me to believe that
> UTF-8 characters are encoded in a maximum of 4 bytes. Characters
> from planes 0x1 through 0xF should always be handled as surrogates.

Make that planes 1 through 0x10.
 
> Yet, I've seen UTF-8 explanations that show planes 0x1 through 0xF encoded
> as 5 & 6 byte sequences.
>
> Are these 5 & 6 byte encodings valid UTF-8? ...or... do they fall under
> the category of "Be generous in what you accept."?

I hope this resolves your confusion:

BMP characters require 1-3 UTF-8 bytes. All UCS characters that UTF-16
can represent can be represented with a 1-4 UTF-8 byte sequence. UTF-8
text should not contain any surrogate character pair, but should instead
contain the corresponding higher-plane character encoded directly. 5-6
byte UTF-8 sequences are only necessary to encode UCS characters in
plane 0x20 or higher. At the moment, nobody expects these planes ever to
be used, as only planes 0x01-0x10 are addressable via surrogates, and
these leave more than enough space for any foreseen extension. Private
use of plane 0x11 or higher is therefore not recommended. Likewise, the
use of 5-6 byte UTF-8 sequences is also not recommended.
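
To make the byte counts above concrete, here is a minimal Python sketch.
It assumes the original 1-6 byte UTF-8 layout (as described in RFC 2279)
and the usual UTF-16 surrogate arithmetic. The function names
utf8_length, plane and combine_surrogates are made up for this
illustration and do not come from any library.

```python
# Illustrative sketch only: helper names are hypothetical.

def utf8_length(code_point: int) -> int:
    """Number of bytes needed under the original 1-6 byte UTF-8 scheme."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Exclusive upper bounds for 1, 2, 3, 4, 5 and 6 byte sequences.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point beyond the 31-bit UCS range")


def plane(code_point: int) -> int:
    """Each plane holds 0x10000 code points, so the plane is cp >> 16."""
    return code_point >> 16


def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it stands for."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # Last code point of the BMP, of plane 0x10, of plane 0x1F,
    # and the first code point of plane 0x20.
    for cp in (0xFFFF, 0x10FFFF, 0x1FFFFF, 0x200000):
        print(f"U+{cp:06X} plane 0x{plane(cp):02X} "
              f"needs {utf8_length(cp)} bytes")
```

The highest surrogate pair, D8FF... rather DBFF DFFF, combines to
U+10FFFF, the last code point of plane 0x10, which fits in 4 bytes. A
UTF-8 encoder should emit that single 4-byte sequence rather than
encoding each surrogate separately as a 3-byte sequence.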

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


