Re: Limitation of 0x10FFFF (about UTF-32)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 27 1999 - 15:50:58 EDT


Otto Stolz wrote:

>
> On 1999-7-26 at 7:06, Mark Davis wrote:
> > there is no public way to distinguish between Unicode's definition
> > of UTF-8 and ISO's. We'll have to think about this one.
>
> I think there should be no technical difference between those two
> definitions. So, rather than thinking about making the distinction
> public, you should think about reconciling the two definitions.
>

Unfortunately, for UTF-8, as for UTF-32, there *is* an important
difference that must be taken into account: the range of encoded
characters that can be used interoperably.

The basic range of relevance to the Unicode Standard is that which
is accessible via UTF-16, namely U-00000000..U-0010FFFF. A character
with a scalar value outside that range could not be represented in
the UTF-16 encoding form. While such a character *could* be
represented in the UTF-8 encoding form (or as UCS-4 directly), it
could not interoperate with a UTF-16 Unicode implementation. Since it
is extremely important for such implementations that the encoding
forms be losslessly interconvertible, the Unicode Standard explicitly
constrains the ranges interpreted in the other encoding forms.
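
To see where the U-0010FFFF ceiling comes from, here is a minimal
Python sketch of the surrogate-pair arithmetic. The function and
constant names are hypothetical, chosen only for illustration; the
arithmetic itself is the standard UTF-16 mapping.

```python
# Surrogate code unit ranges used by UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def to_utf16_units(scalar):
    """Return the list of UTF-16 code units for one scalar value."""
    if scalar < 0 or scalar > 0x10FFFF:
        raise ValueError("U-%08X is not representable in UTF-16" % scalar)
    if HIGH_START <= scalar <= LOW_END:
        raise ValueError("surrogate code points are not scalar values")
    if scalar < 0x10000:
        return [scalar]
    offset = scalar - 0x10000  # 20 bits, split 10 + 10
    return [HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)]


def from_surrogates(high, low):
    """Combine a high and a low surrogate into one scalar value."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest possible pair yields the upper bound of the range.
MAX_UTF16_SCALAR = from_surrogates(HIGH_END, LOW_END)
```

The largest pair, DBFF DFFF, maps to U-0010FFFF, so nothing above
that value has a UTF-16 form.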

For the Unicode Standard, the following UTF-8 ranges are meaningful:

U-00000000 .. U-0000007F    00 .. 7F
U-00000080 .. U-000007FF    C2 80 .. DF BF
U-00000800 .. U-0000D7FF    E0 A0 80 .. ED 9F BF
U-0000E000 .. U-0000FFFD    EE 80 80 .. EF BF BD
U-00010000 .. U-0010FFFF    F0 90 80 80 .. F4 8F BF BF
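
As a quick cross-check of these boundaries, the following sketch uses
Python's built-in UTF-8 codec, which follows the Unicode definition.
The helper name utf8_hex and the UNICODE_RANGES list are assumptions
made for this example.

```python
# Scalar value ranges from the table, as (first, last) pairs.
UNICODE_RANGES = [
    (0x00000000, 0x0000007F),
    (0x00000080, 0x000007FF),
    (0x00000800, 0x0000D7FF),
    (0x0000E000, 0x0000FFFD),
    (0x00010000, 0x0010FFFF),
]


def utf8_hex(scalar):
    """Encode one scalar value as UTF-8 and format the bytes as hex."""
    return " ".join("%02X" % b for b in chr(scalar).encode("utf-8"))


for first, last in UNICODE_RANGES:
    print("U-%08X .. U-%08X    %s .. %s"
          % (first, last, utf8_hex(first), utf8_hex(last)))
```

The gap between U-0000D7FF and U-0000E000 is the surrogate range;
Python's strict codec raises UnicodeEncodeError for those values.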

For 10646-1, incorporating Amendment 2, the following additional
UTF-8 ranges are also meaningful:

U-00110000 .. U-001FFFFF    F4 90 80 80 .. F7 BF BF BF
U-00200000 .. U-03FFFFFF    F8 88 80 80 80 .. FB BF BF BF BF
U-04000000 .. U-7FFFFFFF    FC 84 80 80 80 80 .. FD BF BF BF BF BF
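
Python's own codec will not produce these longer forms, so here is a
minimal sketch of an encoder for the full 31-bit 10646 range that
computes the boundary byte sequences. The function name
encode_utf8_10646 is hypothetical, and the sketch makes no attempt to
exclude surrogates or noncharacters.

```python
# (largest value, lead byte marker, number of continuation bytes)
_FORMS = [
    (0x000007FF, 0xC0, 1),
    (0x0000FFFF, 0xE0, 2),
    (0x001FFFFF, 0xF0, 3),
    (0x03FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]


def encode_utf8_10646(value):
    """Encode a UCS-4 value (0 .. 7FFFFFFF) in the 1- to 6-byte form."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if value < 0x80:
        return bytes([value])
    for limit, lead, n_cont in _FORMS:
        if value <= limit:
            # Lead byte carries the top bits; each continuation byte 6 more.
            out = [lead | (value >> (6 * n_cont))]
            for i in range(n_cont - 1, -1, -1):
                out.append(0x80 | ((value >> (6 * i)) & 0x3F))
            return bytes(out)
    raise AssertionError("unreachable: value already range-checked")
```

For example, encode_utf8_10646(0x110000) gives F4 90 80 80, the first
sequence beyond what a Unicode implementation would accept.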

This affects how a Unicode-conformant implementation of UTF-8 is
done. In particular, a Unicode implementation will typically not
expect or convert any 5- or 6-byte UTF-8 form, nor any 4-byte form
exceeding F4 8F BF BF, since such forms would convert to scalar
values exceeding U-0010FFFF -- which could not be Unicode values.
The consequences ripple through an implementation: the maximum
expansion sizes and buffer allowances are smaller, the UTF-8
conversion algorithm is marginally simpler, and so on.
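
A rough sketch of what such a constrained decoder might look like in
Python follows. The names decode_one, MAX_UNICODE_UTF8_BYTES and
MAX_10646_UTF8_BYTES are invented for this example, and it is one
possible arrangement rather than a reference implementation. It does
not single out U-0000FFFE and U-0000FFFF.

```python
# Worst-case bytes per character under each definition.
MAX_UNICODE_UTF8_BYTES = 4
MAX_10646_UTF8_BYTES = 6


def decode_one(data, pos=0):
    """Decode one scalar value at data[pos]; return (scalar, next_pos)."""
    lead = data[pos]
    if lead < 0x80:
        return lead, pos + 1
    if 0xC2 <= lead <= 0xDF:
        n_cont, value, minimum = 1, lead & 0x1F, 0x80
    elif 0xE0 <= lead <= 0xEF:
        n_cont, value, minimum = 2, lead & 0x0F, 0x800
    elif 0xF0 <= lead <= 0xF4:
        n_cont, value, minimum = 3, lead & 0x07, 0x10000
    else:
        # 80..BF: stray continuation; C0, C1: overlong 2-byte forms;
        # F5..FF: 4-byte forms above U-0010FFFF and all 5-/6-byte forms.
        raise ValueError("invalid lead byte %02X" % lead)
    end = pos + 1 + n_cont
    if end > len(data):
        raise ValueError("truncated sequence")
    for b in data[pos + 1:end]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte %02X" % b)
        value = (value << 6) | (b & 0x3F)
    if value < minimum:
        raise ValueError("overlong form")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError("surrogate value")
    if value > 0x10FFFF:
        # Only reachable with lead byte F4, e.g. F4 90 80 80.
        raise ValueError("beyond U-0010FFFF")
    return value, end
```

With this arrangement only three multi-byte cases need handling, and
an output buffer never needs more than four bytes per character.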

--Ken
        


