UTF-8 (was Re: Mercury News: Hawaiian on a Mac)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 05 2002 - 15:49:53 EDT


Markus Scherer responded:

> Stefan Persson wrote:
>
> > This links to a different page on the same server:
> >
> > http://www.cl.cam.ac.uk/~mgk25/unicode.html
> >
> > That page contains a strange UTF-8 table:
> > ...
> > The last two byte sequences are invalid.
>
>
> Markus Kuhn's page shows the original ISO 10646 definition.
                               ^^^^^^^^
and still current ISO/IEC 10646 definition. Table D.1 in Annex D
"UCS Transformation Format 8 (UTF-8)".

Note that the definition of the 5- and 6-byte UTF-8 sequences
for code positions past U-001FFFFF is essentially harmless,
as ISO/IEC 10646 now contains explicit language stating that
no characters are intended to be encoded at code positions
past U-0010FFFF. So the definition of the 5- and 6-byte sequences
is vacuous -- no such sequence will ever be a valid representation
of an *encoded character* in 10646.
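
To make the arithmetic concrete, here is a rough sketch of the
byte-length ranges from Table D.1. The function name and the
MAX_ENCODED constant are illustrative assumptions, not anything
taken from the standard's text:

```python
# Byte-length ranges as laid out in Table D.1 of ISO/IEC 10646 Annex D.
# The names below are hypothetical, chosen only for this illustration.
MAX_ENCODED = 0x10FFFF  # last code position 10646 intends to use


def utf8_length(code_position):
    """Return how many bytes the Table D.1 form uses for a code position."""
    if not 0 <= code_position <= 0x7FFFFFFF:
        raise ValueError("outside the UCS-4 range")
    # Each pair is (highest code position, number of bytes).
    for limit, length in ((0x7F, 1), (0x7FF, 2), (0xFFFF, 3),
                          (0x1FFFFF, 4), (0x3FFFFFF, 5), (0x7FFFFFFF, 6)):
        if code_position <= limit:
            return length


print(utf8_length(MAX_ENCODED))  # 4
print(utf8_length(0x200000))     # 5, but no character will be encoded here
```

Since U-0010FFFF already fits in 4 bytes, the 5- and 6-byte rows
of the table are never reached by any encoded character.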

> This necessarily includes all codes up to 7FFFFFFF.
> It also includes D800..DFFF, which is not allowed in Unicode 3.2
> and the RFC on UTF-8, and I think implicitly not allowed in ISO 10646.

They are *explicitly* not allowed in UTF-8 in ISO/IEC 10646 as well.
From Clause D.4 "Mapping from UCS-4 form to UTF-8 form":

  "Values of x in the range 0000 D800 .. 0000 DFFF are
   reserved for the UTF-16 form and do not occur in UCS-4.
   The mappings of these code positions in UTF-8 are undefined."
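
As an illustration of what that clause means for an encoder, here
is a minimal sketch that refuses the reserved range. The function
name is hypothetical, and leaning on Python's built-in codec for
the actual byte layout is my own shortcut, not part of the clause:

```python
def encode_ucs4_to_utf8(x):
    """Encode one UCS-4 value as UTF-8, refusing the UTF-16 reserved range."""
    if 0xD800 <= x <= 0xDFFF:
        # Clause D.4: these values do not occur in UCS-4, so there is
        # no defined UTF-8 mapping for them.
        raise ValueError("D800..DFFF is reserved for the UTF-16 form")
    # chr() itself rejects anything past 0x10FFFF with ValueError.
    return chr(x).encode("utf-8")


print(encode_ucs4_to_utf8(0x20AC))  # b'\xe2\x82\xac'
```

Python's own UTF-8 codec applies the same restriction by default:
chr(0xD800).encode("utf-8") raises UnicodeEncodeError.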

--Ken


