Re: UTF-8 validation rules

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Sep 10 2001 - 23:09:59 EDT


David Hopwood said:

> >
> > With Unicode 3.2 (in the works), the 32 additional code points
> > at U+FDD0..U+FDEF go from unallocated status to noncharacters
> > as well.
>
> Those are non-characters in Unicode 3.1 (see D7b in UAX #27).

Yes, I stand corrected. They are *already* approved by the UTC
and have been published in Unicode 3.1.

The issue is one of synchronization with the Amendment 1 to 10646-1:2000,
which is still under ballot and which will designate the same code
points in 10646. Most of the content of Amendment 1 (the additional
characters, anyway) will appear as Unicode 3.2, but the architectural
changes, including these new designations of noncharacter code points,
are considered already in Unicode 3.1.

>
> Carl W. Brown wrote:
> | ... It seems like an interesting range for non-characters.
>
> It's for Arabic presentation forms internal to a rendering implementation,
> I assume (although it's not clear why existing private-use characters
> couldn't have been used for that).

This is incorrect. The range of noncharacters U+FDD0..U+FDEF is
not for Arabic presentation forms at all. They are noncharacters.
Internally, they could be used for anything, but they are not
to be externally interchanged, and have no public interpretation.

The choice of FDD0..FDEF as the code points for these noncharacters
was a reasonably arbitrary one, but was attempting to make use of
a contiguous range of 32 code points that couldn't reasonably be
assigned to anything else. And neither the UTC nor WG2 wants to
assign any more Arabic presentation forms!

>
> Kenneth Whistler wrote:
> > UTF-8 (and UTF-16 and UTF-32) convertors must allow the conversion
> > of noncharacter code points, but may then allow the detection of
> > their noncharacter status.
>
> Where does the standard say that conversion of these code points must
> be allowed? That would make it impossible to strictly comply with both
> Unicode 3.1 and ISO/IEC 10646-1:2000, since the latter says that U+FFFE
> and U+FFFF (but not other non-characters) are illegal in UTF-8 and must
> be rejected.

See my subsequent note. The text in 10646-1 is being corrected. It is
inconsistent as it stands, since it treats U+FFFE and U+FFFF one way,
and the other noncharacters (U+1FFFE, etc.) another.
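
To make the "treat them all the same way" point concrete, here is a
minimal sketch in Python that builds the full noncharacter set: the 32
code points U+FDD0..U+FDEF plus the last two code points of each of the
17 planes. The names NONCHARACTERS and is_noncharacter are made up for
illustration and do not come from either standard.

```python
# Sketch only: NONCHARACTERS and is_noncharacter are illustrative names.
# U+FDD0..U+FDEF (32 code points) plus U+nFFFE and U+nFFFF for each of the
# 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF).
NONCHARACTERS = frozenset(range(0xFDD0, 0xFDF0)) | frozenset(
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
)


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the noncharacter code points."""
    return cp in NONCHARACTERS
```

With this, U+FFFE, U+FFFF, and U+1FFFE all get the same answer, which is
the consistency the current 10646-1 text lacks.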

>
> As far as I understand, according to Unicode 3.1, non-characters may be
> *either* converted or rejected.

O.k. Let me put it this way.

*Definitionally*, the encoding forms define the relationships:

UTF-32        UTF-16     UTF-8

0000FFFF <==> FFFF  <==> EF BF BF

(and so on for each of the noncharacter code points)
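
As a quick check of these relationships, the following Python sketch
prints the three encoding forms for a given code point using only the
built-in codecs. The helper name encoding_forms is an assumption for
illustration; big-endian byte order is chosen just so the UTF-32 and
UTF-16 hex reads the same as the code unit values.

```python
def encoding_forms(cp: int):
    """Return (UTF-32, UTF-16, UTF-8) for cp as uppercase hex strings.

    Illustrative helper only. Python's codecs encode noncharacters
    without complaint; they refuse only surrogate code points.
    """
    ch = chr(cp)
    utf32 = "".join(f"{b:02X}" for b in ch.encode("utf-32-be"))
    utf16 = "".join(f"{b:02X}" for b in ch.encode("utf-16-be"))
    utf8 = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    return utf32, utf16, utf8


print(encoding_forms(0xFFFF))   # ('0000FFFF', 'FFFF', 'EF BF BF')
print(encoding_forms(0x1FFFE))  # ('0001FFFE', 'D83FDFFE', 'F0 9F BF BE')
```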

Note that Table D.3 "Examples in hexadecimal notation" in Annex D UTF-8
in 10646-1 even explicitly lists this example! This despite the contrary
text in Note 3 to clause D.4, which claims that the UTF-8 mapping of
U-0000FFFF is undefined. (Which is what needs to be fixed in 10646.)

A convertor for UTF-8 *should* be able to do this conversion correctly.

It is another thing to decide whether an API for UTF-8 conversion will
report a non-character value as an error. That depends on the context.

In my opinion, the most robust implementation is for the convertor to
convert clean through, only reporting errors on *illegal* values
(unpaired surrogates, code points > 0x10FFFF). It should then be up
to some other piece of code to determine whether a code point is
unassigned, a noncharacter, or something else.
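
A minimal sketch of that split, in Python, might look like the
following. The function names decode_utf8 and find_noncharacters are
hypothetical, and the decoder is a hand-written illustration rather than
any particular library's implementation. Besides the two illegal cases
named above, it also rejects malformed and overlong byte sequences,
which is an assumption on my part about what a strict converter would
do. Noncharacters pass straight through the decoder and are only
flagged by the separate second step.

```python
from typing import List


def decode_utf8(data: bytes) -> List[int]:
    """Decode UTF-8 to code points, raising ValueError only on illegal input."""
    out = []
    i = 0
    while i < len(data):
        b0 = data[i]
        # Lead byte decides the payload bits, the number of trail bytes,
        # and the smallest code point that legitimately needs this length.
        if b0 < 0x80:
            cp, need, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, need, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, need, lowest = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"illegal lead byte 0x{b0:02X} at offset {i}")
        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError(f"malformed sequence at offset {i}")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < lowest:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate U+{cp:04X} at offset {i}")
        if cp > 0x10FFFF:
            raise ValueError(f"code point beyond U+10FFFF at offset {i}")
        out.append(cp)  # noncharacters such as U+FFFF are converted as-is
        i += 1 + need
    return out


def find_noncharacters(cps: List[int]) -> List[int]:
    """Separate pass: report noncharacters among already-decoded code points."""
    return [
        cp for cp in cps
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    ]


cps = decode_utf8(b"A\xEF\xBF\xBF")
print([f"U+{cp:04X}" for cp in cps])                       # ['U+0041', 'U+FFFF']
print([f"U+{cp:04X}" for cp in find_noncharacters(cps)])   # ['U+FFFF']
```

Whether the caller treats the second list as an error is then a policy
decision made outside the converter.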

--Ken

>
> - --
> David Hopwood <david.hopwood@zetnet.co.uk>


