Re: Abstract character?

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Jul 23 2002 - 22:44:31 EDT


Kenneth Whistler <kenw at sybase dot com> wrote:

>> UTF-16 does not allow the representation of an unpaired surrogate
>> 0xD800 followed by another, coincidental unpaired surrogate 0xDC00.
>> (It maps the two to U+10000.) Among the standard UTFs, only UTF-32
>> allows the two to be treated as unpaired surrogates.
>
> Actually, not that, either.
>
>> In fact, before UTF-8 was
>> "tightened up" in 3.2, the only UTF that DID NOT permit these two
>> coincidental unpaired surrogates was UTF-16.
>>
>> UTF-8: D800 DC00 <==> ED A0 80 ED B0 80 (no longer legal)
>> UTF-32: D800 DC00 <==> 0000D800 0000DC00
>
> This is ill-formed in UTF-32, and thereby, illegal.

I'm glad to hear that unpaired surrogates are now also illegal in
UTF-32, and presumably also in UTF-16. However, I did do my homework
before writing yesterday's post, and that wasn't the impression I got,
so I sense another opportunity to tighten up the definitions before
Unicode 4.0 is released.
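
To make the disputed case concrete, here is a minimal Python sketch. It assumes CPython 3.4 or later, whose built-in strict codecs reject surrogate code points and whose "surrogatepass" error handler reproduces the looser byte forms shown in the quoted table. The helper name try_encode is hypothetical, introduced only for this illustration.

```python
# Two surrogate code points held as a Python string (illustration only).
pair = "\ud800\udc00"

def try_encode(text, codec):
    """Return the encoded bytes as hex, or None if the strict codec refuses.

    Hypothetical helper, not part of any library.
    """
    try:
        return text.encode(codec).hex()
    except UnicodeEncodeError:
        return None

# Strict codecs refuse to encode the surrogate code points.
utf8_strict = try_encode(pair, "utf-8")
utf32_strict = try_encode(pair, "utf-32-be")

# "surrogatepass" yields the byte forms listed in the quoted table.
utf8_loose = pair.encode("utf-8", "surrogatepass").hex()
utf32_loose = pair.encode("utf-32-be", "surrogatepass").hex()

# In UTF-16 the code units D800 DC00 are simply the encoding of U+10000.
utf16_decoded = bytes.fromhex("d800dc00").decode("utf-16-be")
```

Under those assumptions the two strict calls return None, the loose forms are ED A0 80 ED B0 80 and 0000D800 0000DC00, and the UTF-16 decode produces the single code point U+10000.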

In UAX #28, "Unicode 3.2," the section on "Elimination of Irregular
Sequences" starts out talking about "transformation formats such as
UTF-8." However, the rest of the section deals exclusively with UTF-8;
UTF-16 and UTF-32 are not mentioned.

UAX #19, "UTF-32" (written by Mark) is listed in the header block as
having been updated to Unicode 3.2, but it does not state anywhere that
unpaired surrogates are illegal. In particular, the following passages
from UAX #19 led me to believe that all code points, from 0x0000 through
0x10FFFF inclusive, are legal in UTF-32:

"UTF-32 is restricted in values to the range 0..10FFFF (hex),
which precisely matches the range of characters defined in the Unicode
Standard (and other standards such as XML), and those representable by
UTF-8 and UTF-16."

"(b) An illegal UTF-32 code unit sequence is any byte sequence that
would correspond to a numeric value outside of the range 0 to
10FFFF (hex).

"(c) An irregular UTF-32 code unit sequence is an eight-byte sequence
where the first four bytes correspond to a high surrogate, and the next
four bytes correspond to a low surrogate. As a consequence of C12, these
irregular UTF-32 sequences shall not be generated by a conformant
process."
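
Read literally, passages (b) and (c) can be sketched as the following Python check. This is one possible reading of the quoted wording only, not a statement of what the standard intends; the function and label names are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF

def is_high_surrogate(cp):
    return 0xD800 <= cp <= 0xDBFF

def is_low_surrogate(cp):
    return 0xDC00 <= cp <= 0xDFFF

def classify_utf32_as_quoted(units):
    """Label a list of 32-bit values using only rules (b) and (c) as quoted.

    Hypothetical helper; the labels are for illustration.
    """
    # Rule (b): any value outside 0..10FFFF makes the sequence illegal.
    if any(u < 0 or u > MAX_CODE_POINT for u in units):
        return "illegal"
    # Rule (c): a high surrogate immediately followed by a low surrogate
    # makes the sequence irregular.
    for first, second in zip(units, units[1:]):
        if is_high_surrogate(first) and is_low_surrogate(second):
            return "irregular"
    # Nothing in the quoted passages excludes anything else.
    return "not excluded"
```

Under this reading, [0xD800, 0xDC00] is "irregular", but a lone 0xD800 or the reversed pair [0xDC00, 0xD800] comes back "not excluded", which is the gap in the wording.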

I suggest that the Unicode 4.0 text specifically state, in unambiguous
terms, which code points are and are not valid in UTF-8, UTF-16, and
UTF-32. And if it is true that the surrogate code points 0xD800 through
0xDFFF are illegal in UTF-32, then I suggest that UAX #19 be revised to
state this unambiguously.
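
An unambiguous rule of the kind suggested here could be as short as the following Python sketch. It assumes the stricter reading described in the quoted reply (surrogate code points are never valid in UTF-32); both function names are hypothetical.

```python
def is_scalar_value(cp):
    """True for code points in 0..10FFFF that are not surrogates D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)

def first_invalid_index(units):
    """Return the index of the first invalid 32-bit value, or None if all pass.

    Hypothetical helper, assuming every UTF-32 code unit must be a
    scalar value in the sense of is_scalar_value above.
    """
    for i, u in enumerate(units):
        if not is_scalar_value(u):
            return i
    return None
```

With this rule, paired and unpaired surrogates are rejected alike: first_invalid_index([0x41, 0xD800, 0xDC00]) reports index 1, and a sequence such as [0x41, 0x10000] passes.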

-Doug Ewell
 Fullerton, California


