From: Kenneth Whistler (firstname.lastname@example.org)
Date: Thu Feb 27 2003 - 15:42:43 EST
Frank Tang asked:
> >> This discussion has been centered around UTF-8. But I hope the
> >>corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
> >>. for UTF-32: occurrences of 'surrogates' are ill-formed.
> How about UTF-32 sequence which the 4 bytes represent value > U+10FFFF ?
> Are they considered ill-formed? Should they?
Yes, they are ill-formed.
Since all the encoding forms are based on the Unicode scalar values,
and since the Unicode scalar values are *defined* to be the
range 0x0000..0xD7FF, 0xE000..0x10FFFF, any attempt to represent
a code point higher than U+10FFFF in *any* encoding form is
ill-formed.
This will be called out explicitly in the Unicode 4.0 text, in
case anyone still has the question:
" * Any UTF-32 code unit greater than 0010FFFF₁₆ is
ill-formed."
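For anyone who wants to see the rule as a check, here is a minimal sketch in Python. It assumes the UTF-32 code units have already been decoded to plain integers, and the function names (is_unicode_scalar_value, first_ill_formed_utf32) are made up for this illustration; they come from neither the standard nor any library.

```python
# Bounds taken from the definition of Unicode scalar values:
# 0x0000..0xD7FF and 0xE000..0x10FFFF.
MAX_SCALAR = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_unicode_scalar_value(value: int) -> bool:
    """Return True if value lies in 0x0000..0xD7FF or 0xE000..0x10FFFF."""
    return (0 <= value < SURROGATE_FIRST) or (SURROGATE_LAST < value <= MAX_SCALAR)


def first_ill_formed_utf32(code_units):
    """Return the index of the first ill-formed UTF-32 code unit, or None.

    A code unit is treated as ill-formed if it is a surrogate or is
    greater than 0x10FFFF (hypothetical helper, for illustration only).
    """
    for index, unit in enumerate(code_units):
        if not is_unicode_scalar_value(unit):
            return index
    return None
```

With this sketch, first_ill_formed_utf32([0x41, 0x110000]) reports index 1, and a sequence containing 0xD800 is rejected the same way.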
I can keep answering these questions, but I can also assure
everyone that the UTC worked *very* hard this time around to
make the character encoding model much clearer in the Unicode 4.0
text, and to anticipate all these edge cases.