Date: Tue Nov 04 2003 - 12:45:24 EST
In a message dated 11/4/2003 6:44:05 AM Pacific Standard Time,
What is a conforming application supposed to do if, when decoding a UTF-8
stream (or indeed a UTF-32 stream, etc.), it encounters a sequence of bytes which
decodes to U+D800, U+DF00 ?
Of course, if such a sequence were encountered during UTF-16 processing it
would be pretty obvious, but I'm not talking UTF-16 any more. At least, not
directly. Nonetheless, such a sequence could arise if Application A encodes text
to a file using UTF-16, which is then read by Application B (a very old, legacy
application, unaware of the existence of codepoints above U+FFFF) and
re-saved in UTF-8.
It is clear that Application B is not a conforming application to Unicode 3.2
or Unicode 4.0, right?
It is clear that Application A is a conforming application to Unicode 3.2 or
Unicode 4.0, right?
If you have an Application C that reads whatever Application B writes, then
it should not accept the illegal UTF-8 sequence that uses 3 bytes to encode U+D800
and another 3 bytes to encode U+DF00. This is clear in Unicode 3.2 or Unicode
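
As a rough illustration of the Application C case, here is a minimal Python
sketch. It assumes Python 3's built-in strict "utf-8" codec stands in for a
conforming decoder; the function name accepts_as_utf8 is made up for this
example and is not from any of the applications discussed.

```python
# ED A0 80 and ED BC 80 are the 3-byte forms that a surrogate-unaware
# writer (like Application B) would emit for U+D800 and U+DF00.
SURROGATE_BYTES = b"\xed\xa0\x80\xed\xbc\x80"

# F0 90 8C 80 is the well-formed 4-byte UTF-8 form of U+10300, the
# supplementary character that the pair D800 DF00 denotes in UTF-16.
WELL_FORMED_BYTES = b"\xf0\x90\x8c\x80"


def accepts_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False


print(accepts_as_utf8(SURROGATE_BYTES))    # False
print(accepts_as_utf8(WELL_FORMED_BYTES))  # True
```

A reader that wants to repair such input rather than reject it would have to
do so as an explicit, separate step; the sketch only shows the rejection.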
This question generalises to ... should all encoding schemes treat surrogate
pairs as surrogate pairs, or just UTF-16 ?
This question generalises further still, to ... do the phrases "surrogate
character" and "surrogate pair" have any meaning whatsoever outside UTF-16?
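
To make the question concrete, the sketch below (Python, with a hypothetical
helper named combine_surrogates) applies the UTF-16 surrogate arithmetic to
the pair from the example and then encodes the resulting code point in three
encoding forms. It is only an illustration of where the surrogate values show
up, not a statement of conformance requirements.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 high/low surrogate pair to the code point it denotes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD800, 0xDF00)
char = chr(code_point)

print(hex(code_point))                 # 0x10300
print(char.encode("utf-16-be").hex())  # d800df00
print(char.encode("utf-8").hex())      # f0908c80
print(char.encode("utf-32-be").hex())  # 00010300
```

Only the UTF-16 form contains the code units D800 and DF00; the UTF-8 and
UTF-32 forms are computed directly from U+10300.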
Frank Yung-Fong Tang
System Architect, Iñtërnâtiônàl Dèvélôpmeñt, AOL Intèrâçtívë Sërviçes
AIM:yungfongta mailto:firstname.lastname@example.org Tel:650-937-2913
Yahoo! Msg: frankyungfongtan
John 3:16 "For God so loved the world that he gave his one and only Son, that
whoever believes in him shall not perish but have eternal life.
This archive was generated by hypermail 2.1.5 : Tue Nov 04 2003 - 13:48:08 EST