Re: UTF-16 inside UTF-8

Date: Tue Nov 04 2003 - 12:45:24 EST

    In a message dated 11/4/2003 6:44:05 AM Pacific Standard Time, writes:

    What is a conforming application supposed to do if, when decoding a UTF-8
    stream (or indeed a UTF-32 stream, etc.), it encounters a sequence of bytes that
    decodes to U+D800, U+DF00?

    Of course, if such a sequence were encountered during UTF-16 processing it
    would be pretty obvious, but I'm not talking about UTF-16 any more. At least, not
    directly. Nonetheless, such a sequence could arise if Application A encodes text
    to a file using UTF-16, which is then read by Application B (a very old, legacy
    application, unaware of the existence of codepoints above U+FFFF) and
    re-saved in UTF-8.

    It is clear that Application B does not conform to Unicode 3.2 or
    Unicode 4.0, right?
    It is clear that Application A does conform to Unicode 3.2 or
    Unicode 4.0, right?

    If you have an Application C that reads whatever Application B writes, then
    it should not accept the ill-formed UTF-8 sequence that uses 3 bytes to encode
    U+D800 and another 3 bytes to encode U+DF00. This is clear in Unicode 3.2 or Unicode
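
    To make the byte sequences in question concrete, here is a minimal Python
    sketch of the scenario. The function names (legacy_utf16_to_utf8,
    strict_accepts) are hypothetical, and the legacy converter is only an
    assumption about how an application like B might behave: it treats every
    16-bit code unit as a standalone character. Python's built-in strict UTF-8
    decoder stands in for Application C.

```python
import struct


def legacy_utf16_to_utf8(utf16_be: bytes) -> bytes:
    """Hypothetical Application B: knows nothing above U+FFFF, so it encodes
    each 16-bit code unit (surrogates included) as its own UTF-8 sequence."""
    out = bytearray()
    for (unit,) in struct.iter_unpack(">H", utf16_be):
        if unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)


def strict_accepts(data: bytes) -> bool:
    """Stand-in for Application C: Python's strict UTF-8 decoder, which
    rejects byte sequences that decode to surrogate code points."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "\U00010300"                    # one code point above U+FFFF
utf16 = text.encode("utf-16-be")       # Application A: D8 00 DF 00
bad = legacy_utf16_to_utf8(utf16)      # Application B: ED A0 80 ED BC 80
good = text.encode("utf-8")            # well-formed UTF-8: F0 90 8C 80

print(utf16.hex(" "), "|", bad.hex(" "), "|", good.hex(" "))
print(strict_accepts(bad), strict_accepts(good))
```

    In this sketch the six-byte output of the legacy converter is rejected,
    while the four-byte form of the same code point (U+10300, which is what the
    pair U+D800, U+DF00 represents in UTF-16) is accepted.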

    This question generalises to ... should all encoding schemes treat surrogate
    pairs as surrogate pairs, or just UTF-16 ?

    This question generalises further still, to ... do the phrases "surrogate
    character" and "surrogate pair" have any meaning whatsoever outside UTF-16?

