From: Jill Ramonsky (Jill.Ramonsky@Aculab.com)
Date: Tue Nov 04 2003 - 09:37:00 EST
What is a conforming application supposed to do if, when decoding a
UTF-8 stream (or indeed a UTF-32 stream, etc.), it encounters a sequence
of bytes which decodes to U+D800, U+DF00?
Of course, if such a sequence were encountered during UTF-16 processing
it would be pretty obvious, but I'm not talking UTF-16 any more. At
least, not directly. Nonetheless, such a sequence could arise if
Application A encodes text to a file using UTF-16, which is then read by
Application B (a very old, legacy application, unaware of the existence
of codepoints above U+FFFF) and re-saved in UTF-8.
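To make the scenario concrete, here is a sketch (my illustration, not part of the original question) of how Application B's naive re-encoding could produce exactly those bytes. U+D800, U+DF00 is the UTF-16 surrogate pair for U+10300, and a legacy application that treats each 16-bit unit as a separate character would encode each surrogate as a three-byte UTF-8-style sequence:

```python
# U+10300 (OLD ITALIC LETTER A) -- needs a surrogate pair in UTF-16.
text = "\U00010300"

# Application A: saves the text as UTF-16 (big-endian, no BOM for brevity).
utf16 = text.encode("utf-16-be")          # b'\xd8\x00\xdf\x00'

# Application B: unaware of codepoints above U+FFFF, it treats each
# 16-bit unit as a separate "character" and encodes each individually
# using the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx.
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
bogus = b"".join(
    bytes([0xE0 | (u >> 12),
           0x80 | ((u >> 6) & 0x3F),
           0x80 | (u & 0x3F)])
    for u in units
)
print(bogus.hex())  # eda080edbc80 -- this is CESU-8-style output, not UTF-8

# Python's UTF-8 decoder, like any conforming one, rejects surrogate
# code points in a UTF-8 stream:
try:
    bogus.decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
```

Under the current standard such byte sequences are ill-formed UTF-8 (the surrogate range U+D800..U+DFFF is excluded), so a strict decoder refuses them rather than reassembling the pair.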
This question generalises to ... should /all/ encoding schemes treat
surrogate pairs as surrogate pairs, or just UTF-16?
This question generalises further still, to ... do the phrases
"surrogate character" and "surrogate pair" have any meaning whatsoever
outside of UTF-16?
This archive was generated by hypermail 2.1.5 : Tue Nov 04 2003 - 11:10:48 EST