Re: UTF-16 inside UTF-8

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Nov 04 2003 - 11:37:17 EST


    Jill Ramonsky wrote:

    > What is a conforming application supposed to do if, when decoding a
    > UTF-8 stream (or indeed a UTF-32 stream, etc.), it encounters a
    > sequence of bytes which decodes to U+D800, U+DF00 ?

    It should recognize that the text is not UTF-8 at all, but rather CESU-8
    (see UTR #26), whereupon it should burst into uncontrollable peals of
    laughter.

    Serious answer: It should recognize that the text is *ill-formed* UTF-8
    (definition D30) and should probably decline to process the two code
    points. If it wants to be more charitable than conformant, it MAY
    choose to reassemble them to create U+10300, but it is under no
    obligation to do so.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/
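
The two behaviours described in the serious answer (reject the input as ill-formed UTF-8, or charitably reassemble the pair into U+10300) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `is_well_formed_utf8`, `combine_surrogates` and `charitable_decode` are hypothetical, and the byte sequence is what results if U+D800 and U+DF00 are each encoded as a three-byte sequence, CESU-8 style. Python's strict `utf-8` codec rejects such bytes; the `surrogatepass` error handler is used here only to get at the surrogate code points.

```python
# Illustrative sketch; the helper names are hypothetical, not from any standard.

# U+D800 and U+DF00, each written as a three-byte sequence (CESU-8 style).
CESU8_BYTES = bytes([0xED, 0xA0, 0x80, 0xED, 0xBC, 0x80])


def is_well_formed_utf8(data: bytes) -> bool:
    """Conformant behaviour: a strict decoder refuses encoded surrogates."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to the supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def charitable_decode(data: bytes) -> str:
    """Charitable (non-conformant) behaviour: reassemble surrogate pairs."""
    # 'surrogatepass' lets encoded surrogates through as lone code points.
    text = data.decode("utf-8", errors="surrogatepass")
    out = []
    i = 0
    while i < len(text):
        cp = ord(text[i])
        nxt = ord(text[i + 1]) if i + 1 < len(text) else None
        if 0xD800 <= cp <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            out.append(chr(combine_surrogates(cp, nxt)))
            i += 2
        else:
            out.append(text[i])  # unpaired surrogates are left untouched
            i += 1
    return "".join(out)


print(is_well_formed_utf8(CESU8_BYTES))               # strict check
print(hex(combine_surrogates(0xD800, 0xDF00)))        # pair arithmetic
print(charitable_decode(CESU8_BYTES) == "\U00010300") # reassembled result
```

A strict decoder stops at the first function; only an application that deliberately chooses leniency would go on to the third.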


