Re: UTF-16 inside UTF-8

Date: Tue Nov 04 2003 - 12:45:24 EST


    In a message dated 11/4/2003 6:44:05 AM Pacific Standard Time, writes:

    What is a conforming application supposed to do if, when decoding a UTF-8
    stream (or indeed a UTF-32 stream, etc.), it encounters a sequence of bytes which
    decodes to U+D800, U+DF00?

    Of course, if such a sequence were encountered during UTF-16 processing it
    would be pretty obvious, but I'm not talking UTF-16 any more. At least, not
    directly. Nonetheless, such a sequence could arise if Application A encodes text
    to a file using UTF-16, which is then read by Application B (a very old, legacy
    application, unaware of the existence of codepoints above U+FFFF) and
    re-saved in UTF-8.
    It is clear that Application B does not conform to Unicode 3.2 or
    Unicode 4.0, right?
    It is clear that Application A does conform to Unicode 3.2 or
    Unicode 4.0, right?
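
To make the scenario concrete, here is a minimal Python sketch of how such bytes could arise. The function `legacy_utf16_to_utf8` is hypothetical and only imitates what a surrogate-unaware Application B might do: it treats every 16-bit unit as a complete code point. Python's `surrogatepass` error handler is used purely to reproduce the 3-byte forms; it is not something a conforming encoder would normally use.

```python
def legacy_utf16_to_utf8(data: bytes) -> bytes:
    """Hypothetical 'Application B': treats each 16-bit unit as a code point."""
    out = bytearray()
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        # surrogatepass makes Python emit the 3-byte form for a lone surrogate,
        # which is what a surrogate-unaware encoder would write.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# Application A: U+10300 in UTF-16 (big-endian) is the pair D800 DF00.
utf16 = "\U00010300".encode("utf-16-be")

# Application B: re-saves each surrogate as its own 3-byte sequence.
bad = legacy_utf16_to_utf8(utf16)

# What a conforming encoder writes for the same character: one 4-byte sequence.
good = "\U00010300".encode("utf-8")

print(utf16.hex(), bad.hex(), good.hex())
```

The same character thus ends up as six bytes (ED A0 80 ED BC 80) instead of the four bytes (F0 90 8C 80) that a conforming UTF-8 encoder produces.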

    If you have an Application C that reads whatever Application B writes, then
    it should not accept the illegal UTF-8 sequence that uses 3 bytes to encode U+D800
    and another 3 bytes to encode U+DF00. This is clear in Unicode 3.2 or Unicode
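
As an illustration of what "should not accept" can look like in practice, the sketch below leans on Python 3's built-in strict UTF-8 decoder, which refuses byte sequences that decode to surrogate code points. The helper name `accept_utf8` is made up for this example; a real Application C might instead report an error or substitute U+FFFD.

```python
def accept_utf8(data: bytes) -> bool:
    """Hypothetical 'Application C' check: True only for well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict mode is the default
        return True
    except UnicodeDecodeError:
        return False


# U+D800 and U+DF00 encoded separately, 3 bytes each (not valid UTF-8).
ill_formed = bytes.fromhex("eda080edbc80")

# The same character, U+10300, as a single 4-byte UTF-8 sequence.
well_formed = bytes.fromhex("f0908c80")

print(accept_utf8(ill_formed), accept_utf8(well_formed))
```

Rejecting the six-byte form, rather than quietly pairing the two surrogates, also means each code point has only one acceptable byte sequence on input.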

    This question generalises to ... should all encoding schemes treat surrogate
    pairs as surrogate pairs, or just UTF-16?

    This question generalises further still, to ... do the phrases "surrogate
    character" and "surrogate pair" have any meaning whatsoever outside UTF-16?

    Frank Yung-Fong Tang
    System Architect, International Development, AOL Interactive Services
    AIM:yungfongta Tel:650-937-2913
    Yahoo! Msg: frankyungfongtan

    John 3:16 "For God so loved the world that he gave his one and only Son, that
    whoever believes in him shall not perish but have eternal life."

    Does your software display Thai language text correctly for Thailand users?
    -> Basic Concept of Thai Language linked from Frank Tang's
    Internationalization Secrets
    Want to translate your English text to something Thailand users can
    understand ?
    -> Try English-to-Thai machine translation at
