Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Yung-Fong Tang
Date: Fri Feb 28 2003 - 13:21:01 EST


    Kenneth Whistler wrote:

    >Think of it this way. Does anyone expect the ASCII standard to tell,
    >in detail, what a process should or should not do if it receives
    >data which purports to be ASCII, but which contains an 0x80 byte
    >in it? All the ASCII standard can really do is tell you that
    >0x80 is not defined in ASCII, and a conformant process shall not
    >interpret 0x80 as an ASCII character. Beyond that, it is up to
    >the software engineers to figure out who goofed up in mislabelling
    >or corrupting the data, and what the process receiving the bad data
    >should do about it.
    That is not a good comparison. ASCII is a single-byte character code
    standard. When I get a 0x80 in an ASCII string, I know where the
    boundary is: the whole 8 bits of that 0x80 are bad. The scope is not
    the first 3 bits or 9 bits, but that 8-bit unit. I cannot tell whether
    the rest of the data is good or bad, but I know an ASCII character
    occupies 8 bits and 8 bits only.

    The same goes for JIS X 0208 (a TWO and only TWO byte character set,
    not a variable-length character set). If I am processing an ISO-2022-JP
    message in JIS X 0208 mode and I get a 0x24 0xa8, I know the boundary
    of that problem is 16 bits, not 8 bits or 32 bits.
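
    A minimal sketch of the fixed-width case described above, in Python.
    The helper names (bad_units, is_ascii_unit, is_jisx0208_unit) are
    hypothetical, and the JIS X 0208 check assumes the 7-bit form used
    inside ISO-2022-JP, where each of the two bytes must fall in
    0x21..0x7E. Because the unit width is fixed, the extent of every
    error is known without looking at neighbouring data.

```python
def bad_units(data: bytes, width: int, is_valid) -> list:
    """Return (start, end) byte offsets of each invalid fixed-width unit."""
    spans = []
    for start in range(0, len(data), width):
        unit = data[start:start + width]
        # A short trailing unit is also an error, but its extent is still known.
        if len(unit) < width or not is_valid(unit):
            spans.append((start, start + len(unit)))
    return spans


def is_ascii_unit(unit: bytes) -> bool:
    # ASCII defines only 0x00..0x7F; 0x80 and above are not ASCII.
    return unit[0] < 0x80


def is_jisx0208_unit(unit: bytes) -> bool:
    # Assumption: 7-bit JIS X 0208 as used in ISO-2022-JP (both bytes 0x21..0x7E).
    return all(0x21 <= b <= 0x7E for b in unit)


# The bad 0x80 is exactly one byte wide.
ascii_errors = bad_units(b"ab\x80c", 1, is_ascii_unit)
# The bad pair 0x24 0xa8 is exactly two bytes wide.
jis_errors = bad_units(b"\x24\x22\x24\xa8\x24\x24", 2, is_jisx0208_unit)
```

    Here ascii_errors covers bytes 2..3 only, and jis_errors covers bytes
    2..4 only: the error scope is always one whole unit.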

    When you deal with encodings that need state (ISO-2022, ISO-2022-JP,
    etc.) or variable-length encodings (Shift_JIS, Big5, UTF-8), the
    situation is different.
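
    One way to picture the difference for UTF-8, as a rough Python sketch
    rather than anything stated in this message: for a malformed sequence
    there are several defensible answers to "how many bytes are bad". The
    function names and the three labels below are hypothetical, and the
    lead-byte table is simplified (it ignores the narrower second-byte
    ranges that apply after 0xE0, 0xED, 0xF0 and 0xF4).

```python
def declared_length(lead: int) -> int:
    """Sequence length announced by a UTF-8 lead byte, or 0 if not a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def candidate_error_spans(data: bytes, start: int) -> dict:
    """List three plausible (start, end) extents for a bad sequence at `start`."""
    need = declared_length(data[start])
    have = 1
    # Count how many continuation bytes (0x80..0xBF) actually follow the lead.
    while (have < need and start + have < len(data)
           and 0x80 <= data[start + have] <= 0xBF):
        have += 1
    return {
        "lead_only": (start, start + 1),
        "valid_prefix": (start, start + have),
        "declared": (start, min(start + max(need, 1), len(data))),
    }


# 0xE3 announces three bytes, 0x81 is a continuation byte, 0x41 ('A') is not.
spans = candidate_error_spans(b"\xe3\x81A", 0)
```

    For this input the three candidates are 1, 2 and 3 bytes long, so the
    receiver has to pick a policy; with a fixed-width code there is only
    one possible answer.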
