Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Kenneth Whistler
Date: Thu Feb 27 2003 - 18:04:06 EST

    Tex Texin asked:

    > Hmm, is that true?

    Yes, it is true. All the standard *mandates* is what I quoted
    previously in this thread:

    "C12a When a process interprets a code unit sequence which purports
          to be in a Unicode character encoding form, it shall treat
          ill-formed code unit sequences as an error condition, and
          shall not interpret such sequences as characters."

    > Is it ok then, if I detect an unpaired surrogate, mutter
    > "oops I have an error" and then drop that surrogate and continue processing
    > the file, resulting in a valid utf-8 file?

    Hmm, I think you may be mixing the UTF-16 case with the UTF-8
    case, but...

    If that is what you tell your customers, clients, or calling APIs that
    you are explicitly doing to corrupted, ill-formed UTF-8 data, and
    if they think that is o.k., then you've got two happy users of
    the standard.*

    The problem, of course, is that if you are implementing a public
    API or service, just dropping corrupted bytes from a sequence can
    create security problems or other difficulties, and people would
    be well-advised to avoid software that claims to "auto-fix
    corrupted data", at least in such a crude way.

    > I thought for some reason this was prohibited, but if the standard does not
    > prescribe error handling, then this seems legit.

    The basic constraint is that "conformant processes cannot interpret
    ill-formed code unit sequences." Beyond that, the UTC has, from time
    to time, tried to provide some guidance regarding what is or is
    not reasonable for a process to do when confronted with bad data
    of this type, but spelling out in any kind of detail what a process
    should do with bad data is essentially out of scope for the standard.

    Think of it this way. Does anyone expect the ASCII standard to tell,
    in detail, what a process should or should not do if it receives
    data which purports to be ASCII, but which contains an 0x80 byte
    in it? All the ASCII standard can really do is tell you that
    0x80 is not defined in ASCII, and a conformant process shall not
    interpret 0x80 as an ASCII character. Beyond that, it is up to
    the software engineers to figure out who goofed up in mislabelling
    or corrupting the data, and what the process receiving the bad data
    should do about it.
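    The ASCII analogue is easy to state in code; this is just an illustration of the point, with a hypothetical `is_ascii` check:

```python
# Sketch of the ASCII case: the standard only says that bytes >= 0x80 are
# not defined in ASCII. A conformant process refuses to interpret them as
# ASCII characters; everything beyond that is the engineer's decision.

def is_ascii(data: bytes) -> bool:
    return all(b < 0x80 for b in data)

print(is_ascii(b"plain old text"))     # True
print(is_ascii(b"mislabelled \x80"))   # False: 0x80 is not an ASCII character
```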


    *Example: You have dedicated signal-processing software dealing
    with a data link messaging astronauts orbiting Titan. That data
    link is using UTF-8, uncompressed, for some reason, and you are
    having trouble with data dropouts. Your solution is to transmit
    every message 3 times, drop any corrupted sections, and then
    use a best match algorithm of some sort to compare the 3
    messages and fill in any missing sections from the versions that
    are not corrupted, thus reconstructing all the gaps. Of course,
    there are much better approaches to self-correcting data
    transmission, but you get the idea. This would be a perfectly
    valid and conformant way to use UTF-8 data.
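    The triple-transmission scheme in the footnote could be sketched roughly like this (hypothetical code: bytes lost to dropouts are modeled as `None`, and a real link would use proper forward error correction instead):

```python
# Hypothetical sketch of the footnote's scheme: the same UTF-8 message is
# received three times, bytes lost to dropouts are marked None, and each
# position is reconstructed by majority vote over the surviving copies.

def reconstruct(copies):
    out = []
    for position in zip(*copies):
        survivors = [b for b in position if b is not None]
        # take the most common surviving byte at this position
        out.append(max(set(survivors), key=survivors.count))
    return bytes(out)

message = list(b"UTF-8 over the Titan link")
copy1, copy2, copy3 = message.copy(), message.copy(), message.copy()
copy1[3] = None     # dropout in copy 1
copy2[10] = None    # dropout in copy 2

print(reconstruct([copy1, copy2, copy3]))   # b'UTF-8 over the Titan link'
```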

    > tex
    > Kenneth Whistler wrote:
    > > Absolutely. Error handling is a matter of software design, and not
    > > something mandated in detail by the Unicode Standard.

    This archive was generated by hypermail 2.1.5 : Thu Feb 27 2003 - 18:41:03 EST