Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Tex Texin
Date: Fri Feb 28 2003 - 00:27:04 EST


    Kenneth Whistler wrote:
    > Yes, it is true. All the standard *mandates* is what I quoted
    > previously in this thread:
    > "C12a When a process interprets a code unit sequence which purports
    > to be in a Unicode character encoding form, it shall treat
    > ill-formed code unit sequences as an error condition, and
    > shall not interpret such sequences as characters."
    > > Is it ok then, if I detect an unpaired surrogate, mutter
    > > "oops I have an error" and then drop that surrogate and continue processing
    > > the file, resulting in a valid utf-8 file?
    > Hmm, I think you may be mixing the UTF-16 case with the UTF-8
    > case, but...

    Ken, thanks for the reply.
    I thought that at some point along the way this thread was discussing UTF-16
    to UTF-8 conversion, which is where I was coming from. (I must have glommed
    some threads, or even some lists, together.)

    I certainly agree that reporting an error is the right design. However, there
    is software out there that didn't anticipate that an error could be generated
    during the conversion. With the advent of surrogates and the clarification of
    how UTF-8 is to be generated for surrogates, an unpaired surrogate becomes a
    real error case, and it can be difficult to address when the upper layers
    aren't prepared for it. Anyway, for some reason I thought that dropping the
    surrogate and continuing was also counter to the standard. Now I know it is
    just bad design.
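
    To make the two behaviours concrete, here is a minimal sketch of a UTF-16
    to UTF-8 converter. It is only an illustration: the function name
    utf16_units_to_utf8, the on_error parameter, and the exception class are
    hypothetical names, not anything defined by the standard. In "strict" mode
    an unpaired surrogate is reported as an error; in "drop" mode it is noted
    and skipped, which never interprets the ill-formed code unit as a character
    and still yields well-formed UTF-8.

```python
class UnpairedSurrogateError(ValueError):
    """Hypothetical error type for an ill-formed UTF-16 code unit sequence."""


def utf16_units_to_utf8(units, on_error="strict"):
    """Convert a sequence of 16-bit code units to UTF-8 bytes.

    on_error="strict": raise UnpairedSurrogateError on an unpaired surrogate.
    on_error="drop":   record the index of the unpaired surrogate and skip it.
    Returns (utf8_bytes, error_indexes).
    """
    if on_error not in ("strict", "drop"):
        raise ValueError("on_error must be 'strict' or 'drop'")

    out = bytearray()
    errors = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: an error condition, never treated as a character.
            if on_error == "strict":
                raise UnpairedSurrogateError(
                    f"unpaired surrogate U+{u:04X} at index {i}"
                )
            errors.append(i)
            i += 1
            continue
        else:
            # Ordinary BMP code unit.
            cp = u
            i += 1
        out += chr(cp).encode("utf-8")
    return bytes(out), errors
```

    Returning the error indexes alongside the output is one way to let upper
    layers that cannot handle an exception still find out that something was
    dropped, rather than having the loss happen silently.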


    Tex Texin   cell: +1 781 789 1898
    Xen Master                
    Making e-Business Work Around the World
