UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 27 2003 - 15:53:59 EST


    Frank Tang responded to Kent Karlsson's response:

    > The problem I need to deal with is not how to GENERATE that UTF-8, but how to
    > handle the DATA when my code receives it. For example, when I receive a
    > 10K UTF-8 file which has 1000 lines of text, and one UTF-8
    > sequence in line 990 is ill-formed, should I fire the "error" for
    > 1. the whole file (10K, 1000 lines),
    > 2. all the line after line 899,
    > 3. the line 990 itself,

    etc. etc.

    >
    > If there are other ways you can scope the ERROR, I can probably continue
    > on and on and tell you 10-20 other ways to scope it if I spend 20 more
    > minutes.
    >
    > I do believe the error handling should be application specific.

    Absolutely. Error handling is a matter of software design, and not
    something mandated in detail by the Unicode Standard.

    If you write software which handles a GIF image, and there is
    a corrupted byte in the middle of a 118K GIF file, you don't go
    to the GIF specification itself, e.g.,
    http://www.w3.org/Graphics/GIF/spec-gif87.txt
    to tell your software what to do after it has processed the first
    59K bytes (or whatever). The GIF specification just tells you
    what a well-formed GIF image is.

    Likewise, the Unicode Standard tells you what a well-formed
    UTF-8 byte sequence is. But it is the software designer who has
    to be smart about determining what his/her software will do when
    it encounters an error condition and finds itself dealing
    with a sequence which is ill-formed according to the specification
    of UTF-8 in the Unicode Standard.
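
    To make the point concrete, here is a minimal Python sketch of three
    of the possible scopes, using only the built-in bytes.decode() and
    its standard "strict" and "replace" error handlers. The function
    names are hypothetical and chosen for illustration; nothing here is
    prescribed by the Unicode Standard, and an application may
    reasonably pick a different policy.

```python
# Illustrative sketch only: the function names are hypothetical, and each
# one shows a different application-level policy for ill-formed UTF-8.

def decode_whole_file(data: bytes) -> str:
    """Reject the entire input if any sequence in it is ill-formed."""
    # errors="strict" raises UnicodeDecodeError at the first bad sequence.
    return data.decode("utf-8", errors="strict")


def decode_per_line(data: bytes):
    """Keep well-formed lines; report the numbers of the ill-formed ones."""
    good_lines = []
    bad_line_numbers = []
    # Splitting on the byte 0x0A is safe: in UTF-8 that byte never occurs
    # inside a multi-byte sequence.
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            good_lines.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            bad_line_numbers.append(number)
    return good_lines, bad_line_numbers


def decode_with_replacement(data: bytes) -> str:
    """Never fail; substitute U+FFFD for each ill-formed sequence."""
    return data.decode("utf-8", errors="replace")
```

    The first function treats the whole file as the unit of failure, the
    second confines the error to the offending line, and the third
    confines it to the offending bytes. Which one is right depends on
    what the application does with the text afterwards.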

    --Ken


