Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun Mar 02 2003 - 10:21:21 EST

    I agree with Kent that simply removing ill-formed sequences is somewhat
    less robust, since it removes any indication that the data was corrupted.
    It is better either to signal an error or to insert some other indication,
    such as a REPLACEMENT CHARACTER or SUB, at that point. (And in my reading,
    C12a does allow that: you are not interpreting the sequence as a character;
    you are replacing a host of possible errant sequences with an error
    indicator.) But the final decision should be made by the user of the API,
    since the desired behavior may vary depending on the environment.
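
    To make the options concrete, here is a minimal sketch, assuming Python's
    built-in codec error handlers rather than any API discussed in this thread.
    The names decode_utf8 and sub_errors, and the "sub" handler name, are
    hypothetical and exist only for illustration. The caller picks the policy:
    signal an error, strip silently, insert U+FFFD, or insert SUB (U+001A).

```python
import codecs


def sub_errors(exc):
    """Hypothetical handler: replace each ill-formed byte run with SUB (U+001A)."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the substitute text and the position at which to resume decoding.
        return ("\x1a", exc.end)
    raise exc


# Register the handler under an illustrative name so decode() can look it up.
codecs.register_error("sub", sub_errors)


def decode_utf8(data: bytes, on_error: str = "strict") -> str:
    """Decode UTF-8, leaving the error policy to the caller.

    on_error: "strict"  -> raise UnicodeDecodeError (signal an error)
              "ignore"  -> silently drop ill-formed bytes (no trace of corruption)
              "replace" -> insert U+FFFD REPLACEMENT CHARACTER
              "sub"     -> insert U+001A SUBSTITUTE (handler defined above)
    """
    return data.decode("utf-8", errors=on_error)


bad = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

stripped = decode_utf8(bad, "ignore")
replaced = decode_utf8(bad, "replace")
substituted = decode_utf8(bad, "sub")

try:
    decode_utf8(bad)  # default policy: treat the sequence as an error
    raised = False
except UnicodeDecodeError:
    raised = True
```

    With "ignore" the result is indistinguishable from clean input, which is
    the loss of information described above; the other three policies all
    leave evidence that something was wrong.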

    Mark
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Kent Karlsson" <kentk@md.chalmers.se>
    To: "'Michael (michka) Kaplan'" <michka@trigeminal.com>
    Cc: "'Yung-Fong Tang'" <ftang@netscape.com>; <unicode@unicode.org>
    Sent: Sunday, March 02, 2003 02:00
    Subject: RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
    review)

    >
    >
    > Michael (michka) Kaplan:
    > ...
    > > then the conversion will simply strip the errant characters. Note that
    > > either solution meets the needs of refusal to interpret the errant
    > > sequences.
    >
    > Simply stripping the errant byte sequences means that they are
    > each interpreted as the empty string of characters. To me, that
    > contradicts:
    >
    > "C12a When a process interprets a code unit sequence which
    > purports to be in a Unicode character encoding form, it
    > shall treat ill-formed code unit sequences as an error
    > condition, and shall not interpret such sequences as
    > characters."
    >
    > On the other hand I think C12a is too harsh. It essentially
    > requires either an error stop, or at least division of the
    > input into a sequence of runs of text with possible error
    > byte (for UTF-8) sequences at the borders between the runs.
    > I think it would be ok to replace errant byte sequences with
    > characters that indicate that there may have been an error
    > (which excludes the empty string). SUBSTITUTE ("SUB is used
    > in the place of a character [sic] that has been found to be
    > invalid or in error, SUB is intended to be introduced by
    > automatic means") seems to fit that.
    >
    > (Ken's "Titan" discussion earlier is at a much lower "protocol
    > level"; byte string, or even bit string level).
    >
    > /kent k
    >