RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Kent Karlsson (kentk@md.chalmers.se)
Date: Sun Mar 02 2003 - 05:00:23 EST


    Michael (michka) Kaplan:
    ...
    > then the conversion will simply strip the errant characters. Note that
    > either solution meets the needs of refusal to interpret the errant
    > sequences.

    Simply stripping the errant byte sequences means that they are
    each interpreted as the empty string of characters. To me, that
    contradicts:

       "C12a When a process interprets a code unit sequence which
        purports to be in a Unicode character encoding form, it
        shall treat ill-formed code unit sequences as an error
        condition, and shall not interpret such sequences as
        characters."

    On the other hand, I think C12a is too harsh. It essentially
    requires either an error stop, or at least division of the
    input into a sequence of runs of text, with erroneous byte
    sequences (for UTF-8) possibly sitting at the borders between
    the runs. I think it would be ok to replace each errant byte
    sequence with a character that indicates that there may have
    been an error (which excludes the empty string). SUBSTITUTE
    ("SUB is used in the place of a character [sic] that has been
    found to be invalid or in error, SUB is intended to be
    introduced by automatic means") seems to fit that.
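
    To make the three behaviours concrete, here is a rough sketch in
    Python (not anything the standard prescribes). It decodes one
    ill-formed UTF-8 byte string by stripping, by stopping with an
    error, and by substituting SUB (U+001A). The handler name
    "sub_replace" and the function sub_handler are made up for
    this illustration; Python's own built-in "replace" handler
    inserts U+FFFD rather than SUB.

```python
import codecs

SUB = "\u001a"  # SUBSTITUTE control character (U+001A)

def sub_handler(err):
    """Hypothetical error handler: emit SUB for each ill-formed sequence."""
    if isinstance(err, UnicodeDecodeError):
        # Return the replacement text and the position at which to resume.
        return (SUB, err.end)
    raise err

# "sub_replace" is an illustrative name, not a built-in handler.
codecs.register_error("sub_replace", sub_handler)

data = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

# 1. Stripping: the errant byte is silently interpreted as the empty string.
stripped = data.decode("utf-8", errors="ignore")

# 2. Error stop: the default strict mode refuses to continue.
try:
    data.decode("utf-8")
    stopped = False
except UnicodeDecodeError:
    stopped = True

# 3. Substitution: the errant byte leaves a visible marker behind.
marked = data.decode("utf-8", errors="sub_replace")
```

    With stripping, the reader of the result cannot tell that
    anything was wrong with the input; with substitution, the SUB
    in the output records where the error was.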

    (Ken's "Titan" discussion earlier is at a much lower "protocol
    level": the byte string, or even bit string, level.)

                    /kent k
