Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Mon Mar 03 2003 - 12:09:00 EST

    Asmus has good points about the restartability, both that it gives the API
    user the maximal flexibility, and that many times the users don't want to
    futz with such options, and just want the text converted.

    To provide maximal flexibility, an API will let the caller choose how to
    handle illegal sequences: (1) deleting them, (2) substituting (a
    replacement character, an escape such as "&#xABC;", or some other
    option), or (3) stopping with information: the reason for the error, the
    end position of the last successfully converted sequence, and the end
    position of the bad sequence. Users may also want to distinguish between
    illegal sequences and missing characters in applying these options; that
    is, they may want to silently delete illegal sequences, but substitute a
    replacement character for missing characters.
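
    As a rough illustration of the three choices for illegal sequences, here
    is a minimal sketch in Python. The names OnIllegal, ConvertResult and
    convert_utf8 are hypothetical, made up for this example; the only library
    behavior relied on is that bytes.decode raises UnicodeDecodeError with
    start, end and reason attributes. It covers only illegal sequences, not
    missing characters.

```python
from dataclasses import dataclass
from enum import Enum


class OnIllegal(Enum):
    """Hypothetical policy for ill-formed input."""
    DELETE = 1      # silently drop the bad sequence
    SUBSTITUTE = 2  # emit a replacement string in its place
    STOP = 3        # stop and report where and why


@dataclass
class ConvertResult:
    text: str            # everything converted so far
    ok: bool             # False if conversion stopped on an error
    reason: str = ""     # why the sequence was rejected
    good_end: int = -1   # end of the last successfully converted sequence
    bad_end: int = -1    # end of the bad sequence (a restart point)


def convert_utf8(data: bytes, on_illegal: OnIllegal, start: int = 0,
                 substitute: str = "\ufffd") -> ConvertResult:
    out = []
    pos = start
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            pos = len(data)
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            good_end = pos + err.start
            bad_end = pos + err.end
            # The prefix before the first error is well-formed.
            out.append(data[pos:good_end].decode("utf-8"))
            if on_illegal is OnIllegal.STOP:
                return ConvertResult("".join(out), False, err.reason,
                                     good_end, bad_end)
            if on_illegal is OnIllegal.SUBSTITUTE:
                out.append(substitute)
            pos = bad_end  # skip the bad sequence and keep going
    return ConvertResult("".join(out), True)


if __name__ == "__main__":
    data = b"ab\xffcd"
    stopped = convert_utf8(data, OnIllegal.STOP)
    print(stopped)
    # The caller decides what to do, then restarts after the bad sequence.
    print(convert_utf8(data, OnIllegal.STOP, start=stopped.bad_end))
```

    With the STOP policy the caller gets back both positions and can restart
    at bad_end, so any other recovery strategy can be layered on top; the
    DELETE and SUBSTITUTE policies serve the callers who just want the text
    converted.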

    Mark
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Asmus Freytag" <asmusf@ix.netcom.com>
    To: "Mark Davis" <mark.davis@jtcsv.com>; "Kent Karlsson"
    <kentk@md.chalmers.se>; "'Michael (michka) Kaplan'" <michka@trigeminal.com>
    Cc: "'Yung-Fong Tang'" <ftang@netscape.com>; <unicode@unicode.org>
    Sent: Sunday, March 02, 2003 21:10
    Subject: Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
    review)

    > At 07:21 AM 3/2/03 -0800, Mark Davis wrote:
    > > > "C12a When a process interprets a code unit sequence which
    > > > purports to be in a Unicode character encoding form, it
    > > > shall treat ill-formed code unit sequences as an error
    > > > condition, and shall not interpret such sequences as
    > > > characters."
    >
    > Can we agree or disagree on whether an API that returns an error code, but
    > also an output buffer that contains a simplistic conversion of the
    > erroneous sequence, is or is not conformant?
    >
    > To me it seems that by setting an error flag in the return code, the API
    > has signalled that the user should not treat the output as containing
    > correct Unicode.
    >
    > Such an API design (on a low enough level) might strike the right balance
    > between usability in many different environments and satisfying the
    > formal requirement.
    >
    > The ideal case is one where the converter stops in a restartable
    > configuration, allowing the client to implement (or ask for) a variety of
    > error-recovery options. However, such an interface requires a lot of
    > thought and may be difficult to implement for some
    > language/platform/library environments. Further, it may be unnecessarily
    > difficult to use for at least some conceivable clients.
    >
    > A./
    >
    >



    This archive was generated by hypermail 2.1.5 : Mon Mar 03 2003 - 12:48:20 EST