Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun Mar 02 2003 - 10:21:21 EST

Next message: Roozbeh Pournader: "Re: Impossible combinations?"

Previous message: Michael Everson: "Re: Please see my latest proposal"
In reply to: Kent Karlsson: "RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)"
Next in thread: Michael (michka) Kaplan: "Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)"
Reply: Michael (michka) Kaplan: "Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)"
Reply: Asmus Freytag: "Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)"

I agree with Kent that it is somewhat less robust to simply remove
ill-formed sequences, since doing so removes any indication that the data
was corrupted. It is better either to signal an error or to insert some
other indication, such as a REPLACEMENT CHARACTER or SUB, at that point.
(And in my reading, C12a does allow that; you are not interpreting the
sequence as a character, you are replacing a host of possible errant
sequences by an error indicator.) But the final decision should be made by
the user of the API, since the desired behavior may vary depending on the
environment.
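
A minimal sketch of leaving that choice to the caller, assuming Python's
built-in codec error handlers. The function name decode_utf8, the policy
names, and the "sub" handler are hypothetical and introduced only for
illustration; "strict", "replace", and "ignore" are Python's standard
handler names.

```python
import codecs

SUB = "\x1a"  # U+001A SUBSTITUTE


def _sub_handler(exc):
    """Hypothetical handler: emit SUB for each ill-formed sequence."""
    if isinstance(exc, UnicodeDecodeError):
        # Resume decoding right after the offending bytes.
        return (SUB, exc.end)
    raise exc


# Register the handler under an illustrative name.
codecs.register_error("sub", _sub_handler)

# Caller-facing policy names (assumed) mapped to codec error handlers.
_POLICIES = {
    "error": "strict",     # signal an error (raises UnicodeDecodeError)
    "replace": "replace",  # insert U+FFFD REPLACEMENT CHARACTER
    "sub": "sub",          # insert U+001A SUBSTITUTE
    "strip": "ignore",     # silently drop (the less robust option)
}


def decode_utf8(data: bytes, policy: str = "replace") -> str:
    """Decode UTF-8, handling ill-formed input as the caller requests."""
    if policy not in _POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    return data.decode("utf-8", errors=_POLICIES[policy])
```

With the input b"ab\xffcd", "replace" and "sub" both leave a visible marker
where the bad byte was, "strip" returns "abcd" with no trace of the
corruption, and "error" raises UnicodeDecodeError.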

Mark
________
mark.davis@jtcsv.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "Kent Karlsson" <kentk@md.chalmers.se>
To: "'Michael (michka) Kaplan'" <michka@trigeminal.com>
Cc: "'Yung-Fong Tang'" <ftang@netscape.com>; <unicode@unicode.org>
Sent: Sunday, March 02, 2003 02:00
Subject: RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)

>
>
> Michael (michka) Kaplan:
> ...
> > then the conversion will simply strip the errant characters. Note that
> > either solution meets the needs of refusal to interpret the errant
> > sequences.
>
> Simply stripping the errant byte sequences means that they are
> each interpreted as the empty string of characters. To me, that
> contradicts:
>
> "C12a When a process interprets a code unit sequence which
> purports to be in a Unicode character encoding form, it
> shall treat ill-formed code unit sequences as an error
> condition, and shall not interpret such sequences as
> characters."
>
> On the other hand I think C12a is too harsh. It essentially
> requires either an error stop, or at least division of the
> input into a sequence of runs of text with possible error
> byte (for UTF-8) sequences at the borders between the runs.
> I think it would be ok to replace errant byte sequences with
> characters that indicate that there may have been an error
> (which excludes the empty string). SUBSTITUTE ("SUB is used
> in the place of a character [sic] that has been found to be
> invalid or in error, SUB is intended to be introduced by
> automatic means") seems to fit that.
>
> (Ken's "Titan" discussion earlier is at a much lower "protocol
> level"; byte string, or even bit string level).
>
> /kent k
>
>
>

This archive was generated by hypermail 2.1.5 : Sun Mar 02 2003 - 11:58:35 EST