From: Kent Karlsson (email@example.com)
Date: Sun Mar 02 2003 - 05:00:23 EST
Michael (michka) Kaplan:
> then the conversion will simply strip the errant characters. Note that
> either solution meets the needs of refusal to interpret the errant
Simply stripping the errant byte sequences means that they are
each interpreted as the empty string of characters. To me, that
"C12a When a process interprets a code unit sequence which
purports to be in a Unicode character encoding form, it
shall treat ill-formed code unit sequences as an error
condition, and shall not interpret such sequences as
On the other hand I think C12a is too harsh. It essentially
requires either an error stop, or at least division of the
input into a sequence of runs of text with possible error
byte (for UTF-8) sequences at the borders between the runs.
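To make the "runs of text with error byte sequences at the borders" reading concrete, here is a minimal sketch in Python. The function name split_runs and the ("text", ...) / ("error", ...) tuples are made up for illustration; how many bytes count as one errant sequence is whatever Python's UTF-8 decoder reports, not something C12a specifies.

```python
def split_runs(data: bytes):
    """Split data into well-formed UTF-8 runs and errant byte sequences.

    Returns a list of ("text", str) and ("error", bytes) tuples
    (a hypothetical representation chosen for this sketch).
    """
    runs = []
    pos = 0
    while pos < len(data):
        chunk = data[pos:]
        try:
            # The rest of the input is well-formed: one final text run.
            runs.append(("text", chunk.decode("utf-8")))
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are offsets into chunk, not into data.
            if err.start > 0:
                runs.append(("text", chunk[:err.start].decode("utf-8")))
            runs.append(("error", chunk[err.start:err.end]))
            pos += err.end
    return runs
```

For the input b"ab\xffcd" this yields a text run "ab", an error run holding the single byte 0xFF, and a text run "cd".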
I think it would be ok to replace each errant byte sequence with
characters that indicate that there may have been an error
(which excludes the empty string). SUBSTITUTE ("SUB is used
in the place of a character [sic] that has been found to be
invalid or in error, SUB is intended to be introduced by
automatic means") seems to fit that.
(Ken's "Titan" discussion earlier is at a much lower "protocol
level"; byte string, or even bit string level).
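As an illustration of the three behaviours discussed above (error stop, stripping, and substitution with SUB), here is a rough sketch using Python's codec error handlers. The handler name "sub_replace" is invented for this example; Python's built-in "replace" handler inserts U+FFFD rather than U+001A SUBSTITUTE, so a custom handler is registered to get SUB.

```python
import codecs

SUB = "\x1a"  # U+001A SUBSTITUTE

def sub_handler(err):
    # Replace each errant byte sequence reported by the decoder with SUB
    # and resume decoding right after it.
    if isinstance(err, UnicodeDecodeError):
        return (SUB, err.end)
    raise err

# "sub_replace" is a hypothetical name chosen for this sketch.
codecs.register_error("sub_replace", sub_handler)

data = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8

# Stripping: the errant byte is in effect read as the empty string.
stripped = data.decode("utf-8", errors="ignore")

# Substitution: the errant byte leaves a visible marker in the output.
marked = data.decode("utf-8", errors="sub_replace")

# Error stop: decoding fails outright.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

With stripping, the result is indistinguishable from decoding b"abcd"; with substitution, the receiver can still tell that something was wrong at that position.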