L2/08-150

From: Mark Davis
Date: Fri, Apr 11, 2008
Subject: Recommendations for handling ill-formed sequences


In converting or validating Unicode, there is no requirement that an ill-formed sequence be replaced by U+FFFD characters; an application can, for example, throw an exception instead. However, when replacement is done, we should at least indicate what the recommended practice is, so that people can require conformance to that practice for interoperability. (Following the proposal is the email trail that sparked it.)

Here is a proposal for adding such a recommendation to a future version of the standard, and to an FAQ in the meantime. (The wording is a draft and would be refined by the editorial committee.)

 
When replacing an ill-formed sequence by one or more U+FFFD characters, the recommended practice is to progress through the sequence, handling each byte as follows:
 
  • If the byte cannot start a minimal well-formed code unit subsequence (D85a), skip that byte and emit one U+FFFD character.
  • Otherwise, find the longest sequence of bytes that forms the start of some minimal well-formed code unit subsequence (D85a), then skip those bytes and emit one U+FFFD character. (A code sketch of this procedure follows the table below.)
For example, in UTF-8 each of the following ill-formed subsequences would be replaced by a single U+FFFD, given the following byte shown. The ! means that the next byte is missing (end of the byte sequence) or not within the given range. Typically this is !80..BF; exceptions are marked with an asterisk in the table below.

Sequences to be replaced by U+FFFD      If followed by

80..C1                                  !00..FF *
C2..DF                                  !80..BF
E0                                      !A0..BF *
E1..EC                                  !80..BF
ED                                      !80..9F *
EE..EF                                  !80..BF
F0                                      !90..BF *
F1..F3                                  !80..BF
F4                                      !80..8F *
F5..FF                                  !00..FF *

E0       A0..BF                         !80..BF
E1..EC   80..BF                         !80..BF
ED       80..9F                         !80..BF
EE..EF   80..BF                         !80..BF
F0       90..BF                         !80..BF
F1..F3   80..BF                         !80..BF
F4       80..8F                         !80..BF

F0       90..BF   80..BF                !80..BF
F1..F3   80..BF   80..BF                !80..BF
F4       80..8F   80..BF                !80..BF
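
To make the recommendation concrete, here is a minimal sketch in Python of the procedure described above. The names decode_with_replacement and _trail_range are illustrative only, not part of the proposal; the code follows the two rules and the table directly.

    REPLACEMENT = "\uFFFD"

    def _trail_range(lead):
        # For a lead byte, return (number of trailing bytes, lo, hi), where
        # lo..hi is the allowed range of the FIRST trailing byte (later
        # trailing bytes are always 80..BF), or None if the byte cannot
        # start a minimal well-formed code unit subsequence.
        if 0xC2 <= lead <= 0xDF: return (1, 0x80, 0xBF)
        if lead == 0xE0:         return (2, 0xA0, 0xBF)
        if 0xE1 <= lead <= 0xEC: return (2, 0x80, 0xBF)
        if lead == 0xED:         return (2, 0x80, 0x9F)
        if 0xEE <= lead <= 0xEF: return (2, 0x80, 0xBF)
        if lead == 0xF0:         return (3, 0x90, 0xBF)
        if 0xF1 <= lead <= 0xF3: return (3, 0x80, 0xBF)
        if lead == 0xF4:         return (3, 0x80, 0x8F)
        return None              # 80..C1 and F5..FF

    def decode_with_replacement(data: bytes) -> str:
        out = []
        i = 0
        while i < len(data):
            lead = data[i]
            if lead <= 0x7F:                 # ASCII: always well-formed
                out.append(chr(lead))
                i += 1
                continue
            spec = _trail_range(lead)
            if spec is None:                 # rule 1: skip byte, emit U+FFFD
                out.append(REPLACEMENT)
                i += 1
                continue
            count, lo, hi = spec
            j = i + 1                        # rule 2: take the longest prefix
            for k in range(count):
                if j >= len(data):
                    break
                b = data[j]
                if not ((lo <= b <= hi) if k == 0 else (0x80 <= b <= 0xBF)):
                    break
                j += 1
            if j - i == count + 1:           # complete well-formed sequence
                out.append(data[i:j].decode("utf-8"))
            else:                            # incomplete: one U+FFFD for all
                out.append(REPLACEMENT)
            i = j
        return "".join(out)

For example, decode_with_replacement(b"\x61\xF1\x80\x80\xE1\x80\xC2\x62") yields "a\uFFFD\uFFFD\uFFFDb": the truncated sequences F1 80 80, E1 80, and C2 each become a single U+FFFD, matching the table above.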


Comment on the above from Øistein E. Andersen:


 
... your proposal appears to be similar to what browsers have already implemented as well as to Markus Kuhn's notion of `malformed sequences' described in <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>. One notable difference is that overlong sequences as well as UTF-8 sequences representing surrogates and characters outside Unicode (>10FFFF) will typically map to several replacement characters according to your proposal, but to only one in Markus Kuhn's system.  This difference may not be a problem in practice and your proposal may well be superior, but it might nevertheless be worthwhile to consider what current implementations do (Safari is quite close to what Markus Kuhn suggests, and I believe I have seen browsers do what your proposal suggests for the range >10FFFF) as well as what seems reasonable and not too cumbersome to specify.  The comments in this paragraph may also be forwarded as you find appropriate.
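
To illustrate this difference with the sketch above: the overlong sequence C0 80 (an ill-formed two-byte encoding of U+0000) maps to two U+FFFD characters under this proposal, because C0 lies in 80..C1 and can never start a well-formed sequence, and the stray trailing 80 likewise; a scheme like Markus Kuhn's, which treats the whole malformed sequence as one unit, would emit just one.

    # Two U+FFFDs under the proposal's byte-by-byte rules:
    assert decode_with_replacement(b"\xC0\x80") == "\uFFFD\uFFFD"
    # CPython's built-in decoder behaves the same way on this input:
    assert b"\xC0\x80".decode("utf-8", errors="replace") == "\uFFFD\uFFFD"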

 
===================
 

-----Original Message-----
Date/Time:    Fri Apr 11 12:29:13 CDT 2008
Contact:      <html5@xn--istein-9xa.com>
Name:         Andersen
Report Type:  Other Question, Problem, or Feedback
Opt Subject:  Error handling for UTF-8

Dear Sir or Madam,

The editor of HTML5, Ian Hickson, has expressed that he would like Unicode to define error handling for UTF-8 in more detail; more specifically, he would like any byte stream labelled as UTF-8 to map unambiguously to a sequence of Unicode characters (assuming that erroneous byte sequences are handled by insertion of U+FFFD characters).
This is not currently (per Unicode 5.1) the case, since the number of U+FFFD characters to emit is left undefined.

The following quotes are from some of the e-mails sent to public-html@w3.org concerning this issue.

Ian Hickson:
       [The Unicode standards] should define error handling, and are defective if they don't.
       ---<http://lists.w3.org/Archives/Public/public-html/2008Feb/0408.html>

Ian Hickson:
       The point is that Unicode _doesn't_ define exactly how many bytes form one
       ill-formed sequence. Unicode doesn't define the error handling in enough
       detail to get interoperable handling of arbitrary non-conforming byte
       streams.
       ---<http://lists.w3.org/Archives/Public/public-html/2008Feb/0437.html>

[My comment: Unicode 5.1 _does_ define the concept of an ill-formed sequence, but this does not completely solve the issue given that the number of replacement characters to emit remains undefined.]

Anne van Kesteren:
       I agree that it would be ideal if for input 'charset' and 'byte stream',
       output 'character stream' is always identical regardless of what
       implementation you pick, but the [Unicode] specification does not seem
       to be developed with that in mind.
       ---<http://lists.w3.org/Archives/Public/public-html/2008Apr/0191.html>

Thanks in advance for considering this.  Retroactively modifying conformance criteria may not be an attractive option, but a clear suggestion for new implementations to follow would also be useful.

Yours faithfully,
Øistein E. Andersen