RE: Encoding Standard (mostly complete) from Doug Ewell on 2012-04-19 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Thu, 19 Apr 2012 10:58:57 -0700

Copying the Unicode mailing list.

Masatoshi Kimura <VYV03354 at nifty dot ne dot jp> wrote:

> (2012/04/19 9:33), Doug Ewell wrote:
>> Given the sequence F8 80 80 80 80, the Unicode Standard specifies
>> that a decoder should recognize F8 as an invalid UTF-8 code unit, do
>> whatever it does on an error condition, and then continue with the
>> next byte. This will generate 5 error conditions if handling of
>> errors includes trying to continue.
>
> Where does TUS define this? It seems to contradict TUS 6.1.0 p.96:
> http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf#page=42
> |Although a UTF-8 conversion process is required to never consume
> |well-formed subsequences as part of its error handling for ill-formed
> |subsequences, such a process is not otherwise constrained in how it
> |deals with any ill-formed subsequence itself. An ill-formed
> |subsequence consisting of more than one code unit could be treated as
> |a single error or as multiple errors. For example, in processing the
> |UTF-8 code unit sequence <F0 80 80 41>, the only formal requirement
> |mandated by Unicode conformance for a converter is that the <41> be
> |processed and correctly interpreted as <U+0041>. The converter could
> |return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or
> |<U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as
> |a separate error, or could take other approaches to signalling <F0 80
> |80> as an ill-formed code unit subsequence.

I remembered reading a statement from UTC that interpretation of an ill-
formed sequence was supposed to terminate as soon as the sequence was
determined to be ill-formed. Conformance clause C10 does say:

> For example, in UTF-8 every byte of the form 110xxxxx₂ must be
> followed with a byte of the form 10xxxxxx₂. A sequence such as
> <110xxxxx₂ 0xxxxxxx₂> is illegal, and must never be generated. When
> faced with this illegal byte sequence while transforming or
> interpreting, a UTF-8 conformant process must treat the first byte
> 110xxxxx₂ as an illegal termination error: for example, either
> signaling an error, filtering the byte out, or representing the byte
> with a marker such as FFFD (REPLACEMENT CHARACTER). In the latter two
> cases, it will continue processing at the second byte 0xxxxxxx₂.

A lead byte of 11111000₂ is ill-formed.
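The three responses that passage lists (signal an error, filter the byte out, or mark it with U+FFFD) can be illustrated roughly as follows. This is only a sketch using Python's built-in UTF-8 codec and its standard "strict", "ignore" and "replace" error handlers; the helper name decode_with_policy and the sample bytes C3 41 are my own choices for illustration, not anything taken from the Standard.

```python
# <C3 41> is a concrete instance of <110xxxxx 0xxxxxxx>:
# a two-byte lead followed by a byte that is not a trail byte.
ILL_FORMED = bytes([0xC3, 0x41])


def decode_with_policy(data, policy):
    """Decode UTF-8 using one of Python's standard error handlers.

    policy is 'strict' (signal an error), 'ignore' (filter the byte out)
    or 'replace' (represent the byte with U+FFFD).
    """
    return data.decode("utf-8", errors=policy)


# Option 1: signal an error.
try:
    decode_with_policy(ILL_FORMED, "strict")
    signaled = False
except UnicodeDecodeError:
    signaled = True

# Option 2: filter the lead byte out, then continue at the second byte.
filtered = decode_with_policy(ILL_FORMED, "ignore")

# Option 3: mark the lead byte with U+FFFD, then continue at the second byte.
marked = decode_with_policy(ILL_FORMED, "replace")
```

In the latter two cases the byte 41 is still processed and comes out as U+0041, which is the part of the behavior that is actually required.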

And in fact, the section of TUS that Masatoshi quoted goes on to say:

> Using the definition for maximal subpart, the best practice can be
> stated simply as:
>
> Whenever an unconvertible offset is reached during conversion of a
> code unit sequence:
>
> 1. The maximal subpart at that offset should be replaced by a single
> U+FFFD.
>
> 2. The conversion should proceed at the offset immediately after the
> maximal subpart.

However, this description does use the word "should," not "must," and it
goes on (on the same page) to offer a table with three "possible
alternative approaches" for mapping an ill-formed UTF-8 sequence into
characters. It recommends the method described above, but allows the
other two.
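For what it is worth, the recommended practice (one U+FFFD per maximal subpart, then resume immediately after it) could be sketched along these lines. This is a rough illustration and not a reference implementation: the function names are hypothetical, and the byte ranges are my own transcription of the well-formed UTF-8 byte sequence table in Chapter 3, so they should be checked against the Standard before being relied upon.

```python
REPLACEMENT = "\ufffd"


def _lead_info(lead):
    """Return (trail_count, lo, hi) for a multi-byte lead, else None.

    lo..hi is the permitted range of the FIRST trail byte; later trail
    bytes are always 80..BF. Ranges transcribed by hand (an assumption).
    """
    if 0xC2 <= lead <= 0xDF:
        return 1, 0x80, 0xBF
    if lead == 0xE0:
        return 2, 0xA0, 0xBF   # excludes overlong three-byte forms
    if lead == 0xED:
        return 2, 0x80, 0x9F   # excludes surrogates
    if 0xE1 <= lead <= 0xEF:
        return 2, 0x80, 0xBF
    if lead == 0xF0:
        return 3, 0x90, 0xBF   # excludes overlong four-byte forms
    if lead == 0xF4:
        return 3, 0x80, 0x8F   # excludes values above U+10FFFF
    if 0xF1 <= lead <= 0xF3:
        return 3, 0x80, 0xBF
    return None                # 80..C1 and F5..FF never start a sequence


def decode_maximal_subpart(data):
    """Decode bytes as UTF-8, emitting one U+FFFD per maximal subpart."""
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                    # ASCII
            out.append(chr(lead))
            i += 1
            continue
        info = _lead_info(lead)
        if info is None:                   # lone trail byte or invalid lead
            out.append(REPLACEMENT)
            i += 1
            continue
        trail_count, lo, hi = info
        cp = lead & (0x3F >> trail_count)  # payload bits of the lead byte
        j = i + 1
        ok = True
        for k in range(trail_count):
            low, high = (lo, hi) if k == 0 else (0x80, 0xBF)
            if j >= n or not (low <= data[j] <= high):
                ok = False                 # data[i:j] is the maximal subpart
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j                              # resume right after the subpart
    return "".join(out)
```

With this sketch, <F0 80 80 41> comes out as <U+FFFD, U+FFFD, U+FFFD, U+0041>, because 80 is not a permitted second byte after F0 and so each of the three bytes is its own maximal subpart. Likewise <F8 80 80 80 80> yields five U+FFFD, one per byte.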

So the bottom line is that Masatoshi is right: the Unicode Standard does
not specify that a decoder *must* respond to an invalid lead byte as I
said, only that it *should*. I agree that this is unnecessarily vague.

Whether this calls for a complete recasting of the definition of UTF-8
by WHATWG, or by any individual contributors therein, is of course a
different matter.

> It is exactly the purpose of the Encoding Standard to avoid this kind
> of vagueness.

Again, I'm not sure whether it is within the authority or responsibility
of WHATWG or any individual to provide a "better" definition of a
Unicode encoding form than that provided by Unicode. I do understand the
desire to nail down the various legacy encodings, such as Shift-JIS,
that have been interpreted over the years in very flexible and confusing
ways. I don't think UTF-8 falls into this category at all.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell

Received on Thu Apr 19 2012 - 13:01:48 CDT
