L2/07-351


Title: Modification of Text Regarding Ill-formed Code Unit Sequences

Source: Ken Whistler and Mark Davis

Date: October 11, 2007

Action: For consideration by the UTC


This proposal follows up on our action item 111-A012:
to draft language specifying the interpretation rules
for ill-formed UTF sequences.

After considering various textual approaches, we have
consensus that the following draft text can provide the
clarification required, particularly regarding the behavior
of UTF-8 conversion processes dealing with ill-formed
sequences, while keeping the modification of existing
text to a minimum. In particular, this proposed text
leaves all *existing* conformance clauses and definitions
unchanged, but adds two new definitions and a fair amount
of clarifying text making use of those new definitions.

We propose that the UTC approve this draft specifically
for addition to Unicode 5.1.

The draft text below includes editorial comments. We anticipate
that after undergoing editing for inclusion in the actual
documentation page for Unicode 5.1, the text change will
simply be written in terms of replacing X..Y paragraphs on
pp. 100 and 101 with the completely written out and
formatted replacement text.

The Unicode 5.1 documentation would also include a short
paragraph explaining why the UTC has amended and extended
this text -- presuming that the text change is approved.

The following is the detailed draft for the text change:


========================= draft text =======================

[[ Modify the existing text on p. 100, in Chapter 3, from
D84 through D86, as follows: ]]

D84 <it>Ill-formed:</it> A Unicode code unit sequence that
purports to be in a Unicode encoding form is called
<it>ill-formed</it> if and only if it does <it>not</it>
follow the specification of that Unicode encoding form.

[[ Note: D84 is unchanged from the existing text ]]

[[ Keep the two existing bullets as is. ]]

D84a <it>Ill-formed code unit subsequence:</it>
A non-empty subsequence of a Unicode code unit sequence X
which does not contain any code units which also belong
to any well-formed subsequence of X.

* In other words, an ill-formed code unit subsequence
cannot overlap with a well-formed subsequence.

D85 <it>Well-formed:</it> A Unicode code unit sequence
that purports to be in a Unicode encoding form is called
<it>well-formed</it> if and only if it <it>does</it>
follow the specification of that Unicode encoding form.

[[ Note: D85 is also unchanged from the existing text ]]

D85a <it>Well-formed code unit subsequence:</it>
A non-empty, well-formed subsequence of a well-formed Unicode code
unit sequence.

[[ Replace the existing bullet for D85 with the following
text, unbulleted: ]]

Any Unicode code unit sequence can be partitioned
into subsequences that are either well-formed or
ill-formed. The sequence as a whole is well-formed if and
only if it contains no ill-formed subsequence.
The sequence as a whole is ill-formed if and only if
it contains at least one ill-formed subsequence.

D86 <it>Well-formed UTF-8 code unit sequence:</it> A
well-formed Unicode code unit sequence of UTF-8 code units.

[[ Note: D86 is unchanged from the existing text ]]

[[ Add the specific examples of well-formed and ill-formed
UTF-8 here, as follows ]]

* The UTF-8 code unit sequence <41 C3 B1 42> is well-formed,
because it can be partitioned into subsequences, all of
which match the specification for UTF-8 in Table 3-7. It
consists of the following well-formed subsequences:
<41>, <C3 B1>, and <42>.

* The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed,
because it contains one ill-formed subsequence. There is
no subsequence for the C2 byte which matches the specification for
UTF-8 in Table 3-7. The code unit sequence is partitioned
into one well-formed code unit subsequence, <41>, followed by
one ill-formed code unit subsequence, <C2>, followed by
two well-formed code unit subsequences, <C3 B1> and <42>.

* In isolation, the UTF-8 code unit sequence <C2 C3> would
be ill-formed, but in the context of the UTF-8 code
unit sequence <41 C2 C3 B1 42>, <C2 C3> does not constitute
an ill-formed code unit subsequence, because the C3 byte is
actually the first byte of the well-formed UTF-8 code
unit subsequence <C3 B1>. Ill-formed code unit subsequences
do not overlap with well-formed code unit subsequences.
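
[[ Editorial illustration only, not proposed for inclusion in the
standard. The partitioning used in the examples above can be
sketched in Python roughly as follows. The names TABLE_3_7,
match_length, and partition are hypothetical; the byte ranges are
a transcription of Table 3-7; and merging adjacent ill-formed bytes
into a single run is an assumption of this sketch, not a
requirement of the text. ]]

```python
# Byte ranges for well-formed UTF-8, transcribed from Table 3-7.
# Each row lists the allowed range for the first byte, then for
# each trailing byte. (Hypothetical name; not part of any library.)
TABLE_3_7 = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],
    [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],
]


def match_length(data: bytes, i: int) -> int:
    """Length of the well-formed subsequence starting at i, or 0 if none."""
    for row in TABLE_3_7:
        end = i + len(row)
        if end <= len(data) and all(
            lo <= b <= hi for b, (lo, hi) in zip(data[i:end], row)
        ):
            return len(row)
    return 0


def partition(data: bytes) -> list:
    """Split data into ordered (is_well_formed, bytes) pieces."""
    pieces = []
    i = 0
    while i < len(data):
        n = match_length(data, i)
        if n:
            pieces.append((True, data[i:i + n]))
            i += n
        else:
            # No row of the table matches here: this byte is ill-formed.
            # Assumption: adjacent ill-formed bytes are merged into one run.
            if pieces and not pieces[-1][0]:
                pieces[-1] = (False, pieces[-1][1] + data[i:i + 1])
            else:
                pieces.append((False, data[i:i + 1]))
            i += 1
    return pieces
```

[[ For the input <41 C2 C3 B1 42>, this sketch yields the pieces
<41>, <C2>, <C3 B1>, and <42>, with only <C2> marked ill-formed. ]]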

[[ Existing text continues unchanged from this point. ]]

===========================================================

[[ Replace the existing paragraph on p. 101 just above
Table 3-4 with the following text: ]]

If a Unicode string <it>purports</it> to be <it>in</it> a
Unicode encoding form, then it must not contain any ill-formed
code unit subsequence.

If a process which verifies that a Unicode string is in a
Unicode encoding form encounters an ill-formed code unit
subsequence in that string, then it must not identify that
string as being in that Unicode encoding form.

[[ Those two paragraphs are only minor modifications of the
existing text, to make use of the ill-formed code
unit subsequence definition. ]]
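
[[ Editorial illustration only, not proposed for inclusion in the
standard. A verifying process of the kind described above might
be sketched in Python as follows. The function name is_utf8 is
hypothetical, and the sketch assumes that Python 3's strict
"utf-8" codec rejects every ill-formed code unit subsequence. ]]

```python
def is_utf8(data: bytes) -> bool:
    """Report whether data may be identified as being in UTF-8.

    Assumption: the strict "utf-8" codec raises UnicodeDecodeError
    on the first ill-formed code unit subsequence it encounters.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # At least one ill-formed subsequence: must not identify as UTF-8.
        return False
    return True
```

[[ Under that assumption, is_utf8 accepts <41 C3 B1 42> and
rejects <41 C2 C3 B1 42>. ]]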

A process which interprets a Unicode string must not
interpret any ill-formed code unit subsequences in the
string as characters. (See conformance clause C10.)
Furthermore, such a process must not treat any adjacent
well-formed code unit subsequences as being part of
those ill-formed code unit subsequences.

The most important consequence of this requirement on processes
is illustrated by UTF-8 conversion processes, which
interpret UTF-8 code unit sequences as Unicode character
sequences. Suppose that a UTF-8 converter is iterating
through an input UTF-8 code unit sequence. If the converter
encounters an ill-formed UTF-8 code unit sequence which
starts with a valid first byte, but which does not continue
with valid successor bytes (see Table 3-7), it <it>must not</it>
consume the successor bytes as part of the ill-formed
subsequence whenever those successor bytes themselves
constitute part of a well-formed UTF-8 code unit subsequence.

If an implementation of a UTF-8 conversion process stops at
the first error encountered, without reporting the end of
any ill-formed UTF-8 code unit subsequence, then the
requirement makes little practical difference. However,
the requirement does introduce a significant constraint
if the UTF-8 converter continues past the point of a
detected error, perhaps by substituting one or more U+FFFD
replacement characters for the uninterpretable, ill-formed
UTF-8 code unit subsequence. For example, with the input
UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion
process must not return <U+FFFD> or <U+FFFD, U+0042>, since
either of those outputs would be the result of
misinterpreting a well-formed subsequence as being part of
the ill-formed subsequence. What is expected is that such
a process should return <U+FFFD, U+0041, U+0042>.
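
[[ Editorial illustration only, not proposed for inclusion in the
standard. The Python sketch below contrasts a deliberately
non-conformant converter with one that behaves as expected for
<C2 41 42>. The function naive_convert is hypothetical and exists
only to show the failure; the second call relies on Python 3's
built-in "replace" error handler. ]]

```python
def naive_convert(data: bytes) -> str:
    """NON-CONFORMANT sketch: trusts the length implied by the first byte."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        # Guess the sequence length from the first byte alone.
        length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode("utf-8", errors="strict"))
        except UnicodeDecodeError:
            # Bug: the whole chunk is discarded, including any
            # well-formed bytes (here 0x41) that followed the bad one.
            out.append("\uFFFD")
        i += length
    return "".join(out)


data = bytes([0xC2, 0x41, 0x42])

bad = naive_convert(data)                      # swallows 0x41
good = data.decode("utf-8", errors="replace")  # keeps 0x41 and 0x42
```

[[ Here bad is <U+FFFD, U+0042>, one of the outputs ruled out
above, while good is <U+FFFD, U+0041, U+0042>. ]]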

A UTF-8 conversion process that consumes valid successor
bytes is not only non-conformant, but is also left open to
security exploits. See UTR #36, Unicode Security
Considerations.

Although a UTF-8 conversion process is required to never
consume well-formed subsequences as part of its error
handling for ill-formed subsequences, such a process is
not otherwise constrained in how it deals with any ill-formed
subsequence itself. An ill-formed subsequence consisting of
more than one code unit could be treated as a single
error or as multiple errors. For example, in processing
the UTF-8 code unit sequence <F0 80 80 41>, the only
requirement on a converter is that the <41> be processed
and correctly interpreted as <U+0041>. The converter could
return <U+FFFD, U+0041>, handling <F0 80 80> as a single
error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each
byte of <F0 80 80> as a separate error, or could take
other approaches to signalling <F0 80 80> as an
ill-formed code unit subsequence.
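
[[ Editorial illustration only, not proposed for inclusion in the
standard. The two error-reporting policies mentioned above can be
sketched in Python as a single converter with a switch. The names
well_formed_length, convert, and per_byte are hypothetical, and the
sketch assumes that Python 3's strict "utf-8" codec accepts exactly
the well-formed sequences of Table 3-7. ]]

```python
REPLACEMENT = "\uFFFD"


def well_formed_length(data: bytes, i: int) -> int:
    """Length of the well-formed subsequence starting at i, or 0 if none."""
    for n in (1, 2, 3, 4):  # UTF-8 uses one to four code units
        try:
            data[i:i + n].decode("utf-8", errors="strict")
            return n
        except UnicodeDecodeError:
            continue
    return 0


def convert(data: bytes, per_byte: bool) -> str:
    """Decode data, substituting U+FFFD for ill-formed subsequences.

    per_byte=True  reports each ill-formed byte as a separate error.
    per_byte=False reports each run of ill-formed bytes as one error.
    Either way, well-formed subsequences are never consumed as errors.
    """
    out = []
    i = 0
    in_error = False
    while i < len(data):
        n = well_formed_length(data, i)
        if n:
            out.append(data[i:i + n].decode("utf-8"))
            i += n
            in_error = False
        else:
            if per_byte or not in_error:
                out.append(REPLACEMENT)
            in_error = True
            i += 1  # advance one byte only, so the next byte is re-examined
    return "".join(out)
```

[[ For <F0 80 80 41>, convert(data, per_byte=False) gives
<U+FFFD, U+0041> and convert(data, per_byte=True) gives
<U+FFFD, U+FFFD, U+FFFD, U+0041>. In both cases <41> is
interpreted as <U+0041>. ]]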

======================= end of draft =======================