UTF-8 and "Non-Shortest Form" (R2)

To:	UTC
From:	Mark Davis
Date:	2000-10-15
Re:	UTF-8 and "Non-Shortest Form" (R2)

The following is a proposal for tightening the language for UTF-8 to close the "non-shortest form" issue. It is modified from the previous paper to take in feedback from the mailing list. I suggest that two additional steps be taken:

the UTF-8 program in http://www.unicode.org/Public/PROGRAMS/CVTUTF/ be upgraded.
a conformance test file be added to Unicode 3.1.

The current C12 forbids the generation of "non-shortest form", and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest form. We still need to allow for programs that do fast processing with no error checking where the data is guaranteed to be well-formed, but otherwise we can extend the prohibition on illegal byte sequences to cover all ill-formed byte sequences, not just illegal ones. We also need to change the definition of UTF-8 to make absolutely clear what is ill-formed. (The definition of UTF-8 is in Chapter 3, and is duplicated in http://www.unicode.org/unicode/faq/utf_bom.html.)
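To make the concern concrete, here is a minimal Python sketch of the kind of fast decoder with no error checking described above. It is not the CVTUTF code: the name naive_decode is hypothetical, and the masks assume the standard UTF-8 bit distribution. Because it only shuffles bits, it maps several different byte sequences to the same scalar value.

```python
def naive_decode(data: bytes) -> list:
    """Decode UTF-8 by bit-shuffling alone, with no well-formedness checks.

    Hypothetical illustration only; safe solely on data that is already
    guaranteed to be well-formed.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:      # single byte
            cp, extra = b, 0
        elif b < 0xE0:    # treated as a two-byte lead (no check that it really is one)
            cp, extra = b & 0x1F, 1
        elif b < 0xF0:    # treated as a three-byte lead
            cp, extra = b & 0x0F, 2
        else:             # treated as a four-byte lead
            cp, extra = b & 0x07, 3
        # Fold in the low six bits of each following byte, whatever it is.
        for t in data[i + 1 : i + 1 + extra]:
            cp = (cp << 6) | (t & 0x3F)
        out.append(cp)
        i += 1 + extra
    return out


# Three different byte sequences, one decoded value (U+002F, "/"):
shortest = naive_decode(bytes([0x2F]))
two_byte = naive_decode(bytes([0xC0, 0xAF]))
three_byte = naive_decode(bytes([0xE0, 0x80, 0xAF]))
```

All three calls yield U+002F, so a check for "/" made on the raw bytes before decoding would miss the second and third forms. That is the gap the change to C12 is meant to close.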

To do this, we make the following normative modifications:

Modify C12 as follows:

C12

When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed ~~illegal~~ byte sequences as an error condition unless the data is guaranteed to be well-formed.

Modify D36 as follows:

D36	UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. A byte sequence is ill-formed UTF-8 if and only if some sequence of those bytes matches one (or more) of the lines in Table 3.1b.
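As an illustration of the generating side, the following Python sketch serializes a scalar value in shortest form. Table 3.1 is not reproduced in this paper, so the sketch assumes the standard UTF-8 bit distribution; the name encode_scalar is hypothetical, and surrogate values are not special-cased.

```python
def encode_scalar(cp: int) -> bytes:
    """Serialize one scalar value as one to four bytes, shortest form only.

    Hypothetical sketch assuming the standard UTF-8 bit distribution.
    Surrogate values (D800-DFFF) are not treated specially here.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("value outside 0..10FFFF")
    if cp < 0x80:         # fits in 7 bits: one byte
        return bytes([cp])
    if cp < 0x800:        # fits in 11 bits: two bytes, lead byte C2-DF
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:      # fits in 16 bits: three bytes, lead byte E0-EF
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([        # up to 10FFFF: four bytes, lead byte F0-F4
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Because each branch is taken only when no shorter form would fit, the lowest two-byte output is <C2, 80>, the lowest three-byte output is <E0, A0, 80>, the lowest four-byte output is <F0, 90, 80, 80>, and the highest output is <F4, 8F, BF, BF>. Those boundaries correspond to lines 1, 4, 6, 7, and 2 of Table 3.1b.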

Add the following table and text:

Table 3.1b. Ill-Formed UTF-8 Byte Sequences
Line	1 Byte	2 Bytes	3 Bytes	4 Bytes	5 Bytes
1	C0-C1
2	F5-FF
3	C2-DF	00-7F, C0-FF
4	E0	00-9F, C0-FF
5	E1-EF	00-7F, C0-FF
6	F0	00-9F, C0-FF
7	F4	00-7F, 90-FF
8	E0-EF	XX	00-7F, C0-FF
9	F0-F4	XX	00-7F, C0-FF
10	F0-F4	XX	XX	00-7F, C0-FF
11	ED	A0-AF	XX	ED	B0-BF

Table 3.1b lists all of the byte sequences that are ill-formed in UTF-8. An "XX" in a cell matches any byte whatsoever; every other cell lists the specific byte ranges it matches. Thus:

the sequence <E3, 80, C0> is ill-formed because the three bytes match line 8.
the sequence <E3, FF, C0> is ill-formed for a number of reasons: the first two bytes match line 5, all three bytes match line 8, the second byte matches line 2, and the third byte matches line 1.
the sequence <ED, A0, 80, ED, B0, 80> is ill-formed because the first 5 bytes match line 11.
the sequence <ED, A0, 80, 7F> is well-formed — no subsequence matches any line.
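The matching rule can be checked mechanically. The Python sketch below transcribes Table 3.1b literally and reports which lines match at any position in a byte sequence. The names TABLE_3_1B, matching_lines, and is_ill_formed are hypothetical, and the sketch is meant only as an aid for reading the table, not as a proposed conformance test.

```python
ANY = [(0x00, 0xFF)]                       # an "XX" cell
NOT_TRAIL = [(0x00, 0x7F), (0xC0, 0xFF)]   # the "00-7F, C0-FF" cells

# One entry per table line; each cell is a list of inclusive (low, high) ranges.
TABLE_3_1B = {
    1: [[(0xC0, 0xC1)]],
    2: [[(0xF5, 0xFF)]],
    3: [[(0xC2, 0xDF)], NOT_TRAIL],
    4: [[(0xE0, 0xE0)], [(0x00, 0x9F), (0xC0, 0xFF)]],
    5: [[(0xE1, 0xEF)], NOT_TRAIL],
    6: [[(0xF0, 0xF0)], [(0x00, 0x9F), (0xC0, 0xFF)]],
    7: [[(0xF4, 0xF4)], [(0x00, 0x7F), (0x90, 0xFF)]],
    8: [[(0xE0, 0xEF)], ANY, NOT_TRAIL],
    9: [[(0xF0, 0xF4)], ANY, NOT_TRAIL],
    10: [[(0xF0, 0xF4)], ANY, ANY, NOT_TRAIL],
    11: [[(0xED, 0xED)], [(0xA0, 0xAF)], ANY, [(0xED, 0xED)], [(0xB0, 0xBF)]],
}


def matching_lines(data: bytes) -> set:
    """Return the table lines matched by some run of consecutive bytes in data."""
    hits = set()
    for start in range(len(data)):
        for line, cells in TABLE_3_1B.items():
            window = data[start : start + len(cells)]
            if len(window) != len(cells):
                continue  # not enough bytes left for this line
            if all(
                any(lo <= b <= hi for lo, hi in cell)
                for b, cell in zip(window, cells)
            ):
                hits.add(line)
    return hits


def is_ill_formed(data: bytes) -> bool:
    """Ill-formed if and only if at least one table line matches."""
    return bool(matching_lines(data))
```

Run on the four examples above, matching_lines reports lines 1 and 8 for <E3, 80, C0> (the third byte alone also matches line 1), lines 1, 2, 5, and 8 for <E3, FF, C0>, line 11 for <ED, A0, 80, ED, B0, 80>, and no lines for <ED, A0, 80, 7F>.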