To: UTC
From: Mark Davis
Date: 2000-10-15
Re: UTF-8 and "Non-Shortest Form" (R2)

The following is a proposal for tightening up the language for UTF-8 to close the "non-shortest form" issue. It is modified from the previous paper, taking in feedback from the mailing list. I suggest that two additional steps be taken:


The current C12 forbids the generation of "non-shortest form" and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest form. We still need to allow for programs that do fast processing with no error checking where the data is guaranteed to be well-formed, but otherwise we can extend the prohibition on illegal byte sequences to cover all ill-formed byte sequences, not just illegal byte sequences. We also need to change the definition of UTF-8 to make absolutely clear what is ill-formed. (The definition of UTF-8 is in Chapter 3, and is duplicated in http://www.unicode.org/unicode/faq/utf_bom.html.)
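As an illustration of why interpreting non-shortest forms matters, the following minimal sketch shows what a decoder with no error checking does with the two-byte sequence C0 AF. The function name `unchecked_decode_2byte` is hypothetical and is not part of the standard or of this proposal; it only combines payload bits the way an unchecked fast path might.

```python
def unchecked_decode_2byte(lead: int, trail: int) -> int:
    """Combine the payload bits of a two-byte sequence with no validation.

    Hypothetical helper: it takes the low 5 bits of the lead byte and the
    low 6 bits of the trail byte, and never checks for shortest form.
    """
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# U+002F '/' has the one-byte (shortest) form 2F.
shortest = 0x2F

# C0 AF is a two-byte non-shortest form carrying the same payload bits.
non_shortest = unchecked_decode_2byte(0xC0, 0xAF)

print(hex(shortest), hex(non_shortest))  # prints: 0x2f 0x2f
```

Both inputs yield the same scalar value, so a process that interprets C0 AF without complaint accepts two different byte sequences for one character.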

To do this, we make the following normative modifications:

Modify C12 as follows:

C12

When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed byte sequences as an error condition unless the data is guaranteed to be well-formed.
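For illustration only, here is a sketch of an interpreting process that treats ill-formed input as an error condition. It relies on Python 3's built-in strict UTF-8 codec as a stand-in for a conformant decoder; the wrapper names `interpret_utf8` and `is_rejected` are hypothetical and are not part of the proposed text.

```python
def interpret_utf8(data: bytes) -> str:
    """Decode strictly: ill-formed input raises UnicodeDecodeError."""
    return data.decode("utf-8", errors="strict")


def is_rejected(data: bytes) -> bool:
    """Return True if the strict decoder reports an error for `data`."""
    try:
        interpret_utf8(data)
    except UnicodeDecodeError:
        return True
    return False


print(is_rejected(b"/"))             # shortest form of U+002F
print(is_rejected(b"\xc0\xaf"))      # two-byte non-shortest form
print(is_rejected(b"\xe0\x80\xaf"))  # three-byte non-shortest form
```

A process that skips this check for speed would be relying on the exemption for data that is guaranteed to be well-formed.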

 

Modify D36 as follows:

D36

UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. A byte sequence is ill-formed UTF-8 if and only if some sequence of those bytes matches one (or more) of the lines in Table 3.1b.

Add the following table and text:

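Table 3.1b. Ill-Formed UTF-8 Byte Sequences

| Line | 1 Byte | 2 Bytes | 3 Bytes | 4 Bytes | 5 Bytes |
|------|--------|---------|---------|---------|---------|
| 1 | C0-C1 | | | | |
| 2 | F5-FF | | | | |
| 3 | C2-DF | 00-7F, C0-FF | | | |
| 4 | E0 | 00-9F, C0-FF | | | |
| 5 | E1-EF | 00-7F, C0-FF | | | |
| 6 | F0 | 00-8F, C0-FF | | | |
| 7 | F4 | 00-7F, 90-FF | | | |
| 8 | E0-EF | XX | 00-7F, C0-FF | | |
| 9 | F0-F4 | XX | 00-7F, C0-FF | | |
| 10 | F0-F4 | XX | XX | 00-7F, C0-FF | |
| 11 | ED | A0-AF | XX | ED | B0-BF |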
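The table lends itself to a data-driven check. The sketch below is one possible reading, not a reference implementation: it assumes "some sequence of those bytes" means a contiguous run starting at any offset, it treats "XX" as a wildcard matching any byte, and it encodes line 6 as 00-8F, C0-FF on the grounds that F0 90 80 80 is the well-formed encoding of U+10000. All names (`ILL_FORMED_LINES`, `first_ill_formed_match`, and so on) are hypothetical.

```python
# Each line of Table 3.1b is a list of cells; each cell is a list of
# inclusive (low, high) byte ranges. ANY stands for the "XX" wildcard.
ANY = [(0x00, 0xFF)]
BAD_TRAIL = [(0x00, 0x7F), (0xC0, 0xFF)]

ILL_FORMED_LINES = [
    [[(0xC0, 0xC1)]],                                # line 1
    [[(0xF5, 0xFF)]],                                # line 2
    [[(0xC2, 0xDF)], BAD_TRAIL],                     # line 3
    [[(0xE0, 0xE0)], [(0x00, 0x9F), (0xC0, 0xFF)]],  # line 4
    [[(0xE1, 0xEF)], BAD_TRAIL],                     # line 5
    [[(0xF0, 0xF0)], [(0x00, 0x8F), (0xC0, 0xFF)]],  # line 6 (assumed 00-8F)
    [[(0xF4, 0xF4)], [(0x00, 0x7F), (0x90, 0xFF)]],  # line 7
    [[(0xE0, 0xEF)], ANY, BAD_TRAIL],                # line 8
    [[(0xF0, 0xF4)], ANY, BAD_TRAIL],                # line 9
    [[(0xF0, 0xF4)], ANY, ANY, BAD_TRAIL],           # line 10
    [[(0xED, 0xED)], [(0xA0, 0xAF)], ANY,
     [(0xED, 0xED)], [(0xB0, 0xBF)]],                # line 11
]


def cell_matches(cell, byte):
    """True if `byte` falls in any of the cell's inclusive ranges."""
    return any(low <= byte <= high for low, high in cell)


def first_ill_formed_match(data: bytes):
    """Return (offset, line_number) of the first table match, or None."""
    for offset in range(len(data)):
        for line_number, line in enumerate(ILL_FORMED_LINES, start=1):
            window = data[offset:offset + len(line)]
            if len(window) == len(line) and all(
                cell_matches(cell, b) for cell, b in zip(line, window)
            ):
                return offset, line_number
    return None


print(first_ill_formed_match(b"\xc0\xaf"))          # (0, 1)
print(first_ill_formed_match(b"\xf0\x90\x80\x80"))  # None
```

The scan tries every offset against every line, which is simple rather than fast; it checks only what the table lists and makes no claim about inputs the table does not cover.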

Table 3.1b lists all of the byte sequences that are ill-formed in UTF-8. An "XX" in a cell matches any byte whatsoever; otherwise the cell lists the specific byte range. Thus