L2/00-374R2

To: UTC
From: Mark Davis
Date: 2000-10-15
Re: UTF-8 and "Non-Shortest Form" (R2)

The following is a proposal for tightening up the language for UTF-8 to close the "non-shortest form" issue. It is modified from the previous paper, taking in feedback from the mailing list. I suggest that two additional steps be taken:


The current C12 forbids the generation of "non-shortest form" and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of non-shortest forms. We still need to allow for programs that do fast processing with no error checking where the data is guaranteed to be well-formed. In all other cases, we can extend the prohibition from illegal byte sequences to all ill-formed byte sequences. We also need to change the definition of UTF-8 to make absolutely clear what is ill-formed. (The definition of UTF-8 is in Chapter 3, and is duplicated in http://www.unicode.org/unicode/faq/utf_bom.html.)
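To make "non-shortest form" concrete, the following Python sketch packs one scalar value into each possible sequence length. The function name encode_with_length is hypothetical and is used only for illustration; it deliberately ignores the shortest-form rule, which is exactly what a conformant generator must not do.

```python
def encode_with_length(cp: int, n: int) -> bytes:
    """Pack cp into an n-byte UTF-8 bit pattern, ignoring the shortest-form rule.

    Hypothetical helper for illustration only.
    """
    if n == 1:
        if cp > 0x7F:
            raise ValueError("does not fit in one byte")
        return bytes([cp])
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[n]
    payload_bits = {2: 11, 3: 16, 4: 21}[n]
    if cp >= 1 << payload_bits:
        raise ValueError("does not fit in %d bytes" % n)
    trail = []
    for _ in range(n - 1):
        # Each trail byte carries the next six low-order bits.
        trail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # The remaining high-order bits go into the lead byte.
    return bytes([lead_marker | cp] + trail[::-1])


for n in (1, 2, 3, 4):
    print(n, encode_with_length(0x2F, n).hex(" ").upper())
```

For U+002F this yields 2F, C0 AF, E0 80 AF, and F0 80 80 AF. Only the first is the shortest form; a decoder that applies the bit arithmetic without range checks would read all four as the same character.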

To do this, we make the following normative modifications:

Modify C12 as follows:

C12 When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed byte sequences as an error condition unless the data is guaranteed to be well-formed.
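The two interpretation paths that C12 permits might be sketched as follows in Python. The names interpret and decode_unchecked and the guaranteed_well_formed flag are hypothetical. As an assumption, Python's built-in strict UTF-8 codec stands in for the ill-formedness check; it follows a later and somewhat stricter definition than the one proposed here, so it is only an approximation.

```python
def decode_unchecked(data: bytes) -> list:
    """Fast path: trusts the input, so no trail-byte or range checks are made."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif b < 0xE0:
            n, cp = 2, b & 0x1F
        elif b < 0xF0:
            n, cp = 3, b & 0x0F
        else:
            n, cp = 4, b & 0x07
        for t in data[i + 1:i + n]:
            cp = (cp << 6) | (t & 0x3F)
        out.append(cp)
        i += n
    return out


def interpret(data: bytes, guaranteed_well_formed: bool = False) -> list:
    """Return scalar values; treat ill-formed input as an error unless trusted."""
    if not guaranteed_well_formed:
        try:
            # Stand-in check (assumption): Python's strict codec.
            data.decode("utf-8")
        except UnicodeDecodeError as exc:
            raise ValueError("ill-formed UTF-8") from exc
    return decode_unchecked(data)
```

The check is skipped only when the caller asserts the guarantee. If that assertion is false, the fast path silently maps C0 AF to U+002F, which is why the exemption is limited to data known to be well-formed.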
 

Modify D36 as follows:

D36 UTF-8 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of one to four bytes, as specified in Table 3.1. A byte sequence is ill-formed UTF-8 if and only if some sequence of those bytes matches one (or more) of the lines in Table 3.1b.

Add the following table and text:

Table 3.1b. Ill-Formed UTF-8 Byte Sequences

Row  1 Byte  2 Bytes       3 Bytes       4 Bytes       5 Bytes
1    C0-C1
2    F5-FF
3    C2-DF   00-7F, C0-FF
4    E0      00-9F, C0-FF
5    E1-EF   00-7F, C0-FF
6    F0      00-8F, C0-FF
7    F4      00-7F, 90-FF
8    E0-EF   XX            00-7F, C0-FF
9    F0-F4   XX            00-7F, C0-FF
10   F0-F4   XX            XX            00-7F, C0-FF
11   ED      A0-AF         XX            ED            B0-BF
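One way to read Table 3.1b is as a set of byte patterns that can be tested at every offset of a buffer. The Python sketch below does this; TABLE_3_1B, first_match, and is_ill_formed are hypothetical names. It assumes that the second-byte range for a lead byte of F0 (row 6) is 00-8F, C0-FF, because F0 90 begins the well-formed encoding of U+10000. It implements only the rows as listed and makes no claim beyond them.

```python
XX = [(0x00, 0xFF)]                       # "XX": matches any byte
BAD_TRAIL = [(0x00, 0x7F), (0xC0, 0xFF)]  # anything outside 80-BF

# Row number -> list of cells; each cell is a list of inclusive byte ranges.
TABLE_3_1B = {
    1: [[(0xC0, 0xC1)]],
    2: [[(0xF5, 0xFF)]],
    3: [[(0xC2, 0xDF)], BAD_TRAIL],
    4: [[(0xE0, 0xE0)], [(0x00, 0x9F), (0xC0, 0xFF)]],
    5: [[(0xE1, 0xEF)], BAD_TRAIL],
    6: [[(0xF0, 0xF0)], [(0x00, 0x8F), (0xC0, 0xFF)]],  # assumed 00-8F
    7: [[(0xF4, 0xF4)], [(0x00, 0x7F), (0x90, 0xFF)]],
    8: [[(0xE0, 0xEF)], XX, BAD_TRAIL],
    9: [[(0xF0, 0xF4)], XX, BAD_TRAIL],
    10: [[(0xF0, 0xF4)], XX, XX, BAD_TRAIL],
    11: [[(0xED, 0xED)], [(0xA0, 0xAF)], XX, [(0xED, 0xED)], [(0xB0, 0xBF)]],
}


def first_match(data: bytes):
    """Return (offset, row) for the first table row that matches, else None."""
    for i in range(len(data)):
        for row, cells in TABLE_3_1B.items():
            window = data[i:i + len(cells)]
            if len(window) == len(cells) and all(
                any(lo <= b <= hi for lo, hi in cell)
                for b, cell in zip(window, cells)
            ):
                return i, row
    return None


def is_ill_formed(data: bytes) -> bool:
    return first_match(data) is not None
```

Under these assumptions, C0 AF matches row 1, E0 80 AF matches row 4, F0 80 80 AF matches row 6, F4 90 80 80 matches row 7, and the surrogate pair ED A0 80 ED B0 80 matches row 11, while F0 90 80 80 (U+10000) matches no row.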

Table 3.1b lists all of the byte sequences that are ill-formed in UTF-8. An "XX" in a cell matches any byte whatsoever; otherwise the cell lists the specific byte range. Thus