L2/00-414
(formerly L2/00-374R4)

 
To: UTC
From:  Mark Davis
Date: 2000-11-12
Re: UTF-8 and "Non-Shortest Form" (R4)

The following is a proposal for a corrigendum to the Unicode Standard, tightening up the language for UTF-8 to close the "non-shortest form" issue. As a part of these actions, we may also want to add a conformance test file.


Corrigendum to Unicode 3.0.1

The current C12 forbids the generation of "non-shortest form" UTF-8 and forbids the interpretation of illegal sequences, but it does not forbid the interpretation of "non-shortest form" sequences. Where software does not take this into account, security issues can arise. For example, process A performs security checks but does not check for non-shortest forms. Process B accepts the byte sequence from process A and transforms it into UTF-16, interpreting non-shortest forms as it does so. The UTF-16 text may then contain characters that process A should have filtered out.
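The scenario above can be illustrated with a minimal sketch. Both functions are hypothetical and are not taken from any real implementation: `process_a_is_safe` stands in for a byte-level security check, and `lenient_decode` stands in for a decoder that interprets non-shortest two-byte forms. The payload and the "/" filter are assumptions chosen only for illustration.

```python
def process_a_is_safe(data: bytes) -> bool:
    """Hypothetical process A: rejects input containing the byte 2F ('/')."""
    return b"/" not in data


def lenient_decode(data: bytes) -> str:
    """Hypothetical process B: decodes one- and two-byte sequences
    without checking for non-shortest forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and 0x80 <= data[i + 1] <= 0xBF):
            # No check that the decoded value is >= 0x80,
            # so <C0, AF> is interpreted as U+002F.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed byte at offset %d" % i)
    return "".join(out)


# <C0, AF> is a non-shortest form of U+002F ('/').
payload = b"..\xc0\xaf..\xc0\xafetc"
```

Here `process_a_is_safe(payload)` returns `True` because the byte 2F never appears, yet `lenient_decode(payload)` yields `../../etc`. A decoder that rejects the lead byte C0, as Python's built-in `payload.decode("utf-8")` does, would raise an error instead.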

To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting the problematic non-shortest forms, and clarified some of the conformance clauses. These modifications make use of updated notation: see the Glossary for any unfamiliar terms. The UTF-8 program in http://www.unicode.org/Public/PROGRAMS/CVTUTF/ has been upgraded to reflect this corrigendum.

Add the following to the end of C12:

A conformant process shall not interpret illegal UTF code unit sequences as characters. Irregular UTF code unit sequences shall not be used for encoding any other information.

Add the following notes after C12:

Delete the second sentence in the note under D32:

For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead.

Modify D36 as follows, and add a note:
 
D36 UTF-8 is the Unicode Transformation Format that serializes a Unicode code point as a sequence of one to four bytes, as specified in Table 3.1. Any UTF-8 byte sequences are illegal unless they match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences. An irregular code unit sequence in UTF-8 is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate.
As a consequence of C12, these irregular UTF-8 sequences shall not be generated by a conformant process.
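As a rough illustration of the irregular sequences defined in D36, the sketch below scans a byte string for six-byte runs whose first three bytes encode a high surrogate (U+D800..U+DBFF) and whose next three bytes encode a low surrogate (U+DC00..U+DFFF). The function name `find_irregular_pairs` is hypothetical, and the byte ranges are derived here from the three-byte pattern rather than quoted from the text.

```python
def find_irregular_pairs(data: bytes) -> list:
    """Return the offsets of six-byte irregular UTF-8 sequences in data."""
    offsets = []
    for i in range(len(data) - 5):
        hi = data[i:i + 3]
        lo = data[i + 3:i + 6]
        # High surrogates U+D800..U+DBFF encode as ED A0..AF 80..BF.
        is_high = hi[0] == 0xED and 0xA0 <= hi[1] <= 0xAF and 0x80 <= hi[2] <= 0xBF
        # Low surrogates U+DC00..U+DFFF encode as ED B0..BF 80..BF.
        is_low = lo[0] == 0xED and 0xB0 <= lo[1] <= 0xBF and 0x80 <= lo[2] <= 0xBF
        if is_high and is_low:
            offsets.append(i)
    return offsets
```

For instance, U+10000 written as the surrogate pair D800 DC00 gives the irregular sequence <ED, A0, 80, ED, B0, 80>, whereas its regular form is the four-byte sequence <F0, 90, 80, 80>.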

Delete the two text paragraphs after Table 3.1. The relevant portions have been elevated into definitions or conformance clauses.

When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used. This practice preserves uniqueness of encoding. For example, the Unicode binary value <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>. The latter is an example of an irregular UTF-8 byte sequence. Irregular UTF-8 sequences shall not be used for encoding any other information.

When converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm.

Replace them by the following table and text:
 

Table 3.1B. Legal UTF-8 Byte Sequences
Code Points          1st Byte  2nd Byte  3rd Byte  4th Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF    80..BF
U+0800..U+0FFF       E0        A0..BF    80..BF
U+1000..U+FFFF       E1..EF    80..BF    80..BF
U+10000..U+3FFFF     F0        90..BF    80..BF    80..BF
U+40000..U+FFFFF     F1..F3    80..BF    80..BF    80..BF
U+100000..U+10FFFF   F4        80..8F    80..BF    80..BF

Table 3.1B lists all of the byte sequences that are legal in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is legal in that position. Any byte value outside of the ranges listed is illegal. For example, the byte sequence <C0, AF> is illegal since C0 is not legal in the 1st Byte column. The byte sequence <E0, 9F, 80> is illegal since in the row where E0 is legal as a first byte, 9F is not legal as a second byte. The byte sequence <F4, 80, 83, 92> is legal, since every byte in that sequence matches a byte range in a row of the table (the last row).
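One way to apply Table 3.1B is a direct table-driven check, sketched below. The names `LEGAL_ROWS`, `match_row`, and `is_legal_utf8` are hypothetical. The sketch follows the rows of the table exactly as given and makes no attempt to detect the irregular six-byte sequences described in D36.

```python
# Each row of Table 3.1B as a tuple of inclusive (low, high) byte ranges.
LEGAL_ROWS = [
    ((0x00, 0x7F),),
    ((0xC2, 0xDF), (0x80, 0xBF)),
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),
    ((0xE1, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def match_row(data: bytes, pos: int) -> int:
    """Return the length of the legal sequence starting at pos, or 0 if none."""
    for row in LEGAL_ROWS:
        chunk = data[pos:pos + len(row)]
        if len(chunk) == len(row) and all(
            lo <= b <= hi for b, (lo, hi) in zip(chunk, row)
        ):
            return len(row)
    return 0


def is_legal_utf8(data: bytes) -> bool:
    """Return True if data consists only of sequences matching Table 3.1B."""
    pos = 0
    while pos < len(data):
        n = match_row(data, pos)
        if n == 0:
            return False
        pos += n
    return True
```

Applied to the three examples above, `is_legal_utf8` returns `False` for <C0, AF> and <E0, 9F, 80>, and `True` for <F4, 80, 83, 92>.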

Add the following to the end of DXX for UTF-32:

An irregular byte sequence in UTF-32 is an eight-byte sequence where the first four bytes correspond to a high surrogate, and the next four bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-32 sequences shall not be generated by a conformant process.