Corrigendum #1: UTF-8 Shortest Form

| Corrigendum | Effective Date | Applicable Versions | Fixed Version | Result Documented In |
|---|---|---|---|---|
| Corrigendum #1: UTF-8 Shortest Form | 2000-Nov-09 [85-M12] | 3.0.0 and 3.0.1 | 3.1.0 2001-March | Chapter 3, Conformance |

The conformance clause C12 in The Unicode Standard, Version 3.0 forbids the generation of "non-shortest form" UTF-8, and forbids the interpretation of illegal sequences, but not the interpretation of "non-shortest form". Where software does interpret the non-shortest forms, security issues can arise. For example:

Process A performs security checks, but does not check for non-shortest forms.

Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms.

The UTF-16 text may then contain characters that should have been filtered out by process A.
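The scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: `process_a_accepts` and `lenient_decode` are hypothetical names, the "security check" is reduced to rejecting the byte 2F ("/"), and the decoder is a deliberately permissive one that applies the UTF-8 bit distribution without checking for shortest form.

```python
def process_a_accepts(data: bytes) -> bool:
    """Hypothetical Process A: a byte-level filter that rejects '/' (0x2F)
    but does not look for non-shortest forms."""
    return b"/" not in data


def lenient_decode(data: bytes) -> str:
    """Hypothetical Process B: decodes one- to three-byte forms by bit
    distribution alone, so non-shortest forms are interpreted."""
    out, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 1, b
        elif b & 0xE0 == 0xC0:
            n, cp = 2, b & 0x1F
        elif b & 0xF0 == 0xE0:
            n, cp = 3, b & 0x0F
        else:
            raise ValueError(f"unsupported lead byte {b:#04x}")
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any(t & 0xC0 != 0x80 for t in tail):
            raise ValueError("bad trailing bytes")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        out.append(chr(cp))
        i += n
    return "".join(out)


payload = b"..\xc0\xaf..\xc0\xafsecret"      # <C0, AF> is a non-shortest form of U+002F
passed_filter = process_a_accepts(payload)   # the filter sees no 0x2F byte
text = lenient_decode(payload)               # Process B nevertheless produces "/"
```

Here `passed_filter` is true while `text` contains the very character Process A was meant to exclude, which is the mismatch the corrigendum closes.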

To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses.

These modifications make use of updated notation: see the Glossary for any unfamiliar terms.

Change C12 to the following:

C12 (a) When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed ~~byte~~ code unit sequences.
(b) When a process interprets data in a Unicode Transformation Format, it shall treat illegal ~~byte~~ code unit sequences as an error condition.
(c) A conformant process shall not interpret illegal UTF code unit sequences as characters.
(d) Irregular UTF code unit sequences shall not be used for encoding any other information.

Add the following notes after C12:

The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as <C0, AF> are illegal.
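As a minimal sketch of clause C12 (b), the following wraps Python's built-in strict UTF-8 decoder so that an illegal sequence such as <C0, AF> surfaces as an error condition rather than as characters. The function name `interpret_utf8` and the choice of `ValueError` are assumptions made for illustration.

```python
def interpret_utf8(data: bytes) -> str:
    """Interpret bytes as UTF-8, treating illegal sequences as an error."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte.
        raise ValueError(f"illegal UTF-8 at byte offset {exc.start}") from exc


ok = interpret_utf8(b"Mark")

try:
    interpret_utf8(b"ab\xc0\xaf")
    rejected = False
except ValueError as err:
    rejected = True
    message = str(err)
```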

Internally, a particular function might be used that does not check for illegal code unit sequences. However, a conformant process can use that function only on data that has already been certified to not contain any illegal code unit sequences.

Processes that require unique representation must not interpret irregular UTF code unit sequences as characters. They may, for example, reject or remove those sequences.

Processes may transform irregular code unit sequences into the equivalent well-formed code unit sequences.
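One possible way to perform such a transformation in Python is sketched below. It assumes the input contains only well-formed sequences plus irregular six-byte surrogate-pair forms; the function name `regularize` is hypothetical. The approach relies on the standard `surrogatepass` error handler and a round trip through UTF-16 to join each high/low pair.

```python
def regularize(data: bytes) -> bytes:
    """Rewrite irregular six-byte surrogate-pair sequences as the
    equivalent four-byte UTF-8 sequences."""
    # 'surrogatepass' lets the decoder accept three-byte surrogate forms.
    text = data.decode("utf-8", errors="surrogatepass")
    # Round-tripping through UTF-16 joins each high/low surrogate pair into
    # one supplementary code point. An unpaired surrogate makes the strict
    # UTF-16 decode raise UnicodeDecodeError.
    joined = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    return joined.encode("utf-8")


irregular = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])  # D800 DC00 as two 3-byte forms
regular = regularize(irregular)                           # four-byte form of U+10000
```

A process that requires unique representation would instead reject the input rather than convert it.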

Conformant processes cannot interpret illegal code unit sequences. However, the conformance clauses do not, for example, prevent utility programs from operating on "mangled" text. For example, a UTF-8 file could have had CRLF sequences introduced at every 80 bytes by a bad mailer program. This could result in some UTF-8 byte sequences being interrupted by CRLFs, producing illegal byte sequences. This mangled text is no longer UTF-8. It is permissible for a conformant program to repair such text, recognizing that the mangled text was originally well-formed UTF-8 byte sequences. However, such repair of mangled data is a special case, and must not be used in circumstances where it would cause security problems.

Delete the second sentence in the note under D32:

For example, UTF-8 allows nonshortest code value sequences to be interpreted: a UTF-8 conformant process may map the code value sequence C0 80 (11000000₂ 10000000₂) to the Unicode value U+0000, even though a UTF-8 conformant process shall never generate that code value sequence -- it shall generate the sequence 00 (00000000₂) instead.

Modify D36 as follows, and add a note:

D36 (a) UTF-8 is the Unicode Transformation Format that serializes a Unicode code point as a sequence of one to four bytes, as specified in Table 3.1, UTF-8 Bit Distribution.
(b) An illegal UTF-8 code unit sequence is any byte sequence that does not match the patterns listed in Table 3.1B, Legal UTF-8 Byte Sequences.
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where the first three bytes correspond to a high surrogate, and the next three bytes correspond to a low surrogate. As a consequence of C12, these irregular UTF-8 sequences shall not be generated by a conformant process.

In UTF-8, <004D 0061 0072 006B> is serialized as <4D 61 72 6B>.
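The serialization in this example can be checked with Python's built-in encoder; the variable names are illustrative only.

```python
code_points = [0x004D, 0x0061, 0x0072, 0x006B]
encoded = "".join(map(chr, code_points)).encode("utf-8")
print(encoded.hex(" ").upper())  # 4D 61 72 6B
```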

The problematic "non-shortest form" byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are illegal, since they are not allowed by Table 3.1B.

Retain the paragraph and table immediately below D36, but replace the last sentence in the paragraph.

Table 3.1 specifies the bit distribution from a Unicode character (or surrogate pair) into the one- to four-byte values of the corresponding UTF-8 sequence. Note that the four-byte form for surrogate pairs involves an addition of 10000₁₆, to account for the starting offset to the encoded values referenced by surrogates. For a discussion of the difference in the formulation of UTF-8 in ISO/IEC 10646, see Section C.3, UCS Transformation Formats. The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters.

Table 3.1. UTF-8 Bit Distribution

| Scalar Value | UTF-16 | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|---|
| 00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| 000uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu | 10uuzzzz | 10yyyyyy | 10xxxxxx |

Where uuuuu = wwww + 1 (to account for addition of 10000₁₆ as in Section 3.7, Surrogates).
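The bit distribution can be written out directly. The sketch below is one way to do so, not a reference implementation; the names `encode_scalar` and `scalar_from_surrogates` are hypothetical. The second function shows the relation uuuuu = wwww + 1 when starting from a UTF-16 surrogate pair.

```python
def encode_scalar(cp: int) -> bytes:
    """Serialize one Unicode scalar value using the shortest UTF-8 form."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:                                   # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                                  # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                                 # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                 # 11110uuu 10uuzzzz ...
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def scalar_from_surrogates(high: int, low: int) -> int:
    """Combine 110110ww wwzzzzyy and 110111yy yyxxxxxx into a scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    wwww = (high >> 6) & 0xF
    uuuuu = wwww + 1                                 # accounts for the 10000 (hex) offset
    return (uuuuu << 16) | ((high & 0x3F) << 10) | (low & 0x3FF)


first_supplementary = encode_scalar(scalar_from_surrogates(0xD800, 0xDC00))
```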

Delete the two text paragraphs after Table 3.1. (The relevant portions have been elevated into definitions or conformance clauses.)

When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used. This practice preserves uniqueness of encoding. For example, the Unicode binary value <0000000000000001> is encoded as <00000001>, not as <11000000 10000001>. The latter is an example of an irregular UTF-8 byte sequence. Irregular UTF-8 sequences shall not be used for encoding any other information.

~~When converting from UTF-8 to a Unicode scalar value, implementations do not need to check that the shortest encoding is being used. This simplifies the conversion algorithm.~~

Replace them by the following table and text:

Table 3.1B. Legal UTF-8 Byte Sequences

| Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|---|---|---|---|---|
| U+0000..U+007F | 00..7F | | | |
| U+0080..U+07FF | C2..DF | 80..BF | | |
| U+0800..U+0FFF | E0 | A0..BF* | 80..BF | |
| U+1000..U+FFFF | E1..EF | 80..BF | 80..BF | |
| U+10000..U+3FFFF | F0 | 90..BF* | 80..BF | 80..BF |
| U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
| U+100000..U+10FFFF | F4 | 80..8F* | 80..BF | 80..BF |

Table 3.1B lists all of the byte sequences that are legal in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is legal in that position. Any byte value outside of the ranges listed is illegal. For example, the byte sequence <C0, AF> is illegal since C0 is not legal in the 1st Byte column. The byte sequence <E0, 9F, 80> is illegal since in the row where E0 is legal as a first byte, 9F is not legal as a second byte. The byte sequence <F4, 80, 83, 92> is legal, since every byte in that sequence matches a byte range in a row of the table (the last row).

Cases where a trailing byte range is not 80..BF are marked with an asterisk (*) in the table to draw attention to them. These occur only in the second byte of a sequence.
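A table-driven check is a natural fit for Table 3.1B. The following sketch encodes each row as a list of inclusive byte ranges and tests a byte string against them; `LEGAL_ROWS` and `is_legal_utf8` are hypothetical names, and the rows are transcribed from the table as given in this corrigendum.

```python
TAIL = (0x80, 0xBF)

# One entry per row of Table 3.1B: inclusive (low, high) range per byte.
LEGAL_ROWS = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), TAIL],
    [(0xE0, 0xE0), (0xA0, 0xBF), TAIL],
    [(0xE1, 0xEF), TAIL, TAIL],
    [(0xF0, 0xF0), (0x90, 0xBF), TAIL, TAIL],
    [(0xF1, 0xF3), TAIL, TAIL, TAIL],
    [(0xF4, 0xF4), (0x80, 0x8F), TAIL, TAIL],
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if data is a concatenation of sequences from Table 3.1B."""
    i = 0
    while i < len(data):
        for row in LEGAL_ROWS:
            chunk = data[i:i + len(row)]
            if len(chunk) == len(row) and all(
                low <= byte <= high for byte, (low, high) in zip(chunk, row)
            ):
                i += len(row)
                break
        else:
            # No row matched at this offset.
            return False
    return True


examples = {
    "C0 AF": is_legal_utf8(bytes([0xC0, 0xAF])),
    "E0 9F 80": is_legal_utf8(bytes([0xE0, 0x9F, 0x80])),
    "F4 80 83 92": is_legal_utf8(bytes([0xF4, 0x80, 0x83, 0x92])),
}
```

The three entries in `examples` correspond to the sequences discussed above: the first two are illegal and the third is legal. Because the first-byte ranges of the rows do not overlap, at most one row can match at any offset. Note that the table as reproduced here admits three-byte forms of surrogate code points (for example <ED, A0, 80>), so this check is more permissive than Python's built-in strict decoder, which rejects them.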

Add to Appendix C: Relationship to ISO/IEC 10646, Section C.3: UCS Transformation Formats, at the end of the subsection UTF-8:

The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters. ISO/IEC 10646 does not allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF (but it does allow other noncharacters).