L2/07-116

From: "Mark Davis"
Date: 2007-04-19 17:12:37 -0700
Subject: Unicode Security Exploit

Please add this to the agenda and the document registry.

We've recently been made aware of a security exploit, and propose that text addressing it be added to the next version of the standard and to UTR #36: Unicode Security Considerations.

Problem

The Unicode consortium does not specify what converters should do when they encounter malformed UTF-8. In particular, some converters, when they hit a case where there are not enough trail bytes, will "swallow" one or more of the bytes that follow. Even if the malformed sequence itself is mapped to a safe code point such as U+FFFD, the loss of the following character can be exploited in the following way.

A web page is constructed using text gathered from the user. It makes this safe by surrounding the text with HTML markup. A malicious user, however, can supply content terminated by a byte such as 0xC0, which is missing its trail byte. If the conversion process swallows the following byte, as described above, then part of the terminating HTML markup is lost, and an exploit can take advantage of the situation.
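To make the failure mode concrete, here is a minimal Python sketch of a hypothetical buggy decoder; the function and its logic are illustrative, not taken from any real converter. It always consumes as many following bytes as the lead byte announces, even when they are not trail bytes:

    # Hypothetical buggy decoder: on a lead byte, it skips the announced
    # number of bytes unconditionally, swallowing whatever follows.
    def buggy_decode(data: bytes) -> str:
        out, i = [], 0
        while i < len(data):
            b = data[i]
            if b < 0x80:
                out.append(chr(b)); i += 1    # ASCII
            elif 0xC0 <= b <= 0xDF:
                out.append('\ufffd'); i += 2  # BUG: 1 "trail" byte swallowed blindly
            elif 0xE0 <= b <= 0xEF:
                out.append('\ufffd'); i += 3  # BUG: 2 "trail" bytes swallowed blindly
            else:
                out.append('\ufffd'); i += 1
        return ''.join(out)

    user_input = b'abc\xc0'                   # attacker-controlled, ends with a lone lead byte
    page = b'<a href="' + user_input + b'">link</a>'
    print(buggy_decode(page))                 # -> <a href="abc\ufffd>link</a>

The attribute's closing quote has been swallowed, so the markup that was supposed to contain the user text is broken open.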

For example, given the bytes <E3 80 22 61> (22 is the ASCII double quote, 61 is "a"):

- some converters will assume that the 22 is part of the initial 3-byte sequence, but bad because a trail byte should appear in that position, and will resynch starting with the 61;

- others will stop before the first non-trail byte, indicate only the first two bytes as the illegal sequence, and resynch starting with the 22.
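As an observation (not a normative claim about any product), recent versions of CPython's UTF-8 decoder follow the second behavior when substituting U+FFFD:

    # <E3 80> is replaced by a single U+FFFD; the quote (22) and "a" (61) survive.
    print(b'\xe3\x80\x22\x61'.decode('utf-8', errors='replace'))
    # -> '\ufffd"a'

Under the first behavior, the quote would have been absorbed into the error unit and lost.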

The Unicode Standard is not sufficiently explicit regarding this case in Chapter 3. The byte sequence <E3 80 22 61> is clearly not a well-formed sequence according to Table 3-7. But what is not stated in the standard is that a process must (1) handle the sequence <E3 80>, an ill-formed sequence according to Table 3-7, as an error condition, instead of (2) handling the sequence <E3 80 22>, also an ill-formed sequence according to Table 3-7, as an error condition. Where the error condition is handled by aborting the conversion, this doesn't make much difference. But if it is handled by substituting a character, such as U+FFFD, for the illegal sequence, then it can make a big difference. Although less likely to arise, this could also be a problem with UTF-16.

Proposal

1. In UTS #36, we should explain the above situation, and describe the best practice (see below).

2. In the text of the standard, it appears to me that the best minimal fix would be a change to C10 on page 73.

C10. When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequences as an error condition, and shall not interpret such sequences as characters.
=>
C10. When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat any ill-formed code unit sequence that does not overlap with a well-formed code unit sequence as an error condition, and shall not interpret such sequences as characters.

This would be sufficient to solve the problem, since well-formed sequences never overlap one another. We might also add some clarifying language around D91 and D92.

Example UTF-8 Sequence: <E3 80 C0 80 FF 80 C2 22>

Thus, a process can choose to break this up in different ways, so long as its error handling does not include the 22 (which is well-formed) as part of an ill-formed sequence.


We can also add information about best practices. While it is permissible to segment ill-formed sequences in different ways, the best practice is to process in memory order, treating each ill-formed sequence as a separate unit when it represents the longest possible subsequence matching the bit distribution tables (Table 3-5 and Table 3-6 on page 103), or failing that, as one code unit. Thus the best practice for segmenting the above example is:

<E3 80> <C0 80> <FF> <80> <C2>, followed by the well-formed <22>.
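The following Python sketch, with illustrative function names of my own, implements this best practice; the well-formedness check abbreviates Table 3-7 by assuming the scanner has already verified that every byte after the lead is a trail byte (80..BF):

    # Segment a byte stream in memory order, yielding (segment, is_well_formed).
    # Ill-formed segments are the longest subsequences matching the UTF-8
    # bit-distribution patterns (lead byte plus trail bytes), else one code unit.
    def utf8_segments(data: bytes):
        i, n = 0, len(data)
        while i < n:
            b = data[i]
            if b < 0x80:                       # ASCII: always well-formed
                yield data[i:i+1], True
                i += 1
                continue
            # Expected length from the bit-distribution pattern; note this
            # admits ill-formed leads such as C0, C1, and F5..F7.
            if   0xC0 <= b <= 0xDF: length = 2
            elif 0xE0 <= b <= 0xEF: length = 3
            elif 0xF0 <= b <= 0xF7: length = 4
            else:                              # lone trail byte, or F8..FF
                yield data[i:i+1], False
                i += 1
                continue
            j = i + 1                          # consume trail bytes (80..BF)
            while j < i + length and j < n and 0x80 <= data[j] <= 0xBF:
                j += 1
            seg = data[i:j]
            yield seg, is_well_formed(seg)
            i = j

    # Check a candidate against Table 3-7, assuming bytes after the lead
    # are already known to be in 80..BF.
    def is_well_formed(seg: bytes) -> bool:
        if len(seg) == 2:
            return 0xC2 <= seg[0] <= 0xDF
        if len(seg) == 3:
            if seg[0] == 0xE0: return 0xA0 <= seg[1] <= 0xBF
            if seg[0] == 0xED: return 0x80 <= seg[1] <= 0x9F
            return 0xE1 <= seg[0] <= 0xEF
        if len(seg) == 4:
            if seg[0] == 0xF0: return 0x90 <= seg[1] <= 0xBF
            if seg[0] == 0xF4: return 0x80 <= seg[1] <= 0x8F
            return 0xF1 <= seg[0] <= 0xF3
        return False                           # a lone non-ASCII byte is never well-formed

    example = bytes([0xE3, 0x80, 0xC0, 0x80, 0xFF, 0x80, 0xC2, 0x22])
    for seg, ok in utf8_segments(example):
        print(' '.join(f'{b:02X}' for b in seg),
              'well-formed' if ok else 'ill-formed')

On the example this prints the five ill-formed units <E3 80>, <C0 80>, <FF>, <80>, and <C2>, followed by the well-formed <22>.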

Side issue on C10 (and a couple of others).

I find the wording "When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, ..." problematic; the "purporting" is a function of the process, not of the sequence. That is, I think it would be better phrased as:

"If a process purports to interpret a Unicode character encoding form, when that process interprets a code unit sequence, ..." or the even simpler

"When a process interprets a code unit sequence as being in a Unicode character encoding form, ..."

--
Mark