UTF-8 security

From: Doug Ewell (dewell@compuserve.com)
Date: Tue Oct 05 1999 - 12:26:10 EDT


I've been thinking about UTF-8 and specifically Markus Kuhn's position
that UTF-8 decoders need to catch and filter out non-minimal sequences,
such as {0xC0, 0x8A} for U+000A LINE FEED or {0xC0, 0xAF} for U+002F
SOLIDUS.

Markus makes a good point that one of the benefits of properly
constructed UTF-8 is that ASCII characters are only represented by
themselves, not hidden in sequences like {0xC0, 0x8A}, and that existing
tools shouldn't have to worry about such bogus sequences. Certainly in
situations where security is an issue, these sequences need to be
filtered out.
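
To make the problem concrete, here is a minimal sketch in Python of a
decoder that does no minimality checking. The name lenient_decode is
hypothetical, not taken from any real library, and the sketch assumes
sequences of at most four bytes. It shows the byte 0x2F is absent from
the raw data, yet the decoded text contains a SOLIDUS.

```python
def lenient_decode(data: bytes) -> str:
    """Decode UTF-8 without rejecting non-minimal sequences (illustrative only)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                # single byte: plain ASCII
            cp, n = lead, 1
        elif 0xC0 <= lead <= 0xDF:     # lead byte of a 2-byte sequence
            cp, n = lead & 0x1F, 2
        elif 0xE0 <= lead <= 0xEF:     # lead byte of a 3-byte sequence
            cp, n = lead & 0x0F, 3
        elif 0xF0 <= lead <= 0xF7:     # lead byte of a 4-byte sequence
            cp, n = lead & 0x07, 4
        else:
            raise ValueError(f"unexpected byte 0x{lead:02X} at offset {i}")
        if i + n > len(data):
            raise ValueError("truncated sequence")
        for b in data[i + 1:i + n]:
            if b & 0xC0 != 0x80:       # continuation bytes must be 10xxxxxx
                raise ValueError(f"bad continuation byte 0x{b:02X}")
            cp = (cp << 6) | (b & 0x3F)
        out.append(chr(cp))            # note: no check that the form was minimal
        i += n
    return "".join(out)


hidden_solidus = bytes([0xC0, 0xAF])
print(0x2F in hidden_solidus)                  # False: a byte-level scan sees no '/'
print(lenient_decode(hidden_solidus) == "/")   # True: the decoder produces one anyway
```

A tool that scans the raw bytes for "/" before decoding would pass this
input, which is exactly the situation Markus is warning about.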

But another advantage of UTF-8, not mentioned as often, is the speed and
computational efficiency of converting between {UCS-2, UCS-4, UTF-16}
and UTF-8. Anyone who has built encoders and/or decoders for UTF-1 or
UTF-7 can see how much easier UTF-8 is, both on the programmer and on
the CPU. If your application can deal with Unicode characters at all,
adding UTF-8 support is extremely simple.

Now I agree with Markus's point that the extra checking for non-minimal
sequences is computationally simple. In fact, I added this checking to
my own UTF-8 decoders. I would just not go so far as to require *all*
UTF-8 decoders to implement this checking.
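
As a rough illustration of how little work is involved, the check can
be reduced to one comparison per decoded sequence. This Python sketch
uses hypothetical names (MIN_CODE_POINT, is_minimal) and assumes
sequences of at most four bytes; it is not the code from my own
decoders.

```python
# Smallest code point that actually needs a sequence of each length.
# Anything below the threshold could have been encoded in fewer bytes.
MIN_CODE_POINT = {1: 0x0000, 2: 0x0080, 3: 0x0800, 4: 0x10000}


def is_minimal(code_point: int, length: int) -> bool:
    """Return True if `code_point` genuinely needs `length` bytes in UTF-8."""
    return code_point >= MIN_CODE_POINT[length]


print(is_minimal(0x2F, 1))   # True: SOLIDUS as the single byte 0x2F
print(is_minimal(0x2F, 2))   # False: SOLIDUS arriving as {0xC0, 0xAF}
print(is_minimal(0xE9, 2))   # True: U+00E9 really does need two bytes
```

A decoder would call something like this once per sequence, after the
continuation bytes have been folded into the code point.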

The Unicode FAQ makes a distinction between two types of "ill-formed"
byte sequences. There are "illegal" sequences such as {0xC0, 0x6F},
which cannot be converted to any Unicode character (but which Dan
Oscarsson's "Latin-1-compatible" decoder would go ahead and process as
"Ào"), and then there are "irregular" sequences such as {0xC0, 0xAF},
which *can* be converted but *should not* be if security is a concern.
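
The two categories can be told apart mechanically. The following
Python sketch classifies a single candidate sequence using the FAQ's
terms as quoted above; the function name classify and its return
strings are my own invention, and it assumes sequences of at most four
bytes. It only separates these two categories and does not test for
other problems such as surrogates or out-of-range values.

```python
def classify(seq: bytes) -> str:
    """Label one candidate sequence 'well-formed', 'irregular', or 'illegal'."""
    if not seq:
        return "illegal"
    lead = seq[0]
    if lead < 0x80:
        length, cp, minimum = 1, lead, 0x0000
    elif 0xC0 <= lead <= 0xDF:
        length, cp, minimum = 2, lead & 0x1F, 0x0080
    elif 0xE0 <= lead <= 0xEF:
        length, cp, minimum = 3, lead & 0x0F, 0x0800
    elif 0xF0 <= lead <= 0xF7:
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        return "illegal"               # stray continuation or unsupported lead byte
    if len(seq) != length:
        return "illegal"               # too few or too many bytes for this lead
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            return "illegal"           # not a 10xxxxxx continuation byte
        cp = (cp << 6) | (b & 0x3F)
    # Convertible, but a shorter encoding of the same code point exists.
    return "irregular" if cp < minimum else "well-formed"


print(classify(bytes([0xC0, 0x6F])))           # illegal
print(classify(bytes([0xC0, 0xAF])))           # irregular
print(classify(b"/"))                          # well-formed
print(bytes([0xC0, 0x6F]).decode("latin-1"))   # Ào (the Latin-1 reading)
```

The last line shows where "Ào" comes from: read as Latin-1, 0xC0 is
LATIN CAPITAL LETTER A WITH GRAVE and 0x6F is "o".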

Different applications have different security needs, and not all
programs have to worry equally about receiving irregular sequences.
Consider, for example, a dictionary and encyclopedia on CD-ROM that
takes full advantage of Unicode to encode IPA word pronunciations,
foreign languages, math symbols, etc. in plain text, and uses UTF-8 as
the storage encoding. This program doesn't receive any text from the
outside world, so there should never be a danger of irregular sequences
in the first place; and it does not have any security implications, so
even if an irregular sequence existed, it's not as if some hacker would
be granted access to a secure system. The main requirements of this
closed-system UTF-8 decoder would be speed, speed, and more speed.
Developers of applications like this can decide for themselves what is
necessary.

In the future, UTF-8 encoding and decoding will likely be a function of
the operating system, and since the OS can't tell whether an app needs
the additional security or not, it should definitely check for irregular
sequences at the expense of a small amount of processing time.

Comments are very welcome.

-Doug Ewell
 Placentia, California


