Subject: Security concerns with UTF-8
Editor: Edwin Hart
ISO/IEC 10646-1: 2000
IETF RFC 2279
This document summarizes the email discussion to date on the security issues with UTF-8, particularly the relevant text in The Unicode Standard, Version 3.0.
Email from Edwin Hart
Thu 2000-10-19 16:10
Subject: UTC action on malformed/illegal UTF-8 sequences?
Does the UTC need to address the issue of malformed and illegal UTF-8 sequences, etc.? The text in question is the example in D32 and the last sentence of the section on shortest encoding.
The Unicode philosophy has been to avoid killing characters your software doesn't understand. This enables adding new characters to the code without killing the software that was written before the new characters were added.
The Security philosophy seems to be: If it is out of specification, kill it ("Anything not explicitly allowed is denied.").
The situation in the attached message is not the same as in the first paragraph.
RFC 2279 (UTF-8) lists some examples that could cause security problems.
Section 3.8 of The Unicode Standard, Version 3.0 seems to permit interpretation of "ill-formed code value sequences" that can cause other software to mis-interpret the characters and produce the wrong action.
The issue for UTC may be: If a process receives an "ill-formed" code sequence, should the standard specify the action or allow interpretation and give warnings (like RFC 2279). Will more software break if the ill-formed sequence is allowed or denied? Given the number of security problems and fixes I see a week, I personally think that the UTC needs to tighten the algorithms and require an exception condition rather than interpret the ill-formed code value sequences.
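For illustration, a minimal Python sketch (not from any of the emails; the function names are hypothetical) of why the shortest-form question matters: a decoder that skips the check maps the overlong sequence 0xC0 0xAF to U+002F, the '/' character.

```python
def lenient_decode_2byte(b1, b2):
    """Decode a two-byte UTF-8 sequence WITHOUT a shortest-form check."""
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

def is_shortest_form_2byte(b1, b2):
    """A two-byte sequence is shortest form only for U+0080..U+07FF."""
    return lenient_decode_2byte(b1, b2) >= 0x80

# The overlong sequence cited in RFC 2279's security note:
print(hex(lenient_decode_2byte(0xC0, 0xAF)))   # 0x2f -> '/'
print(is_shortest_form_2byte(0xC0, 0xAF))      # False: should be rejected
print(is_shortest_form_2byte(0xC2, 0xA9))      # True: U+00A9 needs two bytes
```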
From: Cris Bailiff [mailto:c.bailiff@E-SECURE.COM.AU]
Sent: Thursday, October 19, 2000 6:08 AM
Subject: Re: IIS %c1%1c remote command execution
> Florian Weimer <Florian.Weimer@RUS.UNI-STUTTGART.DE> writes:
> This is one of the vulnerabilities Bruce Schneier warned of in one of
> the past CRYPTO-GRAM issues. The problem isn't the wrong timing of
> path checking alone, but also a poorly implemented UTF-8 decoder.
> RFC 2279 explicitly says that overlong sequences such as 0xC0 0xAF are
> invalid.
As someone often involved in reviewing and improving other people's web code, I have been citing the unicode security example from RFC 2279 as one good reason why web programmers must enforce 'anything not explicitly allowed is denied', almost since it was written. In commercial situations I have argued myself blue in the face that the equivalent of (perl speak) s!../!!g is not good enough to clean up filename form input parameters or other pathnames (in perl, ASP, PHP etc.). I always end up being proved right, but it takes a lot of effort. Should prove a bit easier from now on :-(
> It's a pity that a lot of UTF-8 decoders in free software fail such
> tests as well, either by design or careless implementation.
The warning in RFC 2279 hasn't been heeded by a single unicode decoder that I have ever tested, commercial or free, including the Solaris 2.6 system libraries, the Linux unicode_console driver, Netscape Communicator and now, obviously, IIS. It's unclear to me whether the IIS/NT unicode decoding is performed by a system-wide library or if it's custom to IIS - either way, it can potentially affect almost any unicode-aware NT application.
I have resisted upgrading various cgi and mod_perl based systems to perl5.6 because it has inbuilt (default?) unicode support, and I've no idea which applications or perl libraries might be affected. The problem is even harder than it looks - which sub-system, out of the http server, the perl (or ASP or PHP...) runtime, the standard C libraries and the kernel/OS can I expect to be performing the conversion? Which one will get it right? I think Bruce wildly understated the problem, and I've no idea how to put the brakes on the crash dive into a character encoding standard which seems to have no defined canonical encoding and no obvious way of performing deterministic comparisons.
I suppose as a security professional I should be happy, looking forward to a booming business...
Response from Ken Whistler
Thu 2000-10-19 18:05
First of all, I don't think this discussion should be cc'd to unicore, x3l2 *and* the unicode list. You are discussing the specifics of a (possible) proposed change to the Unicode Standard, and the best forum for that is unicore.
Regarding the specifics you are concerned about, I am becoming convinced
that the security community has decided this is a security problem in
UTF-8. I'm not convinced myself, yet, but in this area, the perception
of a problem *is* a problem.
The main issue I see is in the note to D32, which seems to imply that,
contrary to the statements elsewhere, irregular UTF-8 sequences *are*
o.k., and can in fact be interpreted.
I think we need to clean up the note to D32, and the text at the end of
section 3.8. The statements involving "shall" at the end of section 3.8,
with clear implications for conformance, should be upgraded to explicit
numbered conformance clauses under Transformations in Section 3.1. In
particular, those requirements are:
"When converting a Unicode scalar value to UTF-8, the shortest form
that can represent those values shall be used."
"Irregular UTF-8 sequences shall not be used for encoding any other
information."
These really belong as explicit conformance clauses (i.e., C12a and C12b).
Then all the hedging in D32 and at the end of Section 3.8 about well,
maybe you can interpret the non-shortest UTF-8 anyway should be
recast along these lines:
The Unicode Standard does not *require* a conformant process
interpreting UTF-8 to *detect* that an irregular code value sequence
has been used. [[ Fill in here, blah, blah, blah, about how more
efficient conversion algorithms can be written for UTF-8 if they
don't have to special-case non-shortest, irregular sequence UTF-8...]]
However, the Unicode Standard does *recommend* that any process
concerned about security issues detect and flag (by raising exceptions
or other appropriate means) any irregular code value sequence.
This recommendation is to help minimize the risk that a security
attack could be mounted by utilizing information stored in
irregular UTF-8 sequences undetected by an interpreting process.
If we cast things this way, it will be clear to all the concerned
security worrywarts (that's their job, man) that the Unicode Standard
has considered the issue and has a position on it. It will also be
clear in the conformance clauses that conformance to the Unicode
Standard itself *requires* the non-production of irregular UTF-8
sequences. However, the standard isn't going to reach out and place
a draconian *interpretation* requirement on a UTF-8 interpreting process
(most often, we are talking about a UTF-8 --> UTF-16 conversion
algorithm) that would force everybody to do the shortest value
checking in order to be conformant.
For a reductio ad absurdum, take the C library function strcpy().
As it stands now, right out of the box, the strcpy() function is
Unicode conformant for use with UTF-8. If you feed a null-terminated
UTF-8 string at it, it will correctly copy the contents of that
string into another buffer. But if we went for an overly strong
conformance clause regarding irregular sequence UTF-8, technically
strcpy() would no longer be conformant for use in a Unicode
application. You would have to rewrite it so it parsed out the
UTF-8 stream, checked for irregular sequences, and raised an
exception or returned an error if it ran into the sequence
0xC0 0x81, for example. I know the nitpickers can pick nits on
this example, since strcpy() really just copies code units, not
characters, but it wouldn't be too hard to find API's or processes
that are concerned with characters per se and that would have
similar problems if forced to detect and reject non-shortest UTF-8
in order to be conformant.
Comment from Mark Davis on Frank da Cruz’s email
Fri 2000-10-20 11:52
I think that is an excellent suggestion. The code on the Unicode site needs
to be updated.
The way I would do it is to have an extra parameter, say "secure". If ==
true, then a shortest-form check is made.
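Mark's suggested parameter might look something like the following sketch (hypothetical Python; this is not the actual cvtutf.c interface, and for brevity only 1- and 2-byte sequences are handled).

```python
def utf8_to_codepoints(data: bytes, secure: bool = True):
    """Decode 1- and 2-byte UTF-8 sequences to a list of code points.
    If `secure` is true, non-shortest (overlong) forms raise an error."""
    cps, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # single-byte (ASCII) sequence
            cps.append(b); i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
              and 0x80 <= data[i + 1] <= 0xBF):
            cp = ((b & 0x1F) << 6) | (data[i + 1] & 0x3F)
            if secure and cp < 0x80:       # shortest-form check
                raise ValueError(f"overlong sequence at offset {i}")
            cps.append(cp); i += 2
        else:
            raise ValueError(f"malformed sequence at offset {i}")
    return cps

print(utf8_to_codepoints(b"\xc2\xa9"))                # [169] (U+00A9)
print(utf8_to_codepoints(b"\xc0\xaf", secure=False))  # [47]  lenient mode
# utf8_to_codepoints(b"\xc0\xaf")  -> raises ValueError in secure mode
```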
----- Original Message -----
From: "Frank da Cruz" <firstname.lastname@example.org>
To: "Multiple Recipients of Unicore" <email@example.com>
Cc: "Markus Kuhn" <Markus.Kuhn@cl.cam.ac.uk>
Sent: Friday, October 20, 2000 07:49
Subject: Re: UTC action on malformed/illegal UTF-8 sequences?
> Ed Hart wrote:
> > Does the UTC need to address the issue of malformed and illegal UTF-8
> > sequences, etc.?
> There has been much discussion of this here and in other fora, particularly
> the Linux-UTF8 mailing list, even before the Bruce Schneier piece.
> Does the cvtutf.c program at the Unicode site need updating to guarantee
> it outputs only the shortest UTF8 sequences, and rejects overlong or
> malformed sequences?
> In any case, the best way to prevent UTF-8 from becoming a hacker's
> playground is to publish official, safe, portable routines to convert to
> and from UTF-8. Also, these routines should not have an "all rights
> reserved" copyright if you want them to find their way into software that
> everybody uses (I'm not a license bigot -- quite the opposite -- but I
> think the objective here must be to propagate safe UTF-8 conversion as
> widely as possible, and there are many who will not touch code that has a
> restrictive license -- and note that many consider the GPL a restrictive
> license).
> - Frank
Response from Ed Hart to Ken Whistler plus additional information
Fri 2000-10-20 12:16
You are right and I am now limiting the discussion to Unicore.
I had further discussions on this topic with my manager, who sent me the original article. In short, I strongly believe that the UTC must tighten the algorithm and conformance clause to require compliant implementations to implement error checking and shortest value checking.
First of all, I agree with Ken that just the perception of a problem makes it a problem. However, in this instance, my manager said that the security experts do not merely "perceive" the lack of error checking in UTF-8 to be a problem; the lack of error checking in UTF-8 implementations IS a problem. Moreover, I think that Unicode, Inc. is not in a position to fight a lawsuit over it. (For some reason, potential lawsuits seem to quickly get the attention of most boards of directors.) In short, the impact of not tightening the algorithm (and getting the change publicized) could be far-reaching and costly. Even if Unicode, Inc. does not have any direct financial consequences, managers in some organizations may decide not to implement Unicode because they perceive that "Unicode has security holes". Here, Ken's comment about perception rings strongly.
Here is some background on the specific event that triggered my posting. Hackers were discussing the loophole with respect to Microsoft's IIS. The security hole was so bad that when it was reported to Microsoft, the whole IIS team worked all Friday night on it, and then over the same weekend Microsoft called major customers at home with the message to go directly to work to install the patch. See "Microsoft urging users to patch 'serious' IIS security hole" at http://www.computerworld.com/cwi/story/0%2C1199%2CNAV47_STO52573_NLTam%2C00.html. The article does not mention that the problem resulted from the UTF-8 code implementation. My manager said that the UTF-8 details surfaced in discussions on the BUGTRAQ discussion list (www.securityfocus.com, which also points to the ComputerWorld article). Admittedly, the security problem arose from an implementation of UTF-8 rather than the Unicode description of the algorithm. However, the UTC must tighten the algorithm and the conformance requirements so that implementers have no comebacks to Unicode, Inc. over UTF-8. Moreover, while we're looking at UTF-8, we should also look at UTF-16 and other algorithms. As Frank da Cruz suggested, we need to look at the sample software as well.
Additional Comments from Mark Davis
Fri 2000-10-20 12:45
We should discuss this at more length in the meeting. I would view the
problem as follows -- but fill me in if I'm wrong:
I assume that ASCII repertoire characters are at issue. Probably
particularly sensitive are the C0 controls, but there could be others. A
scenario that would cause problems is the following:
process A: works in charset X. It does interesting things, and is sensitive
to certain characters or sequences of characters. It expects problematic
cases to be filtered before it sees them.
process B: converts UTF-8 to charset X, and hands it off to A. It does no
filtering of its own.
process C: filters bytes for items that will cause problems for A, then
hands it off to B. C assumes that 'a' can only be represented by the byte
0x61.
In this kind of scenario, the problem is that the 'filtering' step occurs
before the 'conversion to characters' step. I can see three possible fixes:
A. add filtering
B. check for shortest form
C. know about UTF-8.
Clearly in many real-world situations, the easiest solution is to fix B.
Because of that, I am leaning towards eliminating the shortest form issue.
However, we would have to deal with the issue of backwards compatibility,
and also situations where code can be simpler because it is guaranteed that
the text is in correct UTF-8 (no boundary conditions at all have to be
checked).
[Curiously, this is the only case where UTF-1 was better than UTF-8: there
was no shortest form issue in UTF-1.]
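The A/B/C pipeline above can be sketched as follows (a hypothetical Python illustration, handling only the 1- and 2-byte cases): the byte-level filter passes the overlong sequence, which the lenient converter then turns into a real slash.

```python
# Hypothetical illustration of the pipeline in the scenario above:
# process C filters raw bytes, process B decodes UTF-8, process A acts.

def process_c_filter(raw: bytes) -> bytes:
    """Filter that assumes '/' can only appear as the literal byte 0x2F."""
    if b"../" in raw:
        raise ValueError("path traversal")
    return raw

def process_b_decode(raw: bytes) -> str:
    """A lenient decoder that accepts overlong two-byte sequences."""
    out, i = [], 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(chr(b)); i += 1
        elif 0xC0 <= b <= 0xDF:        # two-byte lead; no shortest-form check
            out.append(chr(((b & 0x1F) << 6) | (raw[i + 1] & 0x3F))); i += 2
        else:
            raise ValueError("unsupported in this sketch")
    return "".join(out)

# 0xC0 0xAF is an overlong encoding of '/': it slips past the byte filter,
# then decodes to a real slash that process A will interpret.
attack = b".." + bytes([0xC0, 0xAF]) + b"etc/passwd"
filtered = process_c_filter(attack)   # passes: no literal b"../" present
print(process_b_decode(filtered))     # ../etc/passwd
```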
The other thing we could do to help address this problem is to add a
conformance test for UTF-8.
Comment from Doug Ewell [firstname.lastname@example.org]
Tue 2000-10-24 00:35
"Hart, Edwin F." <Edwin.Hart@jhuapl.edu> wrote:
> Does the UTC need to address the issue of malformed and illegal UTF-8
> sequences, etc.? The text in question is the example in D32 and the
> last sentence of the section on shortest encoding.
> The issue for UTC may be: If a process receives an "ill-formed"
> code sequence, should the standard specify the action or allow
> interpretation and give warnings (like RFC 2279). Will more software
> break if the ill-formed sequence is allowed or denied? Given the
> number of security problems and fixes I see a week, I personally
> think that the UTC needs to tighten the algorithms and require an
> exception condition rather than interpret the ill-formed code value
> sequences.
Ever since I first saw this topic come up on the list, I have moved
farther and farther over to the "security" side, and the conversion is
now complete. IMHO, there is *nothing* positive to be gained from
allowing 0xC0 0x80 to be interpreted as U+0000, as definition D32
explicitly allows, and as Markus Kuhn has pointed out, the code to
perform illegal-sequence checking is simple and quite fast (I know,
I've implemented it).
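The check Doug describes can indeed be compact; one sketch (hypothetical Python, using the minimum scalar value representable at each sequence length under RFC 2279's 1- to 6-byte scheme):

```python
# Minimum scalar value that genuinely requires each UTF-8 sequence length
# (per the 1- to 6-byte scheme of RFC 2279 discussed in this thread):
MIN_BY_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800,
                 4: 0x10000, 5: 0x200000, 6: 0x4000000}

def is_shortest_form(codepoint: int, nbytes: int) -> bool:
    """True if `codepoint` actually needs `nbytes` bytes in UTF-8."""
    return codepoint >= MIN_BY_LENGTH[nbytes]

print(is_shortest_form(0x2F, 2))   # False: 0xC0 0xAF is overlong
print(is_shortest_form(0xA9, 2))   # True:  0xC2 0xA9 is fine
```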
Ed had quoted Cris Bailiff <email@example.com> thusly:
> The warning in RFC 2279 hasn't been heeded by a single unicode
> decoder that I have ever tested, commercial or free, including the
> Solaris 2.6 system libraries, the Linux unicode_console driver,
> Netscape Communicator and now, obviously, IIS.
Well, obviously Cris has never tested MY decoder. (OK, that's not
fair, since I've never published it.) But then:
> I've no idea how to put the brakes on the crash dive into a character
> encoding standard which seems to have no defined canonical encoding
> and no obvious way of performing deterministic comparisons.
Now we're back to the Bruce Schneier premise that Unicode is horribly
and irreparably flawed, when the truth is that UTF-8 would be just as
secure as any other encoding form ever invented if the UTC would only
tighten the spec and forbid conformant decoders from interpreting
overlong sequences, as Edwin suggests.
Comment from Karlsson Kent - keka [firstname.lastname@example.org]
Tue 2000-10-24 12:18
The UTF-8 specification already mentions octet values FE and FF as
"not used" (page 894 of 10646-1:2000). Why not start from that,
which already disallows a subset of "overlong sequences" (namely
those of lengths 7 and 8 octets)? One should also avoid mentioning
"overlong sequences" in a specification like this, since 1) it is
not precise, and 2) it leads implementers to try to figure out exactly
what is being ruled out. What is ruled out should be said explicitly
and precisely. So, *in addition to the general UTF-8 rules*, I see
the following alternatives for the further restrictions to be applied:
Alt. 0, current: Forbidding anything over 31 bits (special case
of overlong sequences):
The octets FE and FF shall not be used.
Alt. 1: Forbidding overlong sequences and anything over 31 bits:
The octets C0, C1, FE, and FF shall not be used.
After an E0 octet the next octet shall be at least A0,
after an F0 octet the next octet shall be at least 90,
after an F8 octet the next octet shall be at least 88, and
after an FC octet the next octet shall be at least 84.
Alt. 2: Forbidding overlong sequences and anything over 21 bits:
The octets C0, C1, F8, F9, FA, FB, FC, FD, FE, and FF shall
not be used.
After an E0 octet the next octet shall be at least A0, and
after an F0 octet the next octet shall be at least 90.
Alt. 3: Forbidding overlong sequences and anything over 10FFFF:
The octets C0, C1, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE,
and FF shall not be used.
After an E0 octet the next octet shall be at least A0,
after an F0 octet the next octet shall be at least 90, and
after an F4 octet the next octet shall be at most 8F.
Suggestion: Forbid overlong sequences and anything over 10FFFF:
The octets C0, C1, and F5-FF shall not be used. After an E0
octet the next octet (as an unsigned integer) shall be at
least A0, after an F0 octet the next octet shall be at
least 90, and after an F4 octet the next octet shall be at
most 8F. (All numerals above are in hexadecimal notation.)
If this is to be applied to the text in annex D of 10646-1:2000,
there are some implied changes that I will not detail here.
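The suggested restrictions can be expressed directly as a byte-level check; a hypothetical Python sketch (covering only the extra restrictions named in the suggestion, not general UTF-8 well-formedness, which is assumed checked separately):

```python
# Octets forbidden outright by the suggestion: C0, C1, and F5-FF.
FORBIDDEN = {0xC0, 0xC1} | set(range(0xF5, 0x100))

def obeys_suggested_rules(seq: bytes) -> bool:
    """Check a byte sequence against the suggested octet restrictions:
    no forbidden octets, and the E0/F0/F4 second-octet bounds."""
    for i, b in enumerate(seq):
        if b in FORBIDDEN:
            return False
        nxt = seq[i + 1] if i + 1 < len(seq) else None
        if b == 0xE0 and nxt is not None and nxt < 0xA0:
            return False   # overlong three-byte sequence
        if b == 0xF0 and nxt is not None and nxt < 0x90:
            return False   # overlong four-byte sequence
        if b == 0xF4 and nxt is not None and nxt > 0x8F:
            return False   # beyond U+10FFFF
    return True

print(obeys_suggested_rules(bytes([0xC0, 0xAF])))        # False: C0 forbidden
print(obeys_suggested_rules(bytes([0xE0, 0x9F, 0x80])))  # False: overlong
print(obeys_suggested_rules(bytes([0xE0, 0xA0, 0x80])))  # True:  U+0800
```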