Re: Why UTF-8 decoders must reject overlong sequences

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Mon Sep 27 1999 - 08:23:37 EDT


Dan Oscarsson wrote on 1999-09-27 11:14 UTC:
> The warning in RFC 2279 is not relevant.

I strongly disagree.

> But if you have software that is going to handle UTF-8 encoded data,
> in the UTF-8 encoded form, then you must check that the UTF-8 encoded
> data is of the format the software expects. Otherwise, for example,
> pattern matching may fail.

You miss the whole point of UTF-8: ASCII compatibility! UTF-8 was
designed such that a large class of applications intended for ASCII can
continue to be used with UTF-8 without ANY modification whatsoever.
These applications typically treat text as arbitrary zero-free byte
strings and do not care what the characters are, except that certain
ASCII characters have a special meaning (such as '/' in the file
system). Now think about a processing pipeline of several programs,
including two types:

  - Some, which were designed just for ASCII and remain, perfectly
    correctly, ignorant of the differences between UTF-8 and,
    say, CP437.

  - Others, which are fully aware of Unicode, contain a full
    UTF-8 decoder, and use Unicode values internally.

The possible coding ambiguity for ASCII characters (in the form of
overlong sequences) now allows people to encode ASCII characters in a
form that is seen by Unicode applications but not by ASCII
applications; the sketch after the following list shows a concrete
case. There are two possible fixes for this incompatibility:

  - The silly one: all applications have to contain UTF-8 decoders.
    (Silly, because this defies the whole purpose of UTF-8, which is
    to not require modifications of a large class of programs.)

  - The good one: all UTF-8 decoders must reject overlong sequences
    and therefore remain ASCII compatible by not accepting characters
    in the range U+0000..U+007F that would not have been recognized
    as ASCII characters by ASCII programs.
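
To make this concrete, here is a minimal sketch in C of a decoder
that rejects overlong sequences. It is illustrative only, not taken
from any particular implementation: it assumes a NUL-terminated
buffer (the NUL fails the continuation-byte tests, so a truncated
sequence at the end is caught) and stops at four bytes for brevity.
The two-byte form 0xC0 0xAF of '/' (U+002F) is the classic overlong
example:

  #include <stdio.h>

  /* Decode one UTF-8 sequence starting at s into *cp. Return the
   * number of bytes consumed, or -1 if the sequence is malformed
   * or overlong. */
  static int utf8_decode(const unsigned char *s, long *cp)
  {
      if (s[0] < 0x80) {                        /* U+0000..U+007F */
          *cp = s[0];
          return 1;
      }
      if ((s[0] & 0xe0) == 0xc0) {              /* 2-byte sequence */
          if ((s[1] & 0xc0) != 0x80) return -1;
          *cp = ((long)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
          if (*cp < 0x80) return -1;            /* overlong */
          return 2;
      }
      if ((s[0] & 0xf0) == 0xe0) {              /* 3-byte sequence */
          if ((s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
              return -1;
          *cp = ((long)(s[0] & 0x0f) << 12) |
                ((long)(s[1] & 0x3f) <<  6) | (s[2] & 0x3f);
          if (*cp < 0x800) return -1;           /* overlong */
          return 3;
      }
      if ((s[0] & 0xf8) == 0xf0) {              /* 4-byte sequence */
          if ((s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80 ||
              (s[3] & 0xc0) != 0x80)
              return -1;
          *cp = ((long)(s[0] & 0x07) << 18) |
                ((long)(s[1] & 0x3f) << 12) |
                ((long)(s[2] & 0x3f) <<  6) | (s[3] & 0x3f);
          if (*cp < 0x10000) return -1;         /* overlong */
          return 4;
      }
      return -1;              /* stray continuation byte or bad lead */
  }

  int main(void)
  {
      /* 0xC0 0xAF is an overlong encoding of '/' (U+002F): a
       * decoder that accepted it would conjure up a '/' that no
       * ASCII program ever saw. */
      const unsigned char overlong_slash[] = { 0xc0, 0xaf, 0x00 };
      long cp;
      printf("%d\n", utf8_decode(overlong_slash, &cp));   /* -1 */
      return 0;
  }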

UTF-8 decoders that accept overlong sequences implement only the first
half of what it takes to keep UTF-8 ASCII compatible. It may boggle
the mind a bit at first, but it is of crucial importance that, for
some kinds of compatibility, certain sequences must be rejected. You
get better compatibility by not trying to decode everything.

Using unmodified ASCII programs to handle UTF-8 becomes significantly
safer if we have a guarantee that UTF-8 decoders later in the pipeline
will reject overlong sequences.
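
As a hypothetical illustration (the file name and the byte-level
check are invented for the example): an unmodified ASCII-era filter
that scans for the byte 0x2F passes the overlong form 0xC0 0xAF,
because no 0x2F byte is present. Only a strict decoder further down
the pipeline keeps that hidden '/' from ever materializing:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* contains the overlong '/' 0xC0 0xAF, but no 0x2F byte */
      const char *name = "etc\xc0\xafpasswd";
      printf("contains a '/' byte: %s\n",
             strchr(name, '/') ? "yes" : "no");  /* prints "no" */
      return 0;
  }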

Compatibility of A and B not only means the guarantee that something
works for both A and B; it often also means the guarantee that
something does *not* work for both A and B. Computer security is the
most dramatic area where this shows up, because computer security is
a field that depends to a great degree on things that are guaranteed
not to work.

Unsafe UTF-8 decoders are a hassle for people who want to convert
existing ASCII programs to support UTF-8 with minimal effort. For
instance, an ASCII editor requires significantly fewer modifications
if the programmer can assume that line feeds will in UTF-8 also be
encoded as 0x0A and ONLY 0x0A. Then line feeds can still be reliably
recognized by the same mechanism that already worked for ASCII.
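
A sketch of that mechanism (the text buffer is just an example):
every byte of a well-formed multi-byte UTF-8 sequence is >= 0x80, so
a byte equal to 0x0A can only ever be a real line feed. Only the
overlong form 0xC0 0x8A could smuggle a U+000A past this loop, which
is exactly what a strict decoder refuses to produce.

  #include <stdio.h>

  /* ASCII-era line counter: works unmodified on UTF-8 text. */
  static size_t count_lines(const unsigned char *buf, size_t len)
  {
      size_t i, n = 0;
      for (i = 0; i < len; i++)
          if (buf[i] == 0x0a)
              n++;
      return n;
  }

  int main(void)
  {
      /* "Gruesse\n" with u-umlaut and sharp s, encoded in UTF-8;
       * all non-ASCII bytes are >= 0x80. */
      const unsigned char text[] = "Gr\xc3\xbc\xc3\x9f" "e\n";
      printf("%lu\n",
             (unsigned long)count_lines(text, sizeof text - 1)); /* 1 */
      return 0;
  }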

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>


