Re: How to distinguish UTF-8 from Latin-* ?

From: Doug Ewell
Date: Fri Jun 23 2000 - 03:00:56 EDT

Kent Karlsson wrote:

> A hacker may try to hide characters that trigger the undesired, and
> potentially dangerous, interpretation, by using overlong UTF-8
> sequences. If the security scanner program does not "decode" overlong
> UTF-8 sequences, but the interpreter accepts them as if nothing
> was wrong, things you would not like to happen might happen.
> So overlong UTF-8 sequences should be regarded as errors, and
> not as a coding for any character at all.

I agree that overlong ("irregular") sequences should be trapped, but in
reading Kent's text as well as RFC 2279 and Markus Kuhn's "Unicode and
UTF-8" web page, I am a bit puzzled by the assumption that programs
able to interpret UTF-8 at all will at some point pass certain
sequences through uninterpreted, allowing (e.g.) C0 AF to masquerade as
U+002F.
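
To make the C0 AF case concrete, here is a minimal Python sketch of a
deliberately lenient two-byte decoder, i.e. one that performs no overlong
check. It is only an illustration: the name naive_decode_2byte is
hypothetical and is not taken from RFC 2279 or from any real library.

```python
def naive_decode_2byte(b0: int, b1: int) -> int:
    """Decode a two-byte UTF-8 sequence WITHOUT an overlong check.

    Hypothetical helper for illustration only; a conforming decoder
    would reject any result below 0x80.
    """
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # 110xxxxx 10yyyyyy -> xxxxxyyyyyy
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


raw = bytes([0xC0, 0xAF])

# A scanner that only looks at raw bytes never sees a slash...
print(b"/" in raw)                               # False

# ...but the lenient decoder produces U+002F from the same bytes.
print(hex(naive_decode_2byte(raw[0], raw[1])))   # 0x2f
```

The mismatch only arises when the component doing the checking looks at
bytes and the component doing the interpreting looks at decoded characters.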

UTF-8 decoding is so processor-efficient (compared with, say, UTF-1)
that I would think any program that expects to deal with UTF-8 text
should just go ahead and decode it from the outset. Whether it chooses
to accept or reject overlong sequences, either way the cracker's
"hidden" rogue characters are revealed and not given a chance to perform
their dirty deeds. If decoding UTF-8 were a more expensive task, I
would probably sympathize more with the need to take shortcuts, but this
is not the case.
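
As a rough sketch of "decode it from the outset", the following Python
function decodes a byte string to code points before anything else inspects
it, and treats overlong forms as errors. The assumptions are mine: the name
decode_utf8_strict is hypothetical, only sequences of up to four bytes are
handled (RFC 2279 also defines five- and six-byte forms), and surrogates and
values above U+10FFFF are not checked.

```python
def decode_utf8_strict(data: bytes) -> list:
    """Decode UTF-8 to a list of code points; raise ValueError on
    malformed or overlong input. Illustrative sketch, not a full validator."""
    # (lead-byte mask, lead-byte pattern, continuation bytes, smallest legal value)
    forms = [
        (0x80, 0x00, 0, 0x0000),
        (0xE0, 0xC0, 1, 0x0080),
        (0xF0, 0xE0, 2, 0x0800),
        (0xF8, 0xF0, 3, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        for mask, pattern, ncont, minimum in forms:
            if (lead & mask) == pattern:
                break
        else:
            raise ValueError(f"invalid lead byte 0x{lead:02X} at offset {i}")

        # Payload bits of the lead byte are the ones outside the mask.
        cp = lead & ~mask & 0xFF
        tail = data[i + 1 : i + 1 + ncont]
        if len(tail) != ncont:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in tail:
            if (b & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte at offset {i}")
            cp = (cp << 6) | (b & 0x3F)

        # A value that would have fit in a shorter form is overlong.
        if cp < minimum:
            raise ValueError(f"overlong sequence at offset {i}")
        out.append(cp)
        i += 1 + ncont
    return out


print(decode_utf8_strict(b"a/b"))        # [97, 47, 98]
try:
    decode_utf8_strict(bytes([0xC0, 0xAF]))
except ValueError as err:
    print("rejected:", err)
```

A lenient variant that dropped the minimum check would return U+002F for
C0 AF instead of raising. Either way, whatever inspects the output sees the
slash, since it is working on code points rather than raw bytes.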

-Doug Ewell
 Fullerton, California
