RE: How to distinguish UTF-8 from Latin-* ?

From: Karlsson Kent - keka
Date: Thu Jun 22 2000 - 13:49:10 EDT

> -----Original Message-----
> From: Robert A. Rosenberg

[on overlong UTF-8 sequences, a few lines down:]
> faked) files. I agree that missed the extra sanity check of looked for
> shortest string but if I remember the rules correctly, there is no
> requirement the shortest form be emitted - only a strong suggestion to do
> so (with a stronger suggestion to accept it [ie: "Be liberal with what
> you accept and conservative with what you create"]).

Well, there is a security aspect to this: sometimes a given text
needs to be scanned to determine whether it is "harmless" or
whether it may trigger some undesirable interpretation (for
instance as program code, such as a shell script). A hacker may
try to hide the characters that trigger the undesired, and potentially
dangerous, interpretation by using overlong UTF-8 sequences.
If the security scanner program does not "decode" overlong
UTF-8 sequences, but the interpreter accepts them as if nothing
was wrong, things you would not like to happen might happen.
So overlong UTF-8 sequences should be regarded as errors, and
not as an encoding of any character at all. Yes, you may regard
systems that have any "escapes" into "execute this" mode
as ill-designed. But they are around.
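
To make the scenario above concrete, here is a minimal sketch in Python.
The function name decode_utf8 and its allow_overlong flag are hypothetical
and only for illustration; they are not taken from any library or standard
text. With allow_overlong=True the decoder behaves like the too-liberal
interpreter described above, and with the default it rejects any sequence
that is longer than necessary. The sketch does not check for surrogates
or other finer points, so it is not a complete validator.

```python
# Smallest code point that legitimately needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_utf8(data: bytes, allow_overlong: bool = False) -> str:
    """Decode UTF-8 by hand (hypothetical helper, illustration only).

    With allow_overlong=False, a sequence that encodes a code point in
    more bytes than necessary raises ValueError instead of decoding.
    """
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            cp, n = lead, 1
        elif lead & 0xE0 == 0xC0:
            cp, n = lead & 0x1F, 2
        elif lead & 0xF0 == 0xE0:
            cp, n = lead & 0x0F, 3
        elif lead & 0xF8 == 0xF0:
            cp, n = lead & 0x07, 4
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + n > len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for trail in data[i + 1:i + n]:
            if trail & 0xC0 != 0x80:
                raise ValueError(f"invalid trail byte after offset {i}")
            cp = (cp << 6) | (trail & 0x3F)
        if not allow_overlong and cp < MIN_FOR_LENGTH[n]:
            raise ValueError(f"overlong sequence at offset {i}")
        out.append(chr(cp))
        i += n
    return "".join(out)


# 0xC0 0xBB is an overlong two-byte form of ';' (U+003B).
payload = b"ls\xc0\xbbrm"

# A byte-level scanner looking for ';' sees nothing suspicious ...
scanner_sees_semicolon = b";" in payload

# ... but a liberal decoder hands the interpreter a real ';'.
liberal_result = decode_utf8(payload, allow_overlong=True)
```

Here scanner_sees_semicolon is False while liberal_result contains the
semicolon; calling decode_utf8(payload) with the default raises ValueError
instead. Python's own bytes.decode("utf-8") also rejects such input.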

                Kind regards
                /kent k

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT