Re: Why UTF-8 decoders must reject overlong sequences

From: Dan Oscarsson
Date: Tue Sep 28 1999 - 09:29:16 EDT

>You miss the whole point of UTF-8: ASCII compatibility! UTF-8 was
>designed such that a large class of applications intended for ASCII can
>continue to be used with UTF-8 without ANY modification whatsoever.
>These applications typically treat text as arbitrary zero-free byte
>strings and do not care about what the characters are, except that
>certain ASCII characters have a special meaning. (Such as '/' in the
>file system.) Now think about a processing pipeline of several programs,
>including two types:
>Unsafe UTF-8 decoders are a hassle for people who want to convert
>existing ASCII programs with minimal effort to support UTF-8. For
>instance, an ASCII editor requires significantly fewer modifications if
>the programmer can assume that line feeds will in UTF-8 also be encoded
>as 0x0a and ONLY 0x0a. Then line feeds can still be reliably recognized
>by the same mechanism that worked already for ASCII.
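To make the quoted line-feed case concrete, here is a minimal sketch in
Python. The helper name decode_two_byte and its strict flag are made up
for illustration and are not part of any library. It shows how a lax
decoder would turn the overlong pair 0xC0 0x8A into a line feed that a
byte-level search for 0x0a never sees, and how a strict decoder refuses it.

```python
def decode_two_byte(b0: int, b1: int, strict: bool = True) -> int:
    """Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) to a code point.

    Hypothetical helper for illustration only; it handles just the
    two-byte form, not the full UTF-8 range.
    """
    if b0 & 0xE0 != 0xC0 or b1 & 0xC0 != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    cp = ((b0 & 0x1F) << 6) | (b1 & 0x3F)
    # Anything below 0x80 fits in one byte, so a two-byte form is overlong.
    if strict and cp < 0x80:
        raise ValueError("overlong encoding of U+%04X" % cp)
    return cp


overlong_lf = bytes([0xC0, 0x8A])

# A byte-oriented ASCII tool finds no line feed in this data ...
print(b"\n" in overlong_lf)                    # False

# ... but a lax decoder further down the pipeline produces one anyway.
print(hex(decode_two_byte(0xC0, 0x8A, strict=False)))   # 0xa

# A strict decoder rejects the sequence instead.
try:
    decode_two_byte(0xC0, 0x8A)
except ValueError as err:
    print("rejected:", err)
```

Python's built-in decoder behaves like the strict case here:
bytes([0xC0, 0x8A]).decode("utf-8") raises UnicodeDecodeError.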

OK. I see your point. But I thought the most important point of UTF-8 was
that we have a well-defined character set (UCS) and a simple encoding
form of UCS for interoperability.

The ASCII compatibility of UTF-8 is worthless to me; I need ISO 8859-1
compatibility. It does not help me that the programs on my Unix system
handle UTF-8 only through ASCII compatibility. And the fact that UTF-8 is
ASCII compatible will probably make things even worse: instead of being
forced to internationalise their programs, many programmers will probably
take the easy way out and just check that UTF-8 does not crash the program.
True, a line count program will work fine with UTF-8 as long as
line endings are compatible, but it is one of the very few programs that
will work well. It would have been better to have a UTF-8 that was
not ASCII compatible, so that people really had to do something
with their programs.
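
A rough sketch of what I mean, in Python (the function names are my own,
purely for illustration). Counting lines on the raw bytes gives the right
answer for well-formed UTF-8, because every byte of a multi-byte sequence
has its high bit set and so 0x0a can only be a real line feed. Anything
that assumes one byte per character does not fare as well.

```python
def count_lines(data: bytes) -> int:
    """Count line feeds in raw bytes, as an ASCII-era tool would."""
    return data.count(b"\n")


def naive_char_count(data: bytes) -> int:
    """Assume one byte per character, as an unmodified ASCII program does."""
    return len(data)


# "smörgås" followed by a line feed, written with escapes so the
# example does not depend on the encoding of this message.
text = "sm\u00f6rg\u00e5s\n"
data = text.encode("utf-8")

print(count_lines(data))        # 1: the line count is still correct
print(len(text))                # 8: actual number of characters
print(naive_char_count(data))   # 10: the byte count overstates the length
```

The same mismatch shows up in column alignment, truncation to a fixed
width, and case conversion, none of which ASCII compatibility fixes.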

So the major benefit of UTF-8 is, as far as I can see, not ASCII
compatibility but interoperability! When we send data between systems
using one character set (UCS) with a well-defined, simple encoding and
normalisation, we have removed one of the greatest hindrances
to interoperability.

