RE: UTF-8 signature in web and email

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed May 23 2001 - 05:35:56 EDT


David Starner wrote:
> You're asking for every program to treat UTF-8 specially.

No I am not! I have been saying the exact opposite!

ZWNBSP is just one more multibyte character, and UTF-8 is just one more
multibyte encoding. Why should this case be so special?

> [...]
> of now, UTF-8 is just one of many charsets in use on Unix.

Exactly! So why do Unixers worry about bytes <0xEF, 0xBB, 0xBF> (a kind of
space in Unicode, often called "BOM") more than they do about byte <0xA0> (a
kind of space in ISO-8859-1) or bytes <0xA1, 0xA1> (a kind of space in
EUC-GB)?

If there is a problem, it seems to me that the three examples above are
three instances of the same one:

1) If you interpret each sequence according to its proper encoding, it is a
single white-space character (see the sketch after this list);

2) If you interpret it as mere bytes, it is one or more unknown characters
outside the ASCII range.
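
Just to make the first point concrete, here is a minimal C sketch (my own
illustration, assuming nothing beyond the standard three-byte UTF-8 bit
layout) that decodes the UTF-8 case by hand. The other two cases are plain
table lookups: 0xA0 is U+00A0 NO-BREAK SPACE in Latin-1, and <0xA1, 0xA1>
is the full-width space U+3000 in the EUC encoding.

    #include <stdio.h>

    int main(void)
    {
        const unsigned char s[] = { 0xEF, 0xBB, 0xBF };

        /* A three-byte UTF-8 sequence 1110xxxx 10yyyyyy 10zzzzzz
           encodes the scalar value xxxxyyyyyyzzzzzz. */
        unsigned long cp = ((unsigned long)(s[0] & 0x0F) << 12)
                         | ((unsigned long)(s[1] & 0x3F) << 6)
                         |  (unsigned long)(s[2] & 0x3F);

        printf("EF BB BF decodes to U+%04lX\n", cp); /* prints U+FEFF */
        return 0;
    }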

Now, imagine a compiler for some C-like language. If it supports UTF-8 (or
Latin-1, or EUC-GB), then when it receives source lines like these:

        int \0xEF\0xBB\0xBF i; /* Unicode (UTF-8) */
        int \0xA0 i; /* ISO-8859-1, aka "Latin 1" */
        int \0xA1\0xA1 i; /* GB12345-80 (EUC) */

it will correctly interpret the sequences of bytes >= 0x80 as "white space"
in the respective encoding, so it will parse each declaration as "int i;".

On the other hand, if the compiler does NOT understand these encodings, it
will parse each of them as "int WHATSTHAT i;" and issue a syntax error.
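
To make the encoding-aware case concrete, the compiler's whitespace skipper
might look something like this sketch (purely illustrative, assuming a
NUL-terminated UTF-8 buffer; the function name is mine, not any real
compiler's):

    /* Skip white space in a UTF-8 source buffer, treating the three
       bytes of U+FEFF exactly like a blank or a tab. */
    const char *skip_space_utf8(const char *p)
    {
        for (;;) {
            if (*p == ' ' || *p == '\t' || *p == '\n' || *p == '\r') {
                p++;
            } else if ((unsigned char)p[0] == 0xEF &&
                       (unsigned char)p[1] == 0xBB &&
                       (unsigned char)p[2] == 0xBF) {
                p += 3; /* U+FEFF, the zero-width no-break space */
            } else {
                return p; /* start of the next token */
            }
        }
    }

An encoding-blind lexer simply has no such branch, so the same three bytes
reach its tokenizer as garbage.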

In the second case, if you want to compile *that* program with *that*
compiler, you will have to use some preprocessor to convert the text from
the unsupported character set to the encoding used by the compiler.
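
On a modern Unix that preprocessor can be as simple as the iconv(1) utility
(e.g. "iconv -f ISO-8859-1 -t UTF-8 file.c"). The same idea at the library
level, as a sketch with POSIX iconv(3) (the function name and the choice of
charsets are just for illustration):

    #include <iconv.h>
    #include <stddef.h>

    /* Convert a buffer from the unsupported charset (here Latin-1)
       to the one the compiler understands (here UTF-8).
       Returns 0 on success, -1 on failure. */
    int to_compiler_charset(char *in, size_t inleft,
                            char *out, size_t outleft)
    {
        iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1)
            return -1;

        size_t rc = iconv(cd, &in, &inleft, &out, &outleft);
        iconv_close(cd);
        return rc == (size_t)-1 ? -1 : 0;
    }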

Where do UTF-8 and the BOM make things more or less complex than they used
to be?

> This will probably just end up as another CRLF/LF issue,
> requiring plain text
> crossing from one system to another be changed.

That's what I am worrying about as well.

DOS users have always had these annoying problems importing Unix text
files, because of Unix's fanciful reinterpretation of the ASCII control
0x0A as a "line break".

If Unix's designers had followed the standards, they would have seen that
the only standard way to express a "line break" in ASCII is to combine 0x0A
(meaning "move the cursor down one line") with 0x0D (meaning "move the
cursor to the beginning of the line"), and today we wouldn't have this
cross-system inconsistency.

This is why, IMHO, Unixers should avoid further fanciful reinterpretations
of character semantics!

Encoding-aware programs that "understand" Unicode should treat U+FEFF
according to its literal meaning: a no-break space of zero width.

Encoding-blind programs, which handle bytes without interpreting them,
should simply pass the sequence <0xEF, 0xBB, 0xBF> through, as they do with
any other byte string.
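
And that second behaviour costs nothing: sketched in full, an
encoding-blind filter is just a byte-copy loop, and <0xEF, 0xBB, 0xBF> goes
through it untouched like any other bytes.

    #include <stdio.h>

    /* A complete encoding-blind filter: it copies bytes and never
       interprets them, so a leading EF BB BF passes through intact. */
    int main(void)
    {
        int c;
        while ((c = getchar()) != EOF)
            putchar(c);
        return 0;
    }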

_ Marco


