RE: UTF-8 signature in web and email

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed May 23 2001 - 08:51:14 EDT


John Cowan wrote:
> Well, "C-like language" is a hedge. IIRC, C99 thinks
> everything above U+007F is a letter.

OK, it was a hedge. I just wanted a scenario of plain text usage familiar to
programmers, and where visualization was not the main thing.

You can choose another example, if you prefer.

E.g., take a word-counting utility and these three text files:

1) In UTF-8, containing: "one <0xEF, 0xBB, 0xBF> two"
2) In ISO-8859-1, containing: "one <0xA0> two"
3) In GB-EUC, containing: "one <0xA1, 0xA1> two"

A "wc" capable of handling these three encodings will count two words for
each file. A "wc" that only understands ASCII and assumes that anything >
0x7F is a graphic character will count three words for each file.

In the latter case, if you want a correct word count, you have to preprocess
the file with some sort of converter that, e.g., folds any kind of blank
character to 0x20.
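
To make this concrete, here is a quick sketch of mine (not the real wc(1),
and handling only the UTF-8 file; files 2 and 3 would need their own
decoders) of a word counter whose decoder treats U+FEFF just like any other
blank:

    /* A quick sketch, not the real wc(1): a UTF-8 word counter whose
     * decoder knows that U+FEFF, U+00A0 and U+3000 are blanks.
     * No validation of malformed sequences. */
    #include <stdio.h>

    /* Decode one UTF-8 sequence from fp; return the code point, or EOF. */
    static long get_codepoint(FILE *fp)
    {
        int c = getc(fp);
        if (c == EOF) return EOF;
        if (c < 0x80) return c;                       /* single byte      */
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
        long cp = c & (0x3F >> extra);                /* bits of the lead */
        while (extra-- > 0) {
            int t = getc(fp);
            if (t == EOF) return EOF;
            cp = (cp << 6) | (t & 0x3F);              /* trailing 6 bits  */
        }
        return cp;
    }

    /* Blanks: ASCII whitespace, U+00A0 (what Latin-1 0xA0 maps to),
     * U+3000 (what GB 0xA1 0xA1 maps to), and U+FEFF itself. */
    static int is_blank_cp(long cp)
    {
        return cp == 0x20 || cp == 0x09 || cp == 0x0A || cp == 0x0D ||
               cp == 0x00A0 || cp == 0x3000 || cp == 0xFEFF;
    }

    int main(void)
    {
        long cp, words = 0;
        int in_word = 0;
        while ((cp = get_codepoint(stdin)) != EOF) {
            if (is_blank_cp(cp)) {
                in_word = 0;
            } else if (!in_word) {
                in_word = 1;
                words++;
            }
        }
        printf("%ld\n", words);
        return 0;
    }

Fed file 1 above, it prints 2; a byte-oriented counter that treats
everything above 0x7F as a graphic character prints 3.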

However, you may notice that UTF-8 in general, and character U+FEFF in
particular, don't complicate things more than they already are.

<OT historical speculation>

> The ambiguity of 0x0A as "line feed" versus "new line" was
> present from the beginning: at least some Teletypes had a
> mode to treat 0x0A as "new line".
> [...]
> Nope. The standard just wasn't that precise. Unix folk
> wanted a single
> line separator character, and 0x0A was the obvious choice. And it
> worked on their Model 37 Teletypes, [...]

You don't convince me. Rather, you feed my theory with more historical
details.

The fathers of Unix used a teletype with a "non-standard" mode turned on,
and assumed that any other device worked the same as *their* teletype with
*their* settings.

They didn't even consider following any standard: they simply tried a
sequence on their machine and liked what happened. And they didn't realize
that it worked only because that machine had a hack to send one byte less
per line!

The sequence <0x0D, 0x0A> was the right choice, because it would have worked
for a Model 33, for a Model 37 in "standard mode", and also for a Model 37 in
"non-standard mode" (where the 0x0D would have been useless, but would not
have caused any problem).

Similarly,

</OT>

the right choices for handling the UTF-8 signature are, IMHO:

1) Don't write it at the beginning of files if you don't need it. It is not
mandatory to have it. If another program on another OS breaks because of
this, it's THEIR fault. (Side note: of course, if Unicode should ever make
the signature mandatory, then THEY would be the ones at fault!)

2) In programs capable of decoding UTF-8, treat it as a simple "zero-width
no-break space": just another one of the many space characters in Unicode.
An occasional ZWNBSP at the beginning of a file is simply "useless" to your
application, but should not break anything. (See the sketch after point 4.)

3) In encoding-unaware programs, just transmit it transparently as any other
sequence of bytes. (Because a sequence of bytes found inside a generic file
can be anything: not necessarily UTF-8 text, not necessarily Unicode text,
not necessarily text at all: it might well be a piece of digitized music
or the header of a picture!)

4) Encoding-aware programs which do not understand UTF-8 will break. Of
course!! You have some choices: don't use those programs with files in UTF-8
(or any other unsupported encoding); enhance them to support UTF-8 (or any
other encoding you might need); convert the file to an encoding supported by
the program.
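
By way of illustration for point 2, a tiny self-contained sketch (the
function name is mine, and the list of code points is deliberately partial):
if U+FEFF is classified together with the other space-like characters, a
signature at the start of a file falls through the normal white-space path
and nothing breaks.

    /* Sketch for point (2): U+FEFF handled as "just another space".
     * is_space_like() is a hypothetical helper.  (Strictly, Unicode does
     * not give ZWNBSP the White_Space property; the point here is only
     * that treating it as a blank is harmless.) */
    #include <stdio.h>

    static int is_space_like(long cp)
    {
        switch (cp) {
        case 0x0009: case 0x000A: case 0x000D: case 0x0020:
        case 0x00A0:                  /* NO-BREAK SPACE              */
        case 0x2028: case 0x2029:     /* LINE / PARAGRAPH SEPARATOR  */
        case 0x3000:                  /* IDEOGRAPHIC SPACE           */
        case 0xFEFF:                  /* ZWNBSP, i.e. the signature  */
            return 1;
        default:
            return 0;
        }
    }

    int main(void)
    {
        printf("%d %d\n", is_space_like(0xFEFF), is_space_like(0x0041)); /* 1 0 */
        return 0;
    }

An application whose scanner uses such a test simply skips over a leading
signature, exactly as it skips any other stray blank.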

_ Marco
