RE: UTF-8 signature in web and email

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu May 24 2001 - 07:10:10 EDT


David Starner wrote:
> > > of now, UTF-8 is just one of many charsets in use on Unix.
> >In fact! So why do Unixers worry about bytes <0xEF, 0xBB,
> 0xBF> [...]
> Because if 0xA0 or 0xA1 0xA1 (or 0x20) show at the start of a script,
> it's wrong. [...]

OK. I had written a reply to all your points but then, re-reading it, I saw
that the whole discussion was becoming quite pointless: what exactly are we
arguing about?

As you said, this risks turning into a flame war about nothing, while it is
quite probable that we actually agree on most things.

So I will re-explain my points in a (hopefully) clearer way:

1) I agree that emitting a UTF-8 signature is a bad idea (whether on Unix or
on other OSes).

2) It seems that you and I are not the only ones who agree on point 1.
Windows applications, for instance, don't use the signature. I am beginning to
see many Web pages and e-mails in UTF-8 and, as far as I can see, none of them
starts with a BOM. And it seems to me that not even the Unicode
Consortium itself is pushing very much on this idea; on the contrary, I have
seen a certain insistence on how easy it is to detect UTF-8 with *no* need
of a signature.
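
To illustrate the kind of signature-free detection meant in point 2, here is a
minimal sketch in Python. The function name looks_like_utf8 is made up for
this example, and the rule it applies (well-formed UTF-8 containing at least
one non-ASCII byte) is only a heuristic, not a method prescribed by anyone:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data is well-formed UTF-8 with some non-ASCII byte."""
    try:
        # Strict decoding raises on any malformed UTF-8 sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but it says nothing about the charset,
    # so this sketch (by assumption) does not count it as evidence.
    return any(byte >= 0x80 for byte in data)
```

Text in an 8-bit charset such as Latin-1 rarely happens to form valid UTF-8
multi-byte sequences, which is why a check of this kind works reasonably well
without any signature.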

3) However, generally speaking, any Unicode character can be in any position
in a plain text file, and any Unicode character should be treated according
to its semantics by a Unicode application. So a BOM, too, should be
accepted in a general text file, whether at the beginning or elsewhere. If
you disagree with this, you disagree with the very concept of "plain text",
IMHO.
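
As a small illustration of point 3, the following Python sketch shows that a
plain UTF-8 decoder treats the bytes <0xEF, 0xBB, 0xBF> like any other
character, wherever they occur. The sample byte string is invented for the
example:

```python
import unicodedata

# Sample data (made up): U+FEFF at the start and again in the middle.
data = b"\xef\xbb\xbfabc\xef\xbb\xbfdef"

# The plain "utf-8" codec gives EF BB BF no special treatment:
# each occurrence simply decodes to the character U+FEFF.
text = data.decode("utf-8")

positions = [i for i, ch in enumerate(text) if ch == "\ufeff"]
name = unicodedata.name("\ufeff")          # the character's Unicode name
category = unicodedata.category("\ufeff")  # its general category
```

What the application then does with the character is a matter of its
semantics (a zero-width format character), not of its position in the file.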

4) A BOM at the beginning of a file (or elsewhere) may indeed be against the
syntax of some computer languages. But this is nothing new! The usage of
*any* character is by definition not "free" when it is subject to a formal
syntax.

5) Provided that point 1 is respected, the problem at point 4 may only occur
in very unlikely cases, e.g.:
        - The source file is imported from another environment that uses
UTF-8 signatures (but I currently know no such environment);
        - The programmer explicitly inserts U+FEFF at the beginning of the
file (which is clearly an absurd thing to do. And, in any case, how could she
do it? It is unlikely that many keyboards have a BOM key);
        - The source file was originally written in UTF-16 or UTF-32, and it
has undergone an overly naive conversion, which copied the BOM along. In this
case, it is just a small fix to be done in the program that converts from
one encoding to another.
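
The "small fix" in the last case could look roughly like the Python sketch
below. The function name utf16_to_utf8 is hypothetical, and treating input
without a BOM as big-endian is an assumption of this example, not something
every converter does:

```python
import codecs

def utf16_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16 bytes to UTF-8 without copying the BOM along."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # Use the BOM to pick the byte order, then drop it.
        text = data[2:].decode("utf-16-le")
    elif data.startswith(codecs.BOM_UTF16_BE):
        text = data[2:].decode("utf-16-be")
    else:
        # Assumption for this sketch: no BOM means big-endian.
        text = data.decode("utf-16-be")
    # The plain "utf-8" encoder never adds a signature of its own.
    return text.encode("utf-8")

# The naive conversion, for comparison: the BOM is decoded as an ordinary
# U+FEFF and re-encoded as EF BB BF at the start of the output.
source = codecs.BOM_UTF16_LE + "#!/bin/sh\n".encode("utf-16-le")
naive = source.decode("utf-16-le").encode("utf-8")
fixed = utf16_to_utf8(source)
```

Only the leading U+FEFF is dropped; any U+FEFF elsewhere in the text is
converted like every other character.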

6) In the unlikely case that problem 4 occurs, the presence of the BOM will
probably not be the only problem in the source file. You will have to edit it
anyway and change the incompatible syntax. E.g. you said that commands in a
makefile should be preceded by a TAB. But this is not true with my DOS Make:
I can use one or more SPACEs or TABs, in any combination. So, regardless of
UTFs or BOMs, you won't be able to use my makefiles on Unix without some
editing.

I hope I have made my thinking clearer. Could you perhaps restate your point as
well? I must confess that I haven't quite grasped what you want to change
and where.

Ciao.
_ Marco


