RE: UTF-8 signature in web and email

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Thu May 24 2001 - 01:09:41 EDT


At 11:35 AM 05/23/2001 +0200, Marco Cimarosti wrote:
>David Starner wrote:
> > You're asking for every program to treat UTF-8 specially.
>
>No I am not! I have been saying the exact opposite!

[...]

> > [...]
> > of now, UTF-8 is just one of many charsets in use on Unix.
>
>In fact! So why do Unixers worry about bytes <0xEF, 0xBB, 0xBF> (a kind of
>space in Unicode, often called "BOM") more than they do about byte <0xA0> (a
>kind of space in ISO-8859-1) or bytes <0xA1, 0xA1> (a kind of space in
>EUC-GB)?

Because if 0xA0 or 0xA1 0xA1 (or 0x20) shows up at the start of a script,
it's wrong. If 0xA0 or 0xA1 0xA1 (or 0x20) shows up at the start of a line
in a Makefile (a line that's supposed to start with a tab), it's wrong. If
0xA0 is at the end of a line, or adds an extra space to the start of a line,
when being processed by gnat with style checking, it's wrong. Hence a BOM at
any of those places is wrong, and the problem is only made worse by the fact
that it's zero-width.
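
A minimal sketch of the first two cases, in Python. The helper names
(starts_with_shebang, recipe_line_ok) are made up for illustration, and the
checks are simplified stand-ins for what a Unix kernel and make typically do
with the leading bytes, not their actual code:

```python
BOM_UTF8 = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def starts_with_shebang(data: bytes) -> bool:
    """True if the first two bytes are '#!' (simplified interpreter check)."""
    return data[:2] == b"#!"


def recipe_line_ok(line: bytes) -> bool:
    """True if a Makefile recipe line begins with a tab byte."""
    return line[:1] == b"\t"


script = b"#!/bin/sh\necho hello\n"
recipe = b"\tcc -c foo.c\n"

print(starts_with_shebang(script))             # True
print(starts_with_shebang(BOM_UTF8 + script))  # False: first bytes are EF BB
print(recipe_line_ok(recipe))                  # True
print(recipe_line_ok(BOM_UTF8 + recipe))       # False: tab is no longer first
```

The file looks identical in an editor that hides the BOM, which is the
zero-width problem.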

>Where do UTF-8 and BOM make things more or less complex than they used to
>be?

No one indiscriminately uses NBSP and expects it to be ignored. If you use
it, you expect it to be treated as a space. And you don't use it for most
things; a space works just as well in code, scripts and config files.

>Encoding-aware programs that "understand" Unicode should treat U+FEFF
>according to its literal meaning: "a non-breaking space having zero width".

And there are a lot of places in Unix where spaces are significant or
disallowed. The BOM threatens to place one at the start of random files,
from which it will spread haphazardly into the middle of other files.
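
A rough sketch of how that spreading happens, assuming two files that were
each saved with a UTF-8 signature and then joined byte-for-byte the way cat
does. The cat helper here is a hypothetical stand-in, and "utf-8-sig" is
Python's codec that drops one leading signature when decoding:

```python
import codecs


def cat(*parts: bytes) -> bytes:
    """Byte-wise concatenation, as cat(1) would do; no encoding awareness."""
    return b"".join(parts)


a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

merged = cat(a, b)

# A BOM-aware reader strips only the signature at the very start,
# so the second file's U+FEFF survives in the middle of the text.
text = merged.decode("utf-8-sig")
print(repr(text))  # 'first\n\ufeffsecond\n'
```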

I don't think flaming about a decision made 30 years ago is going to
change the fact that Unix has fundamentally different needs on this
issue than Windows, and if Windows is going to use a BOM, then text
is sometimes going to have to be converted between the two. (Yes, it
can be ignored when viewing the file. That's often true with CRLF stuff,
too.)
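
Such a conversion could be as small as the following sketch, much like a
CRLF/LF filter. The function names are hypothetical; only codecs.BOM_UTF8
comes from the Python standard library:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature, if present (Windows -> Unix direction)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def add_utf8_bom(data: bytes) -> bytes:
    """Prepend a UTF-8 signature unless one is already there (Unix -> Windows)."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


unix_text = b"#!/bin/sh\necho hello\n"
windows_text = add_utf8_bom(unix_text)

print(strip_utf8_bom(windows_text) == unix_text)  # True
```

Both directions are idempotent, so running the filter twice does no harm.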

--
David Starner - dstarner98@aasaa.ofe.org


