Re: UTF-8 signature in web and email

From: David Starner (dstarner98@aasaa.ofe.org)
Date: Mon May 21 2001 - 15:29:26 EDT


At 11:39 AM 05/21/2001 -0400, DougEwell2@cs.com wrote:
>In the Windows world that I live in, we expect to update our compilers and
>other tools every few years, for a variety of reasons (not all of which have
>to do with marketing or planned obsolescence). This is both good and bad,
>but in general it is just the way we tend to think. If upgrading a compiler
>or similar tool is an extraordinary event for users of other systems, then
>obviously UTF-8 signatures will cause problems -- but these programs will
>also be unable to convert or otherwise interpret UTF-8, except to treat the
>bytes as if they were in the native encoding.

The assumption under Unix is that if you want to work with UTF-8 text files,
you will set the native encoding to UTF-8. At the fundamental heart of a Unix
system is passing arbitrary byte streams in highly flexible ways. If every
file starts with a signature, that becomes significantly more complex.
"cat a b > c" is going to end up with a BOM in the middle of c. cat could detect
the second signature and remove it . . . but then b is a byte stream (picture,
sound, noise) that happens to start with a BOM, and something breaks.
cat is a simple program that works simply. Adding or removing characters
breaks expectations. Should both stderr and stdout start with a BOM? Frequently,
they go to the same file, but often they don't. Should grep add the BOM?
What if you're dealing with binary data? What about
"grep Mom a > file; grep Dad a >> file"? A program can take data from stdin, so
you can't expect a BOM, but you can always trust the locale variables, and
the locale variables can be changed easily by the user.
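
To make the cat case concrete, here is a rough sketch in Python. It models
files as byte strings; cat_strip_bom is a hypothetical "BOM-aware" cat invented
purely for illustration, not anything a real cat does, and the sample contents
of a, b and the binary stream are made up.

```python
import codecs

BOM = codecs.BOM_UTF8  # the UTF-8 signature, bytes EF BB BF


def cat(*streams: bytes) -> bytes:
    """Plain byte-for-byte concatenation, which is all cat does."""
    return b"".join(streams)


def cat_strip_bom(*streams: bytes) -> bytes:
    """Hypothetical BOM-aware cat: drop a leading signature from every
    stream after the first. It cannot know whether those three bytes were
    a signature or real data."""
    out = list(streams[:1])
    for s in streams[1:]:
        out.append(s[len(BOM):] if s.startswith(BOM) else s)
    return b"".join(out)


# Two text files, each saved with a signature (assumed contents).
a = BOM + "Mom\n".encode("utf-8")
b = BOM + "Dad\n".encode("utf-8")

# "cat a b > c": the second signature lands in the middle of c.
c = cat(a, b)
bom_positions = [i for i in range(len(c)) if c.startswith(BOM, i)]

# A binary stream that merely happens to begin with EF BB BF.
binary = BOM + b"\x00\x01\x02"

# The BOM-aware variant fixes the text case but silently eats three
# bytes of the binary stream.
fixed_text = cat_strip_bom(a, b)
damaged = cat_strip_bom(a, binary)
```

With plain cat, c carries a signature at offset 0 and another at offset 7.
The stripping variant gives clean text for a and b, but the binary stream
comes out three bytes short, which is the "something breaks" case.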

Frankly, we've survived for several decades with multiple character sets and no
automatic detection. Adding a lot of complexity for automatic detection now,
just as we are trying to move to one character set, seems absurd.

-- 
David Starner - dstarner98@aasaa.ofe.org



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT