Re: pre-HTML5 and the BOM

From: David Starner <prosfilaes_at_gmail.com>
Date: Sat, 14 Jul 2012 15:23:33 -0700

On Sat, Jul 14, 2012 at 1:57 PM, Doug Ewell <doug_at_ewellic.org> wrote:
> We've been hearing the story about hashbang for many, many years now, and I
> still don't understand why the following logic hasn't been made part of the
> low-level I/O process in such environments:
>
> "When reading a text file that could be UTF-8 or some other ACE, if the
> first three bytes of the file are EF BB BF, discard them and assume the file
> is UTF-8."

Low-level I/O in Unix doesn't know the difference between a text file
and a binary file. Even ignoring backward compatibility issues, to do
that would take a total rewrite of basic I/O. There is not, and almost
certainly never will be, one unified layer of I/O on Unix that
speaks Unicode. If you want that, you will use Python or Java or ICU
or some other Unicode-aware platform or library.
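
For what it's worth, here is a minimal sketch of the quoted logic done in
a Unicode-aware layer instead of in low-level I/O, assuming Python. The
helper name decode_text and the latin-1 fallback are made up for
illustration; only codecs.BOM_UTF8 and bytes.decode come from the
standard library.

```python
import codecs


def decode_text(data: bytes, fallback: str = "latin-1") -> str:
    """Hypothetical helper: if the data starts with EF BB BF, discard
    those bytes and assume UTF-8; otherwise use the fallback encoding."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: the caller has to know (or guess) the encoding.
    return data.decode(fallback)


script_with_bom = codecs.BOM_UTF8 + b"#!/bin/sh\n"
script_without_bom = b"#!/bin/sh\n"

# Both decode to the same text once the leading BOM is discarded.
print(decode_text(script_with_bom) == decode_text(script_without_bom, "utf-8"))
```

Note that this only works because the caller has already decided the
bytes are text, which is exactly the decision read(2) cannot make.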

The specific hashbang feature could be changed to accept BOMs, but as
usual it's a bit of a Catch-22. Unix programs don't generate BOMs, so
it's not usually a concern. Most Unix people don't particularly want
UTF-8 BOMs, so the problem comes up rarely and in a behavior pattern
we want to discourage. It's not a problem in practice, and we don't
want it to become a problem in practice, so it remains a problem in
theory.

On Sat, Jul 14, 2012 at 2:14 PM, Doug Ewell <doug_at_ewellic.org> wrote:
> A related question, though, is why some people think the sky will fall if a
> text file contains loose zero-width no-break spaces. U+FEFF is the very
> model of a default ignorable code point.

/tmp $ echo -n a > file1
/tmp $ echo b > file2
/tmp $ cat file1 file2 > file3
/tmp $ echo "ab" | diff -q - file3
/tmp $

This is expected behavior, and thousands of scripts probably rely on
it, usually wrapped in an if statement. Add a hidden BOM at the start
of file2 and the whole thing breaks, because diff will find the inputs
different. Again,
diff is an ancient tool that deals with all sorts of text, quasi-text
and binary matter, and frankly a<BOM>b is different from ab. If we're
building a C file with Unix tools and char *c = "ab"; suddenly
becomes char *c = "<BOM>ab"; I don't know by what semantics you expect
that to work the same. And "the very model of a default ignorable code
point" is likely to be the very model of a bug that will hide in plain
sight.
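
The same failure can be sketched at the byte level, assuming Python and
modelling cat as plain concatenation. The variable names and the cat
function are illustrative only; file2_bom stands in for a file2 saved by
an editor that writes a UTF-8 BOM.

```python
import codecs

file1 = b"a"                          # echo -n a > file1
file2_plain = b"b\n"                  # echo b > file2
file2_bom = codecs.BOM_UTF8 + b"b\n"  # same text, with a leading BOM

expected = b"ab\n"                    # echo "ab"


def cat(*parts: bytes) -> bytes:
    """Concatenate byte strings without interpreting them, as cat does."""
    return b"".join(parts)


print(cat(file1, file2_plain) == expected)  # the comparison diff makes
print(cat(file1, file2_bom) == expected)
print(cat(file1, file2_bom))
```

This prints True, then False, then b'a\xef\xbb\xbfb\n': the BOM ends up
in the middle of the data, where nothing is going to strip it.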

-- 
Kie ekzistas vivo, ekzistas espero.
Received on Sun Jul 15 2012 - 19:04:14 CDT
