Re: pre-HTML5 and the BOM

From: Philippe Verdy <>
Date: Tue, 17 Jul 2012 03:47:44 +0200

2012/7/15 David Starner <>:
> /tmp $ echo -n a > file1
> /tmp $ echo b > file2
> /tmp $ cat file1 file2 > file3
> /tmp $ echo "ab" | diff -q - file3

Once again, the problem is the /bin/cat tool, which is used for
everything and is agnostic about preserving text semantics. Using
another cat that is Unicode-aware would solve the problem.
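
As a rough illustration of what such a cat might do, here is a minimal
Python sketch. The names naive_cat and bom_aware_cat are hypothetical,
and the policy shown (keep a UTF-8 BOM only at the very start of the
output, drop it from every later input) is just one assumption about
what "Unicode-aware" could mean here, not an existing tool's behavior.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def naive_cat(chunks):
    """Byte-for-byte concatenation, which is what /bin/cat does."""
    return b"".join(chunks)


def bom_aware_cat(chunks):
    """Hypothetical BOM-aware concatenation (assumed policy):
    a BOM is kept only if it starts the first input."""
    out = []
    for index, chunk in enumerate(chunks):
        if index > 0 and chunk.startswith(BOM):
            chunk = chunk[len(BOM):]  # drop the BOM of a later input
        out.append(chunk)
    return b"".join(out)


# file1 and file2 from the quoted transcript, with a BOM added to file2
file1 = b"a"
file2 = BOM + b"b\n"

naive = naive_cat([file1, file2])      # BOM ends up between "a" and "b"
aware = bom_aware_cat([file1, file2])  # BOM of file2 is dropped
```

With these inputs the naive result carries the BOM in the middle of the
text, while the BOM-aware result is the plain bytes of "ab" followed by
a newline.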

The same goes for diff, which, however, is designed to work only with
text files and should be Unicode-aware by default.
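
A comparison along those lines could be sketched as follows, again in
Python. The helper same_text is hypothetical, and it assumes that both
inputs are valid UTF-8 and that every U+FEFF may be ignored wherever it
occurs. That last assumption is exactly the point under dispute in this
thread, so this is an illustration of the idea rather than a
recommendation.

```python
def strip_bom_chars(data: bytes) -> str:
    """Decode UTF-8 and drop every U+FEFF (assumed ignorable here)."""
    return data.decode("utf-8").replace("\ufeff", "")


def same_text(left: bytes, right: bytes) -> bool:
    """Hypothetical BOM-insensitive equality test for two UTF-8 texts."""
    return strip_bom_chars(left) == strip_bom_chars(right)


expected = b"ab\n"              # what echo "ab" produces
file3 = b"a\xef\xbb\xbfb\n"     # file3 when file2 starts with a BOM

bytewise_equal = expected == file3          # comparison on raw bytes
textwise_equal = same_text(expected, file3)  # comparison ignoring U+FEFF
```

The byte-wise comparison reports a difference, while the BOM-insensitive
one treats the two texts as equal.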

Maybe there should be a new Unix standard providing a /ubin/ directory
for Unicode-aware tools, which users could add to their PATH
environment if needed. That would allow migration to newer standards.

> This is expected behavior, and with if statements is probably done by
> thousands of scripts. Add a hidden BOM at the start of file2 and this
> whole thing breaks, as diff is going to find them different. Again,
> diff is an ancient tool that deals with all sorts of text, quasi-text
> and binary matter, and frankly a<BOM>b is different from ab. If we're
> building a C file with Unix tools, if a char *c = "ab"; suddenly
> becomes char *c = "<BOM>ab"; i don't know by what semantics you expect
> that to work the same. And "the very model of a default ignorable code
> point" is likely to be the very model of a bug that will hide in plain
> sight.
Received on Mon Jul 16 2012 - 20:50:42 CDT
