RE: UTF-8 signature in web and email

From: Marco Cimarosti (
Date: Tue May 22 2001 - 05:14:03 EDT

David Starner wrote:
> [...] At the fundamental heart of a Unix system is
> passing arbitrary byte streams in highly flexible
> ways. If every file starts with a signature then
> that makes that significantly more complex. [...]

You forget one fundamental thing about U+FEFF: it is not (only) a "byte
order mark" or an "encoding signature": it (also) is a "ZERO WIDTH NO-BREAK
SPACE".

I.e., it has been designed to be a white space, to not separate words, to
not constitute a line-break opportunity and, last but not least, to be
invisible (zero width).

In other words, if correctly implemented, it is a totally non-invasive
character: a very gentle little animal that should cause no harm to anybody.

So, it is true that, when you "cat a b > c", it may happen that a spurious
ZWNBSP can go somewhere in the middle of file "c".
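
To make the "cat a b > c" case concrete, here is a minimal Python sketch. The
contents of "a" and "b" are made up for illustration, and it assumes that both
files carry a UTF-8 signature; "cat" itself is modelled as plain byte
concatenation.

```python
import codecs

# Hypothetical contents of files "a" and "b": each one starts with the
# UTF-8 signature (the bytes EF BB BF, i.e. U+FEFF encoded in UTF-8).
a = codecs.BOM_UTF8 + "first\n".encode("utf-8")
b = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# "cat a b > c" just glues the bytes together; it knows nothing about
# signatures, so the one from "b" lands in the middle of "c".
c = a + b

# Decode as plain UTF-8 and list where U+FEFF shows up.
text = c.decode("utf-8")
positions = [i for i, ch in enumerate(text) if ch == "\ufeff"]
```

Under these assumptions the decoded text holds U+FEFF twice: once at the very
start, and once right after the first line, where file "b" began.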

Consider that, if a given OS normally uses no UTF-8 signatures, it is quite
unlikely that file "b" starts with ZWNBSP. The only case when this "problem"
occurs is when file "b" is imported from another OS.

But, even in this case, why should it be a problem to have ZWNBSP at
whatever position in a file? Why should *this* character be more of a problem
than SPACE, or TAB, or CARRIAGE RETURN, or COMMA, or you name it?

It only becomes a problem in the presence of one or more of these *bugs*:

1) It is *mandatory* to start a UTF-8 file with a ZWNBSP;

2) It is *forbidden* to have a ZWNBSP in the middle of the file;

3) ZWNBSP is *displayed* incorrectly (e.g. a black box instead of "nothing
at all");

4) ZWNBSP is given an incorrect *semantic* value (e.g., a C compiler does
not consider it as "white space").
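
As a rough illustration of a reader that avoids bugs 1 and 2, here is a small
Python sketch. The function names are hypothetical, and treating ZWNBSP as
ignorable before further processing is only one possible policy, not
something any standard prescribes here.

```python
ZWNBSP = "\ufeff"

def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8 whether or not it starts with a signature."""
    # The "utf-8-sig" codec drops one leading signature if it is there,
    # but does not require it (no bug 1). A U+FEFF found anywhere else
    # is decoded like any other character (no bug 2).
    return data.decode("utf-8-sig")

def drop_zwnbsp(text: str) -> str:
    """Remove every remaining ZWNBSP, e.g. before tokenizing."""
    # ZWNBSP has no width and does not separate words, so under this
    # policy deleting it leaves the visible text unchanged.
    return text.replace(ZWNBSP, "")

# Hypothetical inputs: without signature, with signature, and with a
# stray ZWNBSP in the middle (as left behind by "cat a b > c").
plain = decode_utf8_lenient(b"abc")
signed = decode_utf8_lenient(b"\xef\xbb\xbfabc")
middle = decode_utf8_lenient(b"ab\xef\xbb\xbfcd")
cleaned = drop_zwnbsp(middle)
```

With such a reader, the signed and unsigned inputs decode to the same text,
and the stray ZWNBSP in the middle is carried along harmlessly until it is
dropped.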

But, then, why blame ZWNBSP? Fix the bug(s)!

_ Marco
