Re: UTF-8N?

From: Doug Ewell
Date: Wed Jun 21 2000 - 00:27:35 EDT

Juliusz Chroboczek wrote:

> I have the impression that we basically agree, except that you expect
> the system to reliably keep track of file types, and I don't.
> PC> type of object which, when assembled in accordance with that
> PC> protocol, can produce a plain text file.
> PC> We shouldn't mix up the use of the BOM and protocols that are not
> PC> directly related to Unicode.
> So you have two distinct operations, ``split a text file'' and ``split
> an octet file''. Symmetrically, ``concatenate text files'' and
> ``concatenate octet files.'' If your splitting and concatenating
> operations mismatch, you die.
> Of course, no mismatch happens if the OS keeps track of file types.
> Splitting in the octet manner a text/plain file leads to two
> octet-stream files, and the OS should ensure that you cannot merge
> them in the wrong way.

On the contrary, I thought Peter's point was that the OS (or the split/
merge programs) should *not* make any special assumptions about text
files. Specifically, they should not insert or strip BOMs on the
assumption that they are dealing with a Unicode text file, any more than
they would for any other type of file.
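
To make the "generic file of bytes" idea concrete, here is a minimal
sketch in Python of a split/merge pair that never looks at the content.
The names split_bytes and merge_bytes are hypothetical, chosen only for
illustration. Because neither function inserts or strips anything, a
leading BOM (or any other byte sequence) survives the round trip, even
when a chunk boundary falls in the middle of it.

```python
from typing import Iterable, List


def split_bytes(data: bytes, chunk_size: int) -> List[bytes]:
    """Split data into chunks of at most chunk_size bytes.

    The content is never inspected: no BOM is added to or removed
    from any chunk. (Hypothetical helper, for illustration only.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def merge_bytes(chunks: Iterable[bytes]) -> bytes:
    """Concatenate chunks exactly as given, byte for byte."""
    return b"".join(chunks)


# A UTF-8 file that happens to start with U+FEFF (bytes EF BB BF).
original = "\ufeffhello".encode("utf-8")
parts = split_bytes(original, 2)   # first chunk cuts the BOM in half
restored = merge_bytes(parts)      # identical to the original bytes
```

The point of the sketch is only that an octet-level split followed by an
octet-level merge is lossless without any knowledge of Unicode.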

As long as we are talking about files... not generic "streams"... and
as long as programs that split and merge them, compress and decompress
them, encrypt and decrypt them, or whatever, treat them as generic files
of bytes and do not attempt to treat them specially because they are
Unicode text, then I agree with Antoine that the presence or absence of
a BOM is seldom going to be a life-or-death matter in the real world.

Think about the meaning of ZERO-WIDTH NO-BREAK SPACE. It doesn't cause
a visible space, and it doesn't cause (indeed, it inhibits) a logical
space between words. How likely is it that this character, with its
ZWNBSP meaning, is going to appear at the beginning of a UTF-8 text
file? The ZWNBSP functionality seems useful only *between* other
characters. No, the great likelihood is that any U+FEFF appearing as
the first character in a file will be intended as a byte order mark.

If I were modifying an existing program to understand UTF-8, I would
want to make it flexible as to the presence or absence of initial
U+FEFF. That is, if my program insists that a certain file begin with
the characters "#!" (U+0023 U+0021), and I want this file to be encoded
in UTF-8, then I would want to modify the program so that the file could
optionally begin with U+FEFF U+0023 U+0021 instead.
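
As a rough illustration of that kind of flexibility, the check might
look something like the Python sketch below. The function name
starts_with_shebang is hypothetical; the only facts it relies on are
that U+FEFF is encoded in UTF-8 as the three bytes EF BB BF and that
"#!" is the two bytes 23 21.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def starts_with_shebang(data: bytes) -> bool:
    """Return True if data begins with "#!", optionally preceded by
    a single UTF-8 encoded U+FEFF.

    (Hypothetical helper, for illustration only.)
    """
    if data.startswith(UTF8_BOM):
        # Skip exactly one leading BOM, then test what follows.
        data = data[len(UTF8_BOM):]
    return data.startswith(b"#!")
```

Under this sketch, both b"#!/bin/sh" and b"\xef\xbb\xbf#!/bin/sh" are
accepted, while a file with anything else in front of the "#!" is not.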

It may be useful shorthand to define the term "UTF-8N" to refer to UTF-8
text that does not begin with a BOM, and reserve the term "UTF-8" for
text that *does* begin with a BOM, but the fact is that both are really
UTF-8, and people will use the term "UTF-8" to refer to both. Adding
(let alone registering) a new charset name to express this relatively
minor difference will make it look (as it does to Juliusz) like there
are more Unicode encoding forms than there really are.
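
One way to see that the two are the same encoding form is that a single
decoder can handle both. The sketch below uses Python's standard
"utf-8-sig" codec, which skips one leading BOM if it is present and
otherwise behaves like plain UTF-8; the wrapper name decode_utf8_either
is hypothetical.

```python
def decode_utf8_either(data: bytes) -> str:
    """Decode UTF-8 bytes with or without a leading BOM.

    The "utf-8-sig" codec drops one initial U+FEFF if present;
    everything after it is decoded as ordinary UTF-8.
    (Hypothetical wrapper, for illustration only.)
    """
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbfabc"
without_bom = b"abc"

# Both inputs yield the same text.
same_text = decode_utf8_either(with_bom) == decode_utf8_either(without_bom)

# The plain "utf-8" codec decodes the same bytes too; it simply keeps
# the U+FEFF as the first character.
kept = with_bom.decode("utf-8")
```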

-Doug Ewell
 Fullerton, California
