Re: UTF-8 signature in web and email

From: DougEwell2@cs.com
Date: Thu May 17 2001 - 22:58:07 EDT


The "UTF-8 signature" discussion appears every few months on this list,
usually as a religious debate between those who believe in it and those who
do not. Be forewarned, my religion may not match yours. :-)

Keld Jørn Simonsen wrote:

> For UTF-8 there is no need to have a BOM, as there is only one
> way of serializing octets in UTF-8. There is no little-endian
> or big-endian. A BOM is superfluous and will be ignored.

The debate is not about whether byte order needs to be specified in a UTF-8
file (of course it doesn't) but whether U+FEFF should be used as a signature
to identify the file as UTF-8, rather than some other byte-oriented encoding.
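To make the signature idea concrete (my own illustration, not part of the original post): U+FEFF encoded in UTF-8 is the three-byte sequence EF BB BF, so a reader can recognize a signed UTF-8 file from its first three bytes alone.

```python
# Minimal sketch: recognize the UTF-8 signature (U+FEFF encoded in UTF-8).
UTF8_SIG = b"\xEF\xBB\xBF"

def has_utf8_signature(data: bytes) -> bool:
    """Return True if the byte stream begins with the UTF-8 signature."""
    return data.startswith(UTF8_SIG)

print(has_utf8_signature("\uFEFFhello".encode("utf-8")))  # True
print(has_utf8_signature(b"hello"))                       # False
```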

Martin Dürst wrote:

> There is about 5% of a justification
> for having a 'signature' on a plain-text, standalone file (the reason
> being that it's somewhat easier to detect that the file is UTF-8 from the
> signature than to read through the file and check the byte patterns
> (which is an extremely good method to distinguish UTF-8 from everything
> else)).

A plain-text file is more in need of such a signature than any other type of
file. It is true that "fancy" text such as HTML or XML, which already has a
mechanism to indicate the character encoding, doesn't need a signature, but
this is not necessarily true of plain-text files, which will continue to
exist for a long time to come.

The strategy of checking byte patterns to detect UTF-8 is usually accurate,
but may require that the entire file be checked instead of just the first
three bytes. In his September 1997 presentation in San Jose, Martin conceded
that "Because probability to detect UTF-8 [without a signature] is high, but
not 100%, this is a heuristic method" and then spent several pages evaluating
and refining the heuristics. Using a signature is not merely somewhat
easier; it is *much* easier.
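The contrast can be sketched as follows (a hypothetical illustration of the heuristic, not Martin's actual code): the byte-pattern check must validate every multi-byte sequence in the file, whereas the signature check above reads only three bytes.

```python
# Sketch of the heuristic: decide whether a byte stream is UTF-8 by
# checking its multi-byte patterns -- this must scan the whole file.
def looks_like_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8", errors="strict")  # validates every sequence
        return True
    except UnicodeDecodeError:
        return False

# Usually right, but not 100%: here 0xE9 (Latin-1 'é') is an invalid
# lone lead byte, so the heuristic correctly rejects the second input.
print(looks_like_utf8("héllo".encode("utf-8")))    # True
print(looks_like_utf8("héllo".encode("latin-1")))  # False
```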

> - When producing UTF-8 files/documents, *never* produce a 'signature'.
> There are quite some receivers that cannot deal with it, or that deal
> with it by displaying something. And there are many other problems.

If U+FEFF is not interpreted as a BOM or signature, then by process of
elimination it should be interpreted as a zero-width no-break space (ZWNBSP;
more on this later). Any receiver that deals with a ZWNBSP by displaying a
visible glyph is not very smart about the way it handles Unicode text, and
should not be the deciding factor in how to encode it.

What are the "many other problems"? Does this comment refer to programs and
protocols that require their own signatures as the first few bytes of an
input file (like shell scripts)? The Unicode Standard 3.0 explicitly states
on page 325, "Systems that use the byte order mark must recognize that an
initial U+FEFF signals the byte order; it is not part of the textual
content." Programs that go bonkers when handed a BOM need to be corrected to
conform to the intent of the UTC.
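In code, conformance to that passage amounts to stripping an initial U+FEFF before handing the text onward, rather than treating it as content (a minimal sketch of my own; Python happens to ship a "utf-8-sig" codec that behaves this way):

```python
# Sketch: a conforming reader treats an initial U+FEFF as a signature,
# not as textual content, and removes it before further processing.
def read_text(data: bytes) -> str:
    text = data.decode("utf-8")
    if text.startswith("\uFEFF"):
        text = text[1:]  # signature: "not part of the textual content"
    return text

print(read_text(b"\xEF\xBB\xBFhello"))  # prints "hello"
```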

> For XML, the 'signature' is now listed in App F.1
> http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
> But this is not normative, and fairly recent, and so you should never
> expect an XML processor to accept it (except as a plain character
> in the file when there is no XML declaration).

Everything about XML is "fairly recent." And again, current versions of
applications that are slightly broken in their handling of legitimate Unicode
characters should not dictate the way Unicode is to be used.

In the C9X charter, the base document for the revision of the C programming
language, the #1 guiding principle was "Existing code is important, existing
implementations are not." In the Unicode context, "code" is textual Unicode
data and "implementations" are browsers, XML processors, and such.
Implementations will in time be upgraded to provide better Unicode support.
The techniques used to encode text in UTF-8 should not be dependent on the
current imperfection of today's implementations.

Back on 2000-06-22, in the thread "Re: UTF-8N?", Ken Whistler pointed out
that the real problem with the UTF-8 signature/BOM was that its functionality
had been "bizarrely unified" with that of the ZWNBSP, and noted that the new
character U+2060 WORD JOINER would be introduced in Unicode 3.2 to take over
the ZWNBSP duties from U+FEFF. Indeed, the proposed Unicode 3.2 code chart
(available at http://www.unicode.org/charts/draftunicode32/U32-2000.pdf)
describes the WORD JOINER explicitly as "intended for disambiguation of
functions for BOM."

What all this means is that the UTC is committed to preserving the utility of
U+FEFF as a byte order mark, and by extension a signature. As Marco
Cimarosti observed, the FAQ on the Unicode Web site describes the use of the
BOM as a signature for "otherwise unmarked" UTF-8 text files, without once
deprecating or discouraging that usage.

The possibility of confusion over interpreting an initial U+FEFF as BOM or
ZWNBSP absolutely should NOT be a justification for discouraging the BOM.
The sole purpose of a zero-width no-break space -- regardless of where or how
encoded -- is to divide two lexical units logically without rendering a
visible space or line break. When would such a character ever be appropriate
as the first character of a text stream? What would it divide?

"But what about a process that breaks a text stream into chunks and, say,
transmits the chunks down a wire? You can't depend on the meaning of an
'initial' U+FEFF then." That's true, but any process that deals with data in
this manner should not be interpreting or modifying it anyway. Imagine the
damage that could be caused to CR/LF pairs that were inadvertently separated
into two chunks.
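The CR/LF analogy can be made concrete (a contrived split of my own choosing): a chunker that interpreted each chunk in isolation would see a bare CR at the end of one chunk and a bare LF at the start of the next, and any per-chunk "normalization" would corrupt the text.

```python
# Sketch: naive per-chunk interpretation breaks even ASCII conventions.
stream = b"line one\r\nline two"
chunk_a, chunk_b = stream[:9], stream[9:]  # split lands between CR and LF

# chunk_a ends with a lone CR and chunk_b begins with a lone LF; a
# process that rewrote line endings per chunk would damage the data,
# just as one that stripped a "BOM" from a mid-stream chunk would.
print(chunk_a.endswith(b"\r"), chunk_b.startswith(b"\n"))  # True True
```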

Ken wrote in his 2000-06-22 message, "If you can at all help it, start
refraining now from using U+FEFF as a zero-width non-breaking space," and I
seriously doubt that many applications have been doing this in any case,
compared to the number that use U+FEFF as a signature or byte-order mark.

I believe there is a common thread between this topic and the topic of Plane
14 tags (although I have pretty much conceded defeat on that one), namely
that there are those who believe a certain, limited amount of metadata is
appropriate in plain-text files, and those who believe that all metadata
should reside in a higher-level format or maybe that plain-text files are
irrelevant in the 21st century. In the case of UTF-8 signatures, I hope
there is some popular support for the notion that the U+FEFF signature is
more beneficial than harmful.

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT