Re: UTF-8 signature in web and email

From: Martin Duerst (duerst@w3.org)
Date: Fri May 18 2001 - 03:49:21 EDT


At 22:58 01/05/17 -0400, DougEwell2@cs.com wrote:
>Martin Dürst wrote:
>
> > There is about 5% of a justification
> > for having a 'signature' on a plain-text, standalone file (the reason
> > being that it's somewhat easier to detect that the file is UTF-8 from the
> > signature than to read through the file and check the byte patterns
> > (which is an extremely good method to distinguish UTF-8 from everything
> > else)).
>
>A plain-text file is more in need of such a signature than any other type of
>file. It is true that "fancy" text such as HTML or XML, which already has a
>mechanism to indicate the character encoding, doesn't need a signature, but
>this is not necessarily true of plain-text files, which will continue to
>exist for a long time to come.
>
>The strategy of checking byte patterns to detect UTF-8 is usually accurate,
>but may require that the entire file be checked instead of just the first
>three bytes. In his September 1997 presentation in San Jose, Martin conceded
>that "Because probability to detect UTF-8 [without a signature] is high, but
>not 100%, this is a heuristic method" and then spent several pages evaluating
>and refining the heuristics. Using a signature is not somewhat easier, it is
>*much* easier.

Sorry, but I think your summary here is a bit slanted.
I did indeed use several pages, but the main aim was to show that
in practice, the detection rate is virtually 100%, for many different
cases. People who used this heuristic, even though after the talk they
doubted it would work that well, have since confirmed that it actually
works extremely well (and they were writing production code, not just
testing things). On the other hand, I have never met anybody who could
show me an example where it actually failed. I would be interested to
hear about one if it exists.

I said 'high, but not exactly 100%' only because it was a technical
talk and not a marketing talk. Perhaps that wasn't easy for some of
the audience to understand. In practice, there is no actual need to
refine the heuristics.
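
As a rough illustration, here is a minimal sketch of such a
byte-pattern check in Python (simplified: a full validator would also
reject overlong and surrogate forms):

    def looks_like_utf8(data):
        # Every high-octet byte must fit the UTF-8 lead/continuation
        # structure; legacy encodings almost never do so by accident.
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                # plain ASCII byte
                i += 1
                continue
            if 0xC2 <= b <= 0xDF:       # lead byte of a 2-byte sequence
                need = 1
            elif 0xE0 <= b <= 0xEF:     # lead byte of a 3-byte sequence
                need = 2
            elif 0xF0 <= b <= 0xF4:     # lead byte of a 4-byte sequence
                need = 3
            else:                       # 0x80-0xC1, 0xF5-0xFF never lead
                return False
            if i + need >= len(data):   # sequence truncated at end of data
                return False
            for j in range(i + 1, i + need + 1):
                if not 0x80 <= data[j] <= 0xBF:   # continuation: 10xxxxxx
                    return False
            i += need + 1
        return True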

The use of the signature may be easier than the heuristic, in
particular if you want to know the encoding of a file before reading
it. But in most cases, you will want to convert the file somehow, and
in that case it's easy to just read in the bytes and decide lazily
(i.e. when seeing the first few high-octet bytes) whether to transcode
the rest of the file e.g. as Latin-1 or as UTF-8.
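
A minimal sketch of that lazy decision, reusing the looks_like_utf8()
check sketched above (the function names are mine, for illustration):

    def read_as_text(data):
        # Pure ASCII decodes identically either way; only the first
        # high-octet bytes force a decision between UTF-8 and Latin-1.
        if looks_like_utf8(data):
            return data.decode('utf-8')
        return data.decode('latin-1')    # legacy single-byte fallback

A truly streaming version would pass ASCII bytes through and buffer
only until the first high-octet byte appears, but the decision itself
is the same.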

Also, the signature really only helps if you are dealing with just
two encodings, a single legacy encoding and UTF-8. The signature
won't help e.g. to tell apart Shift_JIS, EUC, and JIS (and UTF-8),
but the heuristics used for these cases can easily be extended to
cover UTF-8.

> > - When producing UTF-8 files/documents, *never* produce a 'signature'.
> > There are quite a few receivers that cannot deal with it, or that deal
> > with it by displaying something. And there are many other problems.
>
>If U+FEFF is not interpreted as a BOM or signature, then by process of
>elimination it should be interpreted as a zero-width no-break space (ZWNBSP;
>more on this later). Any receiver that deals with a ZWNBSP by displaying a
>visible glyph is not very smart about the way it handles Unicode text, and
>should not be the deciding factor in how to encode it.

Display is not the only thing that can be done with a text file.
An XML processor that doesn't expect a signature in UTF-8 will
correctly reject the file if the signature comes before the XML
declaration. The same goes for many other formats and languages.
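
If a receiver wants to be defensive about such input anyway, stripping
a leading signature before parsing is simple; a sketch (whether a given
parser tolerates the signature varies, so stripping first is safe
either way):

    UTF8_SIG = b'\xef\xbb\xbf'

    def strip_signature(data):
        # Remove a leading UTF-8 signature so that strict processors
        # see the expected first bytes (e.g. '<?xml' for XML).
        if data.startswith(UTF8_SIG):
            return data[len(UTF8_SIG):]
        return data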

>What are the "many other problems"? Does this comment refer to programs and
>protocols that require their own signatures as the first few bytes of an
>input file (like shell scripts)? The Unicode Standard 3.0 explicitly states
>on page 325, "Systems that use the byte order mark must recognize that an
>initial U+FEFF signals the byte order; it is not part of the textual
>content." Programs that go bonkers when handed a BOM need to be corrected to
>conform to the intent of the UTC.

This would mean changing all compilers, all other software dealing
with formatted data, and so on, and all unix utilities from 'cat'
upwards. In many cases, these applications and utilities are designed
to work without knowing what the encoding is; they work on a byte
stream. That makes it simply impossible for them to conform to the
above statement. If you have an idea how that can be solved, please
tell us.
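
To make the 'cat' case concrete (the file contents here are made up):
concatenating two files that each carry a signature, the way a
byte-oriented tool does, strands the second signature in mid-text:

    part1 = b'\xef\xbb\xbfHello, '
    part2 = b'\xef\xbb\xbfworld\n'
    print(repr((part1 + part2).decode('utf-8')))
    # -> '\ufeffHello, \ufeffworld\n'
    # The second signature is now a stray U+FEFF in the middle of the
    # text; a tool working on a byte stream cannot know to remove it.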

The problem goes even further. How should the 'signature' be handled
in all the pieces of text data that are passed around inside an
application, or between applications, but not as files? Having to
specify, for each such case, who is responsible for adding or removing
the 'signature', and then actually doing that work, is just crazy.

Regards, Martin.


