Re: UTF-8 signature in web and email

From: DougEwell2@cs.com
Date: Mon May 21 2001 - 11:39:08 EDT


In a message dated 2001-05-18 0:50:13 Pacific Daylight Time, duerst@w3.org
writes:

> People using this heuristic, who didn't really think it would
> work that well after the talk, have confirmed later that it
> actually works extremely well (and they were writing production
> code, not just testing stuff). On the other hand, I never met
> anybody who showed me an example where it actually didn't work.
> I would be interested to know about one if it exists.

Somebody on this list mentioned the rather contrived case of "NESTLÉ®", where
the Latin-1 bytes for LATIN CAPITAL LETTER E WITH ACUTE (0xC9) followed by
REGISTERED SIGN (0xAE) are also a valid UTF-8 sequence, for U+026E LATIN SMALL
LETTER LEZH.
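
The collision is easy to reproduce. What follows is only a minimal Python
sketch, not code from this thread; the variable names are made up for
illustration.

```python
import unicodedata

# "NESTLÉ®" as a Latin-1 (ISO 8859-1) byte string; it ends in 0xC9 0xAE.
latin1_bytes = "NESTL\u00c9\u00ae".encode("latin-1")

# Strict UTF-8 decoding of the same bytes succeeds, because 0xC9 0xAE is a
# well-formed two-byte UTF-8 sequence.
as_utf8 = latin1_bytes.decode("utf-8")

print(latin1_bytes[-2:].hex())        # c9ae
print(f"U+{ord(as_utf8[-1]):04X}")    # U+026E
print(unicodedata.name(as_utf8[-1]))  # LATIN SMALL LETTER LEZH
```

A validity-only heuristic would therefore call this Latin-1 string UTF-8.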

But in most real-world cases, the heuristic is in fact very good, and I do
use it in all detection cases where I have to read the entire file anyway.
The value of the signature comes when quick detection, without reading the
entire file, is needed.
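
To make the trade-off concrete, here is a rough Python sketch. The function
names are hypothetical, and the heuristic is assumed to mean "every byte
sequence in the data is well-formed UTF-8", which is my reading rather than a
definition anyone in this thread has given.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def has_utf8_signature(head: bytes) -> bool:
    """Quick check: only the first three bytes of the file are needed."""
    return head[:3] == UTF8_SIGNATURE


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic check: the whole buffer must be well-formed UTF-8.

    Pure ASCII also passes, which is harmless because ASCII reads the same
    either way.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The first function costs a three-byte read; the second has to look at every
byte before it can answer.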

> The use of the signature may be easier than the heuristic in particular
> if you want to know before reading a file what the encoding of the
> file is. But in most cases, you will want to convert it somehow,
> and in that case, it's easy to just read in bytes, and decide
> lazily (i.e. when seeing the first few high-octet bytes) whether
> to transcode the rest of the file e.g. as Latin-1 or as UTF-8.

That would certainly work, at least for those cases when you are converting
the file.
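
A lazy decision of that kind might look roughly like the Python sketch below.
The function name and the Latin-1 fallback are assumptions for illustration;
a real converter would probably look at more than the first high-octet
sequence before committing.

```python
def guess_from_first_high_bytes(data: bytes, fallback: str = "latin-1") -> str:
    """Guess the encoding from the first sequence that has the high bit set."""
    # Skip the leading ASCII run; it reads the same in either encoding.
    for i, byte in enumerate(data):
        if byte >= 0x80:
            break
    else:
        return "ascii"  # no high-octet bytes at all

    # The lead byte determines how long a UTF-8 sequence would have to be.
    lead = data[i]
    if 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        return fallback  # this byte can never start a UTF-8 sequence

    try:
        data[i:i + length].decode("utf-8")  # strict: rejects ill-formed input
    except UnicodeDecodeError:
        return fallback
    return "utf-8"
```

Note that the Latin-1 bytes of "NESTLÉ®" would still come out as "utf-8"
here, for the reason given above.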

> Also, the signature really only helps if you are only dealing with
> two different encodings, a single legacy encoding and UTF-8.
> The signature won't help e.g. to keep apart Shift_JIS, EUC, and
> JIS (and UTF-8), but the heuristics used for these cases can
> easily be extended to UTF-8.

Of course, a signature or other mechanism for encoding A can never be useful
for telling encoding B from encoding C. The heuristic for identifying UTF-8
is indeed similar to (although simpler than) those for East Asian multibyte
encodings, but the fact that those encodings do not have even the
*possibility* to benefit from a signature does not make a strong case against
using the UTF-8 signature.

>> Programs that go bonkers when handed a BOM need to be corrected
>> to conform to the intent of the UTC.
>
> This would mean changing all compilers, all other software dealing with
> formatted data, and so on, and all unix utilities from 'cat' upwards.
> In many cases, these applications and utilities are designed to work
> without knowing what the encoding is, they work on a byte stream.
> This makes it just impossible to conform to the above statement.
> If you have an idea how that can be solved, please tell us.

In the Windows world that I live in, we expect to update our compilers and
other tools every few years, for a variety of reasons (not all of which have
to do with marketing or planned obsolescence). This is both good and bad,
but in general it is just the way we tend to think. If upgrading a compiler
or similar tool is an extraordinary event for users of other systems, then
obviously UTF-8 signatures will cause problems -- but these programs will
also be unable to convert or otherwise interpret UTF-8, except to treat the
bytes as if they were in the native encoding.

Markus Kuhn, in his UTF-8 page, talks about "soft" and "hard" conversion and
the need to upgrade programs like wc that care what a "character" is.
Detecting a signature is a relatively "hard" conversion step.

> The problem goes even further. How should the 'signature' be handled
> in all the pieces of text data that may be passed around inside
> an application, or between applications, but not as files?
> Having to specify for each case who is responsible to add or
> remove the 'signature', and doing the actual work, is just crazy.

Here is what I do:

When reading, if a UTF-8 signature is encountered, strip it and treat it as a
promise that the file really is UTF-8. If no signature, apply the more
expensive heuristic.

When writing, add the signature unless I *know* the file is to be used by a
program that cannot handle it.
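
In Python, that policy could be sketched as follows. This is an illustration
only, not the code I actually run: the function names, the Latin-1 fallback,
and the consumer_handles_signature flag are all assumptions.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def decode_text(data: bytes, legacy_encoding: str = "latin-1") -> str:
    """Reading side: signature first, heuristic second."""
    if data.startswith(UTF8_SIGNATURE):
        # Strip the signature and take it as a promise of UTF-8. A broken
        # promise surfaces as UnicodeDecodeError.
        return data[len(UTF8_SIGNATURE):].decode("utf-8")
    try:
        # No signature: the expensive heuristic, i.e. validate everything.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback; the legacy encoding depends on the environment.
        return data.decode(legacy_encoding)


def encode_text(text: str, consumer_handles_signature: bool = True) -> bytes:
    """Writing side: add the signature unless the consumer is known to choke."""
    body = text.encode("utf-8")
    return UTF8_SIGNATURE + body if consumer_handles_signature else body
```

With these two functions, decode_text(encode_text(s)) gives back s, and
unsigned Latin-1 input such as b"caf\xe9" still decodes through the fallback.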

-Doug Ewell
 Fullerton, California


