Re: UTF-8 signature in web and email

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Fri May 18 2001 - 16:08:13 EDT


At 10:58 PM -0400 5/17/01, DougEwell2@cs.com wrote:
>The "UTF-8 signature" discussion appears every few months on this list,
>usually as a religious debate between those who believe in it and those who
>do not. Be forewarned, my religion may not match yours. :-)

My religion suggests that we find common ground and not engage in rwars.

>Keld Jørn Simonsen wrote:
>
>> For UTF-8 there is no need to have a BOM, as there is only one
>> way of serializing octets in UTF-8. There is no little-endian
>> or big-endian. A BOM is superfluous and will be ignored.

You could say "should be ignored", but you can't speak for everybody
else's software.

>The debate is not about whether byte order needs to be specified in a UTF-8
>file (of course it doesn't) but whether U+FEFF should be used as a signature
>to identify the file as UTF-8, rather than some other byte-oriented encoding.

Which will only work if the software is ready to handle it.

>Martin Dürst wrote:
>
>> There is about 5% of a justification
>> for having a 'signature' on a plain-text, standalone file (the reason
>> being that it's somewhat easier to detect that the file is UTF-8 from the
>> signature than to read through the file and check the byte patterns
>> (which is an extremely good method to distinguish UTF-8 from everything
>> else)).
[snip]
OK, that's enough context.

Last year, and the year before that, we discussed the
possibility of defining some standard Unicode plain text formats. The
discussions foundered on the differences between text files meant for
people to read, such as e-mail, FAQs, and so on, and text files meant
for computers to process, such as delimited data files. We could not
agree, for example, whether a limit on line length was to be
required, permitted, or forbidden. We could not even agree that the
rules would be different for different cases, and that we would
attempt to enumerate the cases our standard would cover.

This BOM-as-signature debate is of the same type. Is it to be
required, permitted, forbidden, or something else? The short answer
is that there is no single answer. Users do not agree, and software
cannot be made to agree, even if a formal standard were created and
widely used.

Martin knows of no actual cases where a non-UTF-8 file could be
mistaken for UTF-8, so he says the signature is unnecessary, and goes
on to say that it is actually harmful. Specifically, he asks how all
Unix text-handling software could be made to work with a signature.
It can't all be changed, but here is a possible method for coping.

Create a filter that strips an initial signature from a text stream,
and passes the remainder through unchanged. You can be picky and make
it verify that the stream is in UTF-8, if you like.
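A minimal sketch of such a stripping filter, in Python. The function
name and the "verify" switch are my own invention for illustration,
not anything standardized; the picky mode simply lets Python's strict
UTF-8 decoder reject ill-formed input.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF as encoded in UTF-8


def strip_signature(data: bytes, verify: bool = False) -> bytes:
    """Drop one leading UTF-8 signature and pass the rest through unchanged.

    strip_signature and verify are hypothetical names for this sketch.
    """
    if data.startswith(UTF8_SIGNATURE):
        data = data[len(UTF8_SIGNATURE):]
    if verify:
        # The "picky" option: raises UnicodeDecodeError if the remaining
        # bytes are not well-formed UTF-8.
        data.decode("utf-8")
    return data
```

To use it in a pipeline, one could read sys.stdin.buffer, pass the
bytes through strip_signature, and write the result to
sys.stdout.buffer. Only the first occurrence is removed; any later
U+FEFF in the stream is left alone.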

Create a filter that adds a signature to the beginning of a text
stream, if it does not already have one. You can be picky, again.
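The adding filter can be sketched the same way. Again the names are
hypothetical, and the picky mode is an optional strict decode before
anything is emitted.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF as encoded in UTF-8


def add_signature(data: bytes, verify: bool = False) -> bytes:
    """Prepend a UTF-8 signature unless the data already starts with one.

    add_signature and verify are hypothetical names for this sketch.
    """
    if verify:
        # The "picky" option: refuse input that is not well-formed UTF-8.
        data.decode("utf-8")
    if data.startswith(UTF8_SIGNATURE):
        return data
    return UTF8_SIGNATURE + data
```

Because it checks for an existing signature first, running the filter
twice gives the same result as running it once.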

Create a filter that can identify character sets heuristically and
convert them to UTF-8.
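A rough sketch of the heuristic filter follows. It is deliberately
crude: it looks for a signature, then tries a strict UTF-8 decode
(the byte-pattern check Martin describes), and otherwise falls back
to a legacy encoding that the caller must supply. The ISO-8859-1
default is purely my assumption, UTF-32 is not handled, and a real
detector would need far more than this.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Return a Python codec name for data (hypothetical helper).

    The fallback default is an assumption for this sketch, not a
    recommendation.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the signature on decode
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads byte order from the BOM
    try:
        data.decode("utf-8")  # strict check of the UTF-8 byte patterns
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def to_utf8(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Convert data to UTF-8 without a signature, using the guess above."""
    return data.decode(guess_encoding(data, fallback)).encode("utf-8")
```

Pure ASCII input passes the UTF-8 check and comes out unchanged, which
is the behavior ordinary Unix utilities would want.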

Write your scripts carefully, so that you know when you are handling
text in unknown character sets, and apply these filters as needed.

Then ordinary Unix utilities will be fed data that they will not
choke on, in known encodings without extraneous non-text data.

In all other contexts, such as XML, if the standard allows for a
signature, fine, and if not, don't use one. If there is no standard,
you have to negotiate a private agreement if you want to send people
something out of the ordinary.

Another way to look at the matter is to say that plain text is plain,
and a signature is markup. Then a text file with a signature is, if
not rich text, at least above the poverty line.

-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland


