Re: UTF-8 signature in web and email

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Fri May 18 2001 - 17:10:22 EDT


michka

the only book on internationalization in VB at
http://www.i18nWithVB.com/

----- Original Message -----
From: "Edward Cherlin" <edward.cherlin.sy.67@aya.yale.edu>
To: <unicode@unicode.org>
Sent: Friday, May 18, 2001 1:08 PM
Subject: Re: UTF-8 signature in web and email

> At 10:58 PM -0400 5/17/01, DougEwell2@cs.com wrote:
> >The "UTF-8 signature" discussion appears every few months on this list,
> >usually as a religious debate between those who believe in it and those who
> >do not. Be forewarned, my religion may not match yours. :-)
>
> My religion suggests that we find common ground and not engage in rwars.
>
> >Keld Jørn Simonsen wrote:
> >
> >> For UTF-8 there is no need to have a BOM, as there is only one
> >> way of serializing octets in UTF-8. There is no little-endian
> >> or big-endian. A BOM is superfluous and will be ignored.
>
> You could say "should be ignored", but you can't speak for everybody
> else's software.
>
> >The debate is not about whether byte order needs to be specified in a UTF-8
> >file (of course it doesn't) but whether U+FEFF should be used as a signature
> >to identify the file as UTF-8, rather than some other byte-oriented encoding.
>
> Which will only work if the software is ready to handle it.
>
> >Martin Dürst wrote:
> >
> >> There is about 5% of a justification
> >> for having a 'signature' on a plain-text, standalone file (the reason
> >> being that it's somewhat easier to detect that the file is UTF-8 from the
> >> signature than to read through the file and check the byte patterns
> >> (which is an extremely good method to distinguish UTF-8 from everything
> >> else)).
> [snip]
> OK, that's enough context.
>
> Last year, and the year before that, we discussed the
> possibility of defining some standard Unicode plain text formats. The
> discussions foundered on the differences between text files meant for
> people to read, such as e-mail, FAQs, and so on, and text files meant
> for computers to process, such as delimited data files. We could not
> agree, for example, whether a limit on line length was to be
> required, permitted, or forbidden. We could not even agree that the
> rules would be different for different cases, and that we would
> attempt to enumerate the cases our standard would cover.
>
> This BOM-as-signature debate is of the same type. Is it to be
> required, permitted, forbidden, or something else? The short answer
> is No. Users do not agree, and software cannot be made to agree, not
> even if a formal standard were created and widely used.
>
> Martin knows of no actual cases where a non-UTF-8 file could be
> mistaken for UTF-8, so he says the signature is unnecessary, and goes
> on to say that it is actually harmful. Specifically, he asks how all
> Unix text-handling software could be made to work with a signature.
> It can't all be changed, but here is a possible method for coping.
>
> Create a filter that strips an initial signature from a text stream,
> and passes the remainder through unchanged. You can be picky and make
> it verify that the stream is in UTF-8, if you like.
>
> Create a filter that adds a signature to the beginning of a text
> stream, if it does not already have one. You can be picky, again.
>
> Create a filter that can identify character sets heuristically and
> convert them to UTF-8.
>
> Write your scripts carefully, so that you know when you are handling
> text in unknown character sets, and apply these filters as needed.
>
> Then ordinary Unix utilities will be fed data that they will not
> choke on, in known encodings without extraneous non-text data.
>
> In all other contexts, such as XML, if the standard allows for a
> signature, fine, and if not, don't use one. If there is no standard,
> you have to negotiate a private agreement if you want to send people
> something out of the ordinary.
>
>
> Another way to look at the matter is to say that plain text is plain,
> and a signature is markup. Then a text file with a signature is, if
> not rich text, at least above the poverty line.
> --
>
> Edward Cherlin
> Generalist
> "A knot!" exclaimed Alice. "Oh, do let me help to undo it."
> Alice in Wonderland
>
>
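
The first two filters Edward describes (strip an initial signature, add one
if it is missing) could be sketched roughly as follows. This is only an
illustration in Python, not anything posted to the list: the function names
strip_signature, add_signature and looks_like_utf8 are made up for this
example, the "picky" verification is modelled as an optional strict UTF-8
decode, and the whole input is held in memory rather than streamed. The
third filter, heuristic detection of other character sets, is not attempted
here.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
SIGNATURE = codecs.BOM_UTF8


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes are well-formed UTF-8 (the byte-pattern check)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_signature(data: bytes, verify: bool = False) -> bytes:
    """Drop one leading signature, if present; pass everything else through."""
    if verify and not looks_like_utf8(data):
        raise ValueError("input is not well-formed UTF-8")
    if data.startswith(SIGNATURE):
        return data[len(SIGNATURE):]
    return data


def add_signature(data: bytes, verify: bool = False) -> bytes:
    """Prepend a signature unless the data already starts with one."""
    if verify and not looks_like_utf8(data):
        raise ValueError("input is not well-formed UTF-8")
    if data.startswith(SIGNATURE):
        return data
    return SIGNATURE + data
```

Wrapped in a small script that reads sys.stdin.buffer and writes
sys.stdout.buffer, strip_signature could sit in front of ordinary Unix
utilities in a pipeline, and add_signature at the end for consumers that
expect the signature.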



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT