Re: UTF-8 signature in web and email

From: Michael (michka) Kaplan
Date: Wed May 16 2001 - 07:33:57 EDT

Most likely, "Dr. International", who sent the mail from Microsoft, did so
from his/her Outlook machine (probably Outlook 2002; I do not think Outlook
2000 ever did this). Perhaps someone could follow up with the Outlook folks
on their decision to include a BOM at the beginning of the UTF-8 section of
HTML mail? Assuming that Dr. International is on the Unicode List, he/she
might be the best person to follow up!

Clearly there is no standard suggesting such a thing, and while I do see
Martin's suggestions below as something of a reversal from other people's
ideas of best practices, the BOM for UTF-8 and other encodings is clearly
intended for cases of plain text, not text that has a higher-level protocol
that contains encoding information.


Michael Kaplan
Trigeminal Software, Inc.

----- Original Message -----
From: "Martin Duerst" <>
To: "Roozbeh Pournader" <>; "Unicode List"
<>; <>
Sent: Tuesday, May 15, 2001 6:55 PM
Subject: Re: UTF-8 signature in web and email

> Hello Roozbeh
> At 04:02 01/05/15 +0430, Roozbeh Pournader wrote:
> >Well, I received a UTF-8 email from Microsoft's Dr International today.
> >It was a "multipart/alternative", with both the "text/plain" and "text/html"
> >in UTF-8. Well, nothing interesting yet, but the interesting point was
> >that the HTML version had a UTF-8 signature, but the text version lacked
> >it. So, the HTML version had it three times: mime charset as UTF-8,
> >UTF-8 signature, and <meta> charset markup.
> This is definitely overblown. There is about 5% of a justification
> for having a 'signature' on a plain-text, standalone file (the reason
> being that it's somewhat easier to detect that the file is UTF-8 from the
> signature than to read through the file and check the byte patterns
> (which is an extremely good method to distinguish UTF-8 from everything
> else)). For self-labeled data (HTML, XML, CSS) and in the context
> of MIME (with the charset parameter), a UTF-8 signature doesn't
> make sense at all.
> >Questions:
> >
> >1. What are the current recommendations for these?
> - When producing UTF-8 files/documents, *never* produce a 'signature'.
> There are quite a few receivers that cannot deal with it, or that deal
> with it by displaying something. And there are many other problems.
> - When receiving UTF-8, you probably should check for a 'signature'
> and remove it. There are too many applications that send one out,
> unfortunately.
> >2. Most important of all, does W3C allow UTF-8 signatures before
> >"<!DOCTYPE>"? And if yes, what should be done if they mismatch the
> >charset as can be described in the <meta> tag?
> For text/html, neither the HTML spec nor the IETF definition of UTF-8
> (RFC 2279) says anything as far as I know. The reason was that nobody
> thought about a UTF-8 signature at that time.
> For XML, the 'signature' is now listed in App F.1
> But this is not normative, and fairly recent, and so you should never
> expect an XML processor to accept it (except as a plain character
> in the file when there is no XML declaration).
> Regards, Martin.
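
For anyone who wants to try the receiving-side advice quoted above, here is a
minimal sketch in Python. It is only an illustration and does not come from
either message: the function names are hypothetical, and it assumes the
content arrives as raw bytes that some higher-level label (MIME charset,
<meta>) already claims to be UTF-8.

```python
import codecs

# The UTF-8 "signature" is U+FEFF encoded as UTF-8: the bytes EF BB BF.
UTF8_SIGNATURE = codecs.BOM_UTF8


def strip_utf8_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature, if one is present."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


def looks_like_utf8(data: bytes) -> bool:
    """Check the byte patterns: True if data is well-formed UTF-8.

    Pure ASCII also passes, since ASCII is a subset of UTF-8.
    """
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


def receive_text(data: bytes) -> str:
    """Drop any signature, then decode.

    Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    """
    return strip_utf8_signature(data).decode("utf-8")
```

Python's standard "utf-8-sig" codec does the same stripping on decode, so
data.decode("utf-8-sig") could replace receive_text. On the producing side,
text.encode("utf-8") never writes a signature, which matches the "never
produce" recommendation.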
