RE: UTF-8 in email

From: Murray Sargent (murrays@microsoft.com)
Date: Fri Oct 16 1998 - 16:36:28 EDT


As Markus points out, when MIME is active you don't need the BOM. The BOM
is only needed in simple plain-text files, i.e., not in HTML, RTF, or other
files employing a higher-level protocol. The key point is that a text
program needs to know which character set to use. With higher-level
protocols, the answer should be embedded in the appropriate control words or
tags. In the absence of such protocols, e.g., plain-text files on disk, the
BOM is a very simple way to identify which Unicode encoding is used by a
file. It's true that this simple convention only works with Unicode files
(either UTF-16 or UTF-8), so other plain-text files, e.g., those encoded in
Shift-JIS, are still problematic. But at least with Unicode files, a text
program can know what charset to use without resorting to complicated and
sometimes error-prone heuristics.
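
A minimal sketch of the BOM check described above, in Python. The function
names and the latin-1 fallback are assumptions chosen for illustration; a real
program would substitute its own default charset or heuristics when no BOM is
found.

```python
import codecs

# BOM signatures for the Unicode encodings discussed here (UTF-8 and UTF-16).
# Other encodings, e.g. UTF-32, are deliberately left out of this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM leads the data."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_plain_text(data: bytes, fallback: str = "latin-1") -> str:
    """Decode plain text, using the BOM if present and a fallback charset otherwise."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        # No BOM: the charset is unknown, so this is only a guess.
        return data.decode(fallback)
    # Drop the BOM bytes before decoding so they do not appear in the text.
    return data[skip:].decode(encoding)
```

Without a BOM the function can only guess, which is exactly the case where the
heuristics mentioned above would otherwise have to take over.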

So it's worth getting to like the BOM for reasons quite different from its
original purpose :-) If you've ever written a text editor or word
processing program that deals with multiple file formats, you'll surely know
what I mean.

Btw, the recommendation was adopted at the last UTC meeting. You can still
create plain-text UTF-8 files without the leading BOM. But they might not
get read correctly by the software out there...
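
As a rough illustration of reading both kinds of UTF-8 file, Python's standard
"utf-8-sig" codec writes the leading BOM when encoding and skips it, if
present, when decoding. The two helper names below are hypothetical.

```python
def to_file_bytes(text: str) -> bytes:
    """Encode text as UTF-8 with a leading BOM (EF BB BF)."""
    return text.encode("utf-8-sig")


def from_file_bytes(data: bytes) -> str:
    """Decode UTF-8 data, discarding one leading BOM if there is one."""
    return data.decode("utf-8-sig")
```

A reader written this way accepts UTF-8 files with or without the BOM; a reader
that relies on the BOM alone would not recognize the BOM-less ones.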

Thanks
Murray

> -----Original Message-----
> From: Markus Kuhn [SMTP:Markus.Kuhn@cl.cam.ac.uk]
> Sent: Friday, October 16, 1998 7:02 AM
> To: Unicode List
> Subject: Re: UTF-8 in email
>
> Murray Sargent wrote on 1998-10-16 00:23 UTC:
> > Donald Page wrote:
> > > The above attachment should contain all of the Minimum European Subset
> > > encoded as UTF-8. I created it for my own testing, but feel free to use
> > > it.
> > Donald's UTF-8 file should begin with a UTF-8 BOM in order to identify it
> > as a UTF-8 encoded file. The starting bytes should be 0xEF 0xBB 0xBF.
>
> No. The MIME attachment should just contain the header line
>
> Content-Type: text/plain; charset=UTF-8
>
> as specified in RFC 2044, and then the receiving email client should
> know how to activate the UTF-8 decoder and how to select an appropriate
> font. Most developers of email clients still have to add a bit here to
> get this running as it is supposed to work.
>
> I do not like BOMs. The whole beauty of UTF-8 is that it is stateless,
> and introducing Byte-Order-Marker-Hacks destroys this. What happens to
> BOMs in a cut&paste context? It just creates a mess.
>
> If you want to switch properly between different encodings, then use
> established complete mechanisms like the MIME charset identifier or the
> ISO 2022 ESC sequences. BOMs are just an ugly hack.
>
> > These bytes are discarded when reading the file in and added when
> > writing the file out.
>
> I am not sure what exactly you mean, but I hope it is the following: If
> you are working on an unfortunate platform that requires BOMs in all
> UTF-8 files, then the email software on that platform should prefix the
> BOM to a file whenever a MIME text/plain UTF-8 body part is saved into a
> file. If a file starting with a BOM is attached to an email as a
> text/plain file, then the BOM should be stripped off and the MIME
> charset=UTF-8 header should be added.
>
> Markus
>
> --
> Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK
> email: mkuhn at acm.org, home page: <http://www.cl.cam.ac.uk/~mgk25/>
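
The attachment handling Markus describes in the quoted message could be
sketched as follows, using Python's standard email package. The function names
are hypothetical, and the sketch assumes the file really is UTF-8; it is not
how any particular mail client does it.

```python
import codecs
from email.message import EmailMessage


def attach_utf8_file(msg: EmailMessage, data: bytes, filename: str) -> None:
    """Attach UTF-8 file contents as text/plain, labelled via the MIME charset."""
    # Strip the BOM: inside MIME the charset parameter identifies the encoding.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    text = data.decode("utf-8")
    # Produces a part with a text/plain Content-Type and charset utf-8.
    msg.add_attachment(text, subtype="plain", charset="utf-8", filename=filename)


def save_text_part(part: EmailMessage) -> bytes:
    """Return file bytes for a text/plain part, prefixed with the UTF-8 BOM."""
    # For platforms that expect a BOM at the start of plain-text UTF-8 files.
    return codecs.BOM_UTF8 + part.get_content().encode("utf-8")
```

The BOM lives only in the file on disk; the message itself carries the charset
in its header.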


