Re: pre-HTML5 and the BOM

From: Asmus Freytag
Date: Fri, 13 Jul 2012 15:28:01 -0700

On 7/13/2012 2:42 PM, David Starner wrote:
> On Fri, Jul 13, 2012 at 1:29 PM, Jukka K. Korpela wrote:
>> 2012-07-13 22:37, David Starner wrote:
>>> Wikipedia says "The Unicode standard recommends against the BOM for
>>> UTF-8." and refers to page 30 of the Unicode Standard, version 6.0,
>>> that says "Use of a BOM is neither required nor recommended for
>>> UTF-8..." Calling it a myth seems bizarre.
>> “Not recommended” is distinct from “recommends against”.
> I disagree; the meaning of the two phrases overlaps in my idiolect, and
> while it would be somewhat laconic, I might use "not recommended" to
> mean "if you insist on doing that, please give us a chance to get the
> fire extinguisher first",
I can state confidently and unequivocally that it is not used in that
sense in the standard, and by reading the whole phrase it's clear that
it is intended as a statement of neutrality on the part of the Unicode
Standard - one that respects the difference between a character
encoding and a data transmission (or file format) protocol.
>> A
>> more appropriate formulation would be “Use of a BOM is not required for UTF-8,
>> but may be used as a signature that indicates, with practical certainty,
>> that data is UTF-8 encoded.”
> In the environment that UTF-8 was developed for, a BOM is a nuisance;
> a BOM will stop the shell from properly interpreting a hashbang, and
> other existing programs will lose the BOM, duplicate the BOM, and
> scatter BOMs throughout files. Given the number of text-like file
> formats (like old-school PNM) and number of scripts depending on
> existing behavior, these aren't going to be changed.
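
To make the hashbang point concrete, here is a minimal Python sketch. It
assumes that whatever loads the script tests only whether the first two
bytes are "#!"; the helper name starts_with_shebang is hypothetical and
the script text is an invented example. It also shows how naive
concatenation leaves a stray U+FEFF in the middle of the data.

```python
import codecs

SHEBANG = b"#!"


def starts_with_shebang(data: bytes) -> bool:
    """Return True if the very first two bytes are '#!' (hypothetical helper)."""
    return data[:2] == SHEBANG


script = "#!/bin/sh\necho hello\n"
plain = script.encode("utf-8")
# Python's "utf-8-sig" codec prepends the three-byte UTF-8 BOM (EF BB BF).
with_bom = script.encode("utf-8-sig")

print(starts_with_shebang(plain))      # the "#!" is at offset 0
print(starts_with_shebang(with_bom))   # the "#!" is pushed to offset 3
print(with_bom[:3] == codecs.BOM_UTF8)

# Concatenating two BOM-prefixed files byte for byte leaves a second BOM
# in the middle; decoding strips only the leading one.
joined = (with_bom + with_bom).decode("utf-8-sig")
print(joined.count("\ufeff"))
```

A tool that is not BOM-aware would have to strip or skip the signature
explicitly, which is the change the quoted text argues will not happen
for existing scripts and formats.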

I think it's the cost of doing business. Unix was successful in getting
the web to use UTF-8 files, rather than UTF-16 etc., as the basis for
the exchange of markup language data. In environments that are
predicated on mandatory conversion TO Unicode, not knowing whether
something is legacy "text" or "utf-8" text isn't as benign as it might
be in the Unix environment. Hence the implementation of the UTF-8 BOM
there.
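
As a rough illustration of why the signature matters when conversion to
Unicode is mandatory, here is a minimal Python sketch. The function name
decode_to_unicode and the choice of cp1252 as the legacy code page are
assumptions made for the example, not something any particular system is
known to do.

```python
import codecs


def decode_to_unicode(data: bytes, legacy_encoding: str = "cp1252") -> str:
    """Decode bytes to str, treating a leading UTF-8 BOM as a signature.

    Hypothetical helper: without the signature, the data is assumed to be
    in the legacy code page, which is the situation described above.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(legacy_encoding)


utf8_bytes = "caf\u00e9".encode("utf-8")          # "cafe" with e-acute

signed = decode_to_unicode(codecs.BOM_UTF8 + utf8_bytes)
unsigned = decode_to_unicode(utf8_bytes)

print(signed == "caf\u00e9")                      # decoded as UTF-8
print(unsigned == "caf\u00c3\u00a9")              # misread as cp1252 (mojibake)
```

With the signature the text converts cleanly; without it the same bytes
are silently misread as the legacy code page.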

> As I said before, Unicode simplified but did not solve the fact that
> text from other operating systems requires some modification before
> working just right. But I don't think that Unicode should recommend
> unconditionally the UTF-8 BOM, because it is problematic in the field
> of use UTF-8 was created for and is still used for.

And, as you can see, Unicode, as a standard, is neutral on the issue.

For precisely that reason!

Received on Fri Jul 13 2012 - 17:33:16 CDT
