Re: pre-HTML5 and the BOM

From: Philippe Verdy <>
Date: Fri, 13 Jul 2012 18:33:35 +0200

> From: Jukka K. Korpela <>
>> "When the BOM is used in web pages or editors for UTF-8 encoded content it
>> can sometimes introduce blank spaces or short sequences of strange-looking
>> characters (such as ï»¿). For this reason, it is usually best for
>> interoperability to omit the BOM, when given a choice, for UTF-8 content."
>> In reality, BOM surely helps rather than hurts, especially when a document
>> is saved locally and HTTP headers are thereby lost. Authoring tools may have
>> problems with it (and then again, some tools have problems with UTF-8 files
>> that _lack_ BOM).

This statement about maximum interoperability may have been true in
the past, when Unicode support was not so universal and had not yet
been formally adopted for all newer developments in RFCs published by
the IETF. But now the situation is reversed: maximum interoperability
is offered when BOMs are present, not really to indicate the byte
order itself, but to confirm that the content is Unicode-encoded and
extremely likely to be text rather than arbitrary binary content
(which today almost always uses a distinctive leading signature).
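
To illustrate the point about the BOM acting as a signature, here is a
minimal Python sketch of BOM sniffing. The names sniff_bom and
decode_text, and the cp1252 fallback, are assumptions made up for this
example; they only stand in for whatever host-specific default a given
system would otherwise apply.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if present, else the assumed fallback encoding."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        # No signature: we are back to guessing a host-specific default.
        return data.decode(fallback, errors="replace")
    return data[skip:].decode(encoding)
```

With the BOM, the UTF-8 bytes for "é" decode correctly; without it, the
assumed cp1252 fallback turns the same bytes into "Ã©", which is
exactly the kind of strange-looking sequence quoted above.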

Without the BOM we remain in the old practice of using host-specific
and unspecified default encodings, which do not survive any
transmission from one system to another, or from one user to another.
The worst case is when the default decoding depends on the user
viewing the service, only because he speaks a different language whose
basic settings imply a different default encoding. Users generally
don't know how to set encodings, and will refuse to change their
environment constantly depending on the services or contents they want
to access. This does not work today, when we live in a world of
applications and services provided from many simultaneous sources in a
highly heterogeneous worldwide network.

BOMs help much more than they hurt today. Most places where they hurt
are systems that should have been updated long ago because of the many
security holes discovered in them, which are constantly exploited by
attacks and only partly fixed by security suites. Those old systems
also very frequently perform much worse now (on the same hardware
resources), notably in everything related to filesystems and to
Internet protocols and their clients such as web browsers.

Even if we don't re-encode the archives, we now have very simple and
fast conversion tools that allow "reconnecting" these archives to the
modern world in a transparent way. Since they are archives, the data
they store is read-only and can be accessed through a transparent
filter, whose speed can also be improved by internal caching using
newer, faster, and cheaper storage solutions. These transparent,
caching proxies also help preserve the precious archives.
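
As a rough sketch of such a transparent, caching filter in Python: the
function name read_archived and the cp1252 legacy encoding are
assumptions for illustration only, and the in-memory cache stands in
for whatever faster storage a real proxy would use.

```python
import codecs
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=256)
def read_archived(path: str, legacy_encoding: str = "cp1252") -> bytes:
    """Return the file's text as UTF-8 with a BOM; the file is not modified."""
    raw = Path(path).read_bytes()
    if raw.startswith(codecs.BOM_UTF8):
        return raw  # already signed as UTF-8, pass through unchanged
    # Assumed legacy encoding; undecodable bytes become U+FFFD.
    text = raw.decode(legacy_encoding, errors="replace")
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

Caching the converted bytes is only safe here because the archived
files are read-only, as noted above; a filter over mutable files would
need some form of invalidation.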
Received on Fri Jul 13 2012 - 11:37:23 CDT
