Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 16 Jul 2012 20:15:37 +0200

2012/7/16 Steven Atreju <snatreju_at_googlemail.com>:
> Fifteen years ago i think i would have put effort in including the
> BOM after reading this, for complete correctness! I'm pretty sure
> that i really would have done so.

Fifteen years ago I would not have advocated it, simply because
support for UTF-8 was very poor (and there were even differences of
interpretation between the ISO/IEC definition and the Unicode
definition, notably in the conformance requirements).
This is no longer the case.

> So, given that this page ranks 3 when searching for «utf-8 bom»
> from within Germany i would 1), fix the «ecoding» typo and 2)
> would change this to be less «neutral». The answer to «Q.» is
> simply «Yes. Software should be capable to strip an encoded BOM
> in UTF, because some softish Unicode processors fail to do so when
> converting in between different multioctet UTF schemes. Using BOM
> with UTF-8 is not recommended.»
>
> |> I know that, in Germany, many, many small libraries become closed
> |> because there is not enough money available to keep up with the
> |> digital race, and even the greater *do* have problems to stay in
> |> touch!
> |
> |People like to complain about the BOM, but no libraries are shutting
> |down because of it. "Keeping up with the digital race" isn't about
> |handling two or three bytes at the beginning of a text file, in a way
> |that has been defined for two decades.
>
> RFC 2279 doesn't note the BOM.
>
> Looking at my 119,90.- German Mark Unicode 3.0 book, there is
> indeed talk about the UTF-8 BOM. We have (2.7, page 28)
> «Conformance to the Unicode Standard does not requires the use of
> the BOM as such a signature» (typo taken plain; or is it no
> typo?), and (13.6, page 324) «..never any questions of byte order
> with UTF-8 text, this sequence can serve as signature for .. this
> sequence of bytes will be extremely rare at the beginning of text
> files in other encodings ... for example []Microsoft Windows[]».
>
> So this is fine. It seems UTF-16 and UTF-32 were never ment for
> data exchange and the BOM was really a byte order indicator for a
> consumer that was aware of the encoding but not the byte order.
> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
> tag, though optional. I like the term «extremely rare» sooo much!!
> :-)

No need to rant. The evidence is that the role of the BOM in UTF-8
has been to help the migration from legacy charsets to Unicode and to
avoid mojibake. This role is still important. As UTF-8 became
prominent in interchange and the need to migrate from older
encodings grew considerably, this small signature has helped tell
which files had been converted and which had not, even when there was
no metadata (metadata is frequently dropped as soon as the resource is
no longer on a web server but stored as a file on a local filesystem).

As there are still a lot of local resources using other encodings, the
signature really helps in managing local content. More and more
applications recognize this signature automatically instead of falling
back to the default legacy encoding of the local system (something they
still do in the absence of both metadata and the BOM): you no longer
need to use a menu in the application to select the proper encoding
(most often such a menu is not available, or requires restarting the
application or cancelling an ongoing transaction, and we still
frequently have to manage situations where resources in legacy local
encodings and resources in UTF-8 are mixed in the same application).
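
To make the behaviour described above concrete, here is a minimal
sketch in Python of an application that honours the UTF-8 signature
and otherwise falls back to a legacy default. The function name
decode_with_signature and the choice of "cp1252" as the legacy
encoding are assumptions made for illustration only; a real
application would take the default from the local system.

```python
import codecs


def decode_with_signature(data: bytes, legacy_encoding: str = "cp1252") -> str:
    """Decode raw file bytes, preferring the UTF-8 signature when present.

    legacy_encoding is a hypothetical stand-in for the local default.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Signature found: strip the three BOM bytes and decode as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature and no metadata: fall back to the legacy default.
    return data.decode(legacy_encoding, errors="replace")


with_bom = codecs.BOM_UTF8 + "café".encode("utf-8")
without_bom = "café".encode("utf-8")

print(decode_with_signature(with_bom))     # decoded correctly as UTF-8
print(decode_with_signature(without_bom))  # mojibake from the legacy fallback
```

The second call shows the failure mode the signature prevents: the
same UTF-8 bytes, without the BOM, are misread through the legacy
encoding.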

The BOM is therefore extremely useful in a transition that will last
several decades (or more), whenever a resource is not strictly
bound to the 7-bit US-ASCII subset.

I am also convinced that even shell interpreters on Linux/Unix should
recognize and accept a leading BOM before the hash-bang line (which
is commonly used for filetype identification and runtime behavior),
instead of claiming that they don't know how to run the file or which
shell interpreter to use.
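
As an illustration of what such tolerance could look like, here is a
small Python sketch that finds the interpreter on a hash-bang line
whether or not a UTF-8 BOM precedes it. This is not how any existing
shell or kernel behaves; the function name interpreter_from_shebang
is hypothetical, and the parsing is deliberately simplified (it only
returns the first word after "#!").

```python
import codecs
from typing import Optional


def interpreter_from_shebang(first_line: bytes) -> Optional[str]:
    """Return the interpreter path named on a hash-bang line, if any.

    A leading UTF-8 BOM is skipped before looking for "#!".
    """
    if first_line.startswith(codecs.BOM_UTF8):
        # Ignore the signature so the "#!" check sees the real start.
        first_line = first_line[len(codecs.BOM_UTF8):]
    if not first_line.startswith(b"#!"):
        return None
    parts = first_line[2:].strip().split()
    if not parts:
        return None
    return parts[0].decode("utf-8", errors="replace")


print(interpreter_from_shebang(b"#!/bin/sh\n"))
print(interpreter_from_shebang(codecs.BOM_UTF8 + b"#!/bin/sh\n"))
```

Both calls name the same interpreter; without the BOM check, the
second line would not be recognized as a hash-bang line at all.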

PHP itself should be allowed to use it as well (but unfortunately it
still has no concept of tracking the effective encoding in which to
parse its scripts).

Yes, this requires modifying the database of filetype signatures, but
this type of update has long been necessary anyway to handle more and
more filetypes (see for example the frequent updates to, and the
growth of, the "/etc/magic" database used by the Unix/Linux tool
"file").
Received on Mon Jul 16 2012 - 17:10:10 CDT
