Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Tue, 17 Jul 2012 16:13:49 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

 |2012/7/16 Steven Atreju <snatreju_at_googlemail.com>:
 |> Fifteen years ago i think i would have put effort in including the
 |> BOM after reading this, for complete correctness! I'm pretty sure
 |> that i really would have done so.
 |
 |Fifteen years ago I would not have advocated it. Simply because
 |support of UTF-8 was very poor (and there were even differences of
 |interpretations between the ISO/IEC definition and the Unicode
 |definition, notably differences for the conformance requirements).
 |This is no longer the case.
 |
 |> So, given that this page ranks 3 when searching for «utf-8 bom»
 |> from within Germany i would 1), fix the «ecoding» typo and 2)
 |> would change this to be less «neutral». The answer to «Q.» is
 |> simply «Yes. Software should be capable to strip an encoded BOM
 |> in UTF, because some softish Unicode processors fail to do so when
 |> converting in between different multioctet UTF schemes. Using BOM
 |> with UTF-8 is not recommended.»
 |>
 |> |> I know that, in Germany, many, many small libraries become closed
 |> |> because there is not enough money available to keep up with the
 |> |> digital race, and even the greater *do* have problems to stay in
 |> |> touch!
 |> |
 |> |People like to complain about the BOM, but no libraries are shutting
 |> |down because of it. "Keeping up with the digital race" isn't about
 |> |handling two or three bytes at the beginning of a text file, in a way
 |> |that has been defined for two decades.
 |>
 |> RFC 2279 doesn't note the BOM.
 |>
 |> Looking at my 119,90.- German Mark Unicode 3.0 book, there is
 |> indeed talk about the UTF-8 BOM. We have (2.7, page 28)
 |> «Conformance to the Unicode Standard does not requires the use of
 |> the BOM as such a signature» (typo taken plain; or is it no
 |> typo?), and (13.6, page 324) «..never any questions of byte order
 |> with UTF-8 text, this sequence can serve as signature for .. this
 |> sequence of bytes will be extremely rare at the beginning of text
 |> files in other encodings ... for example []Microsoft Windows[]».
 |>
 |> So this is fine. It seems UTF-16 and UTF-32 were never meant for
 |> data exchange and the BOM was really a byte order indicator for a
 |> consumer that was aware of the encoding but not the byte order.
 |> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
 |> tag, though optional. I like the term «extremely rare» sooo much!!
 |> :-)
 |
 |No need to rant. There's the evidence that the role of BOM in UTF-8
 |has been to help the migration from legacy charsets to Unicode, to
 |avoid mojibake. And this role is still important. As UTF-8 became
 |prominent in interchange, and the need for migration from older
 |encodings grew considerably, this small signature has helped to tell
 |which files were converted or not, even if there was no meta data
 |(meta data is frequently dropped as soon as the resource is no longer
 |on a web server, but stored in a file of a local filesystem).
 |
 |As there are still a lot of local resources using other encodings, the
 |signature really helps managing the local contents. And more and more
 |applications will recognize this signature automatically to avoid
 |using the default legacy encodings of the local system (something they
 |still do in absence of meta data and of the BOM): you no longer need
 |to use a menu in apps to select the proper encoding (most often it is
 |not available, or requires restarting the application or cancelling an
 |ongoing transaction, and we still frequently have to manage the
 |situation where resources using legacy local encodings and those in
 |UTF-8 are mixed in the application).
 |
 |The BOM is then extremely useful in a transition that will last
 |several decades (or more), whenever a resource is not strictly
 |bound to the 7-bit US-ASCII subset.

I disagree, disagree, disagree :).

 |I am also convinced that even Shell interpreters on Linux/Unix should
 |recognize and accept the leading BOM before the hash/bang starting
 |line (which is commonly used for filetype identification and runtime
 |behavior), without claiming that they don't know what to do to run the
 |file or which shell interpreter to use.

Please let it be as agnostic as it is.
While watching the parade i've noticed that some standard Renault
trucks did not have a soot filter. That's a complete no-go. We
were shocked.
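
To make the hash/bang point concrete, here is a minimal sketch, assuming
the common behaviour where the loader simply compares the first two bytes
of the file against "#!" without knowing anything about encodings. Both
function names are hypothetical and this is not taken from any kernel or
shell source; the second function only shows what a BOM-aware check would
have to add.

```c
#include <stddef.h>

/* Encoding-agnostic check: the file must start with the bytes '#' '!'. */
static int starts_with_shebang(const unsigned char *buf, size_t n)
{
    return n >= 2 && buf[0] == '#' && buf[1] == '!';
}

/* Hypothetical BOM-aware variant: skip EF BB BF first, then check. */
static int starts_with_shebang_bom_aware(const unsigned char *buf, size_t n)
{
    if (n >= 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf) {
        buf += 3;
        n -= 3;
    }
    return starts_with_shebang(buf, n);
}
```

With a script that begins EF BB BF 23 21, the agnostic check fails because
the first byte is 0xEF and not '#'; only the second variant accepts it.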

 |PHP itself should be allowed to use it as well (but unfortunately it
 |still does not have the concept of tracking the effective encoding to
 |parse its scripts simply).
 |
 |Yes this requires modifying the database of filetype signatures, but
 |this type of update has long been necessary for handling
 |more and more filetypes (see for example the frequent updates and the
 |growth of the "/etc/magic" database used by the Unix/Linux tool
 |"file").

But i'm lucky that you mention this tool, since i've forgotten to
do so in my last post. It appeared first in 1973 and is a
standardized POSIX application and a part of all operating systems
i currently want to know of, including Mac OS X. It handles the
UTF-8 BOM the right way, possibly the only really right way. And
here is how:

 |looks_utf8_with_BOM(const unsigned char *buf, size_t nbytes, unichar *ubuf,
 |    size_t *ulen)
 |{
 |        if (nbytes > 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
 |                return file_looks_utf8(buf + 3, nbytes - 3, ubuf, ulen);
 |        else
 |                return -1;
 |}

So, if there is a BOM, check the rest for normal UTF-8 text.
(Without knowing all the details of the file(1) internals, i think
the heuristic won't match *without* treating the BOM in a special way.)
Better that is.
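
The same idea as a self-contained sketch, for anyone who wants to try it
outside of file(1). The helper is_valid_utf8() is a hypothetical stand-in
for file_looks_utf8() and checks well-formedness only; the real function
has a different signature (it also takes ubuf and ulen) and may well do
more, so take this as an illustration under that assumption, not as the
file(1) implementation.

```c
#include <stddef.h>

/* Hypothetical stand-in for file_looks_utf8(): returns 1 if buf holds
 * well-formed UTF-8, 0 otherwise. Rejects overlong forms, surrogates and
 * code points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = buf[i];
        size_t need;            /* number of continuation bytes expected */
        unsigned long cp, min;  /* decoded code point, smallest legal value */

        if (c < 0x80) { i++; continue; }
        else if ((c & 0xe0) == 0xc0) { need = 1; cp = c & 0x1f; min = 0x80; }
        else if ((c & 0xf0) == 0xe0) { need = 2; cp = c & 0x0f; min = 0x800; }
        else if ((c & 0xf8) == 0xf0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation byte or invalid lead */

        if (n - i - 1 < need)
            return 0;           /* sequence truncated at end of buffer */
        for (size_t k = 1; k <= need; k++) {
            if ((buf[i + k] & 0xc0) != 0x80)
                return 0;
            cp = (cp << 6) | (buf[i + k] & 0x3f);
        }
        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return 0;
        i += need + 1;
    }
    return 1;
}

/* Mirrors the quoted logic: -1 if there is no BOM (or nothing after it),
 * otherwise the verdict on the bytes following the BOM. */
static int utf8_with_bom(const unsigned char *buf, size_t n)
{
    if (n > 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
        return is_valid_utf8(buf + 3, n - 3);
    return -1;
}
```

Note that, as in the original, a buffer consisting of the three BOM bytes
alone yields -1, because the test is nbytes > 3 and not nbytes >= 3.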

  Steven
Received on Tue Jul 17 2012 - 09:18:03 CDT