Re: UTF-8 BOM (Re: Charset declaration in HTML) from Steven Atreju on 2012-07-17 (Unicode Mail List Archive)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Tue, 17 Jul 2012 16:13:49 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

|2012/7/16 Steven Atreju <snatreju_at_googlemail.com>:
|> Fifteen years ago i think i would have put effort in including the
|> BOM after reading this, for complete correctness! I'm pretty sure
|> that i really would have done so.
|
|Fifteen years ago I would not have advocated it. Simply because
|support of UTF-8 was very poor (and there were even differences of
|interpretations between the ISO/IEC definition and the Unicode
|definition, notably differences for the conformance requirements).
|This is no longer the case.
|
|> So, given that this page ranks 3 when searching for «utf-8 bom»
|> from within Germany i would 1), fix the «ecoding» typo and 2)
|> would change this to be less «neutral». The answer to «Q.» is
|> simply «Yes. Software should be capable to strip an encoded BOM
|> in UTF, because some softish Unicode processors fail to do so when
|> converting in between different multioctet UTF schemes. Using BOM
|> with UTF-8 is not recommended.»
|>
|> |> I know that, in Germany, many, many small libraries become closed
|> |> because there is not enough money available to keep up with the
|> |> digital race, and even the greater *do* have problems to stay in
|> |> touch!
|> |
|> |People like to complain about the BOM, but no libraries are shutting
|> |down because of it. "Keeping up with the digital race" isn't about
|> |handling two or three bytes at the beginning of a text file, in a way
|> |that has been defined for two decades.
|>
|> RFC 2279 doesn't note the BOM.
|>
|> Looking at my 119,90.- German Mark Unicode 3.0 book, there is
|> indeed talk about the UTF-8 BOM. We have (2.7, page 28)
|> «Conformance to the Unicode Standard does not requires the use of
|> the BOM as such a signature» (typo taken plain; or is it no
|> typo?), and (13.6, page 324) «..never any questions of byte order
|> with UTF-8 text, this sequence can serve as signature for .. this
|> sequence of bytes will be extremely rare at the beginning of text
|> files in other encodings ... for example []Microsoft Windows[]».
|>
|> So this is fine. It seems UTF-16 and UTF-32 were never meant for
|> data exchange and the BOM was really a byte order indicator for a
|> consumer that was aware of the encoding but not the byte order.
|> And UTF-8 got an additional «wohooo - i'm Unicode text» signature
|> tag, though optional. I like the term «extremely rare» sooo much!!
|> :-)
|
|No need to rant. There's the evidence that the role of BOM in UTF-8
|has been to help the migration from legacy charsets to Unicode, to
|avoid mojibake. And this role is still important. As UTF-8 became
|prominent in interchange, and the need for migration from older
|encodings grew considerably, this small signature has helped in knowing
|which files were converted and which were not, even if there was no
|metadata (metadata is frequently dropped as soon as the resource is no
|longer on a web server, but stored in a file on a local filesystem).
|
|As there are still a lot of local resources using other encodings, the
|signature really helps managing the local contents. And more and more
|applications will recognize this signature automatically to avoid
|using the default legacy encodings of the local system (something they
|still do in the absence of metadata and of the BOM): you no longer need
|to use a menu in apps to select the proper encoding (most often it is
|not available, or requires restarting the application or cancelling an
|ongoing transaction, and we still frequently have to manage the
|situation where resources using legacy local encodings and those in
|UTF-8 are mixed in the application).
|
|The BOM is then extremely useful in a transition that will last
|several decades (or more), whenever a resource is not strictly
|bound to the 7-bit US-ASCII subset.

I disagree, disagree, disagree :).

|I am also convinced that even Shell interpreters on Linux/Unix should
|recognize and accept the leading BOM before the hash/bang starting
|line (which is commonly used for filetype identification and runtime
|behavior), without claiming that they don't know what to do to run the
|file or which shell interpreter to use.

Please let it be as agnostic as it is.
While watching the parade i've noticed that some standard Renault
trucks did not have a soot filter. That's a complete no-go. We
were shocked.

|PHP itself should be allowed to use it as well (but unfortunately it
|still does not have the concept of tracking the effective encoding so
|that its scripts can be parsed simply).
|
|Yes this requires modifying the database of filetype signatures, but
|this type of update has long been necessary for handling
|more and more filetypes (see for example the frequent updates and the
|growth of the "/etc/magic" database used by the Unix/Linux tool
|"file").

But i'm lucky that you mention this tool, since i've forgotten to
do so in my last post. It appeared first in 1973 and is a
standardized POSIX application and a part of all operating systems
i currently want to know of, including Mac OS X. It handles the
UTF-8 BOM the right way, possibly the only really right way. And
here is how:

|looks_utf8_with_BOM(const unsigned char *buf, size_t nbytes, unichar *ubuf,
|    size_t *ulen)
|{
|    if (nbytes > 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
|        return file_looks_utf8(buf + 3, nbytes - 3, ubuf, ulen);
|    else
|        return -1;
|}

So, if there is a BOM, check the rest for normal UTF-8 text.
(Without knowing all the details of the file(1) internals, i think
the heuristic won't match *without* treating the BOM in a special way.)
Better that is.
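
For illustration only, the same idea as a self-contained sketch. This is
not the file(1) source: is_valid_utf8() and utf8_with_bom() are
hypothetical names, the strict validator is a stand-in for
file_looks_utf8() (whose internals are not shown above), and the return
convention (1 = BOM plus valid UTF-8, 0 = BOM but invalid rest,
-1 = no BOM) is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for file_looks_utf8(): returns 1 if buf[0..n) is
 * well-formed UTF-8, else 0. Rejects overlong forms, surrogates and code
 * points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *buf, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = buf[i];
        unsigned long cp, min;
        size_t len;

        if (c < 0x80) {             /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xe0) == 0xc0) {
            len = 2; cp = c & 0x1f; min = 0x80;
        } else if ((c & 0xf0) == 0xe0) {
            len = 3; cp = c & 0x0f; min = 0x800;
        } else if ((c & 0xf8) == 0xf0) {
            len = 4; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;               /* stray continuation or invalid lead byte */
        }

        if (n - i < len)
            return 0;               /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((buf[i + k] & 0xc0) != 0x80)
                return 0;           /* not a continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3f);
        }
        if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
            return 0;               /* overlong, out of range, or surrogate */
        i += len;
    }
    return 1;
}

/* Mirrors the shape of the quoted function: only if the three signature
 * bytes EF BB BF are present (and something follows them) is the rest
 * checked as ordinary UTF-8. Return values are this sketch's convention. */
static int utf8_with_bom(const unsigned char *buf, size_t nbytes)
{
    assert(buf != NULL);
    if (nbytes > 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
        return is_valid_utf8(buf + 3, nbytes - 3);
    return -1;
}
```

Note that, as in the quoted code, a buffer consisting of nothing but the
three signature bytes is reported as «no BOM» because of the
nbytes > 3 test.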

Steven
Received on Tue Jul 17 2012 - 09:18:03 CDT
