Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Steven Atreju <snatreju_at_googlemail.com>
Date: Mon, 16 Jul 2012 13:35:04 +0200

"Doug Ewell" <doug_at_ewellic.org> wrote:

 |Steven Atreju wrote:
 |
 |> If Unicode *defines* that the so-called BOM is in fact a Unicode-
 |> indicating tag that MUST be present,
 |
 |But Unicode does not define that.

Nope. On http://unicode.org/faq/utf_bom.html i read:

  Q: Why do some of the UTFs have a BE or LE in their label,
  such as UTF-16LE?

So it seems to me that the Unicode Consortium takes care of
newbies and those people who work at a very high programming
level, say, PHP, Flash, JavaScript or even no programming at all.
And:

  Q: Is the UTF-8 encoding scheme the same irrespective of whether
  the underlying processor is little endian or big endian?
  ...
  Where a BOM is used with UTF-8, it is only used as an ecoding
  signature to distinguish UTF-8 from other encodings — it has
  nothing to do with byte order.

Fifteen years ago i think i would have put effort into including the
BOM after reading this, for complete correctness! I'm pretty sure
that i really would have done so.

So, given that this page ranks 3 when searching for «utf-8 bom»
from within Germany, i would 1) fix the «ecoding» typo and 2)
change this to be less «neutral». The answer to «Q.» is
simply «Yes. Software should be able to strip an encoded BOM
in UTF-8, because some softish Unicode processors fail to do so
when converting between the different multi-octet UTF schemes.
Using a BOM with UTF-8 is not recommended.»
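
For what it's worth, stripping an encoded BOM is a few lines. The
following is only a sketch in Python, assuming the input is already
known to be UTF-8; the name strip_utf8_bom is made up for this
example and is not taken from any standard or library:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A UTF-8 byte string that carries the optional signature.
raw = codecs.BOM_UTF8 + "Grüße".encode("utf-8")

# Strip by hand, then decode as plain UTF-8.
text = strip_utf8_bom(raw).decode("utf-8")

# Python's "utf-8-sig" codec does the same thing while decoding.
text_via_codec = raw.decode("utf-8-sig")
```

Input without a BOM passes through unchanged, so the function can
be applied unconditionally.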

 |> I know that, in Germany, many, many small libraries become closed
 |> because there is not enough money available to keep up with the
 |> digital race, and even the greater *do* have problems to stay in
 |> touch!
 |
 |People like to complain about the BOM, but no libraries are shutting
 |down because of it. "Keeping up with the digital race" isn't about
 |handling two or three bytes at the beginning of a text file, in a way
 |that has been defined for two decades.

RFC 2279 doesn't note the BOM.

Looking at my 119,90.- German Mark Unicode 3.0 book, there is
indeed talk about the UTF-8 BOM. We have (2.7, page 28)
«Conformance to the Unicode Standard does not requires the use of
the BOM as such a signature» (typo taken plain; or is it no
typo?), and (13.6, page 324) «..never any questions of byte order
with UTF-8 text, this sequence can serve as signature for .. this
sequence of bytes will be extremely rare at the beginning of text
files in other encodings ... for example []Microsoft Windows[]».

So this is fine. It seems UTF-16 and UTF-32 were never meant for
data exchange and the BOM was really a byte order indicator for a
consumer that was aware of the encoding but not the byte order.
And UTF-8 got an additional «wohooo - i'm Unicode text» signature
tag, though optional. I like the term «extremely rare» sooo much!!
:-)
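
The «aware of the encoding but not the byte order» case can be
sketched like this in Python. It assumes the consumer already knows
the data is UTF-16; without that knowledge the check is ambiguous,
because the UTF-32LE BOM (FF FE 00 00) begins with the same two
bytes as the UTF-16LE one. The function name utf16_byte_order is
hypothetical:

```python
import codecs


def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of data that is already known to be UTF-16.

    Returns "little", "big", or "unknown" when no BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return "unknown"


le_data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
be_data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

# Python's generic "utf-16" codec reads the BOM and consumes it.
decoded_le = le_data.decode("utf-16")
decoded_be = be_data.decode("utf-16")
```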

I restart my «rant» UTF-8 filetype thread from the beginning now.
I wonder: was the Unicode Consortium really so unconfident? Do i
really read «UTF-8 will drown in this evil mess of terroristic
charsets, so raise the torch of freedom in this unfriendly
environment!»?
I have downloaded the 6.0 and 6.1 stuff as a PDF and for free (:->.

If you know how to deal with UTF-8, you can deal with UTF-8.
If you don't, no signature ever will help you, no?!

If you don't know the charset of some text, that comes from
nowhere, i.e., no container format with meta-information, no
filetype extension with implicit meta-information, as is used on
Mac OS and DOS, then UTF-8 is still very easily identifiable by
itself due to the way the algorithm is designed. Is it??
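
One way to test that claim, as a rough sketch only: try a strict
UTF-8 decode and see whether it fails. This is a heuristic, not
proof. Pure ASCII always passes (it is valid UTF-8), and a short
text in a legacy charset can pass by coincidence, though that
becomes unlikely as soon as it contains a few non-ASCII bytes. The
name looks_like_utf8 is made up for this example:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (no BOM required)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_sample = "Grüße aus Köln".encode("utf-8")
latin1_sample = "Grüße aus Köln".encode("latin-1")

utf8_ok = looks_like_utf8(utf8_sample)
latin1_ok = looks_like_utf8(latin1_sample)
```

The Latin-1 sample is rejected because its bytes for «ü», «ß» and
«ö» do not form valid UTF-8 sequences.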

Tear down the wall!
Tear down the wall!
Tear down the wall!

 |It's about technologies and
 |standards and platforms and formats that change incompatibly every few
 |years.

That is of course true.

But what to do with these myriads of aggressive nerds that linger
in these neon-enlightened four square meter boxes, with their
poignant hunger for penthouse windows and four-cylinder
Mercedes-Benz limousines? I'm asking you. I've seen photos of
standards committees in palm-covered bays (CSS2? DOM? W3M
anyway), i've dropped my subscription to regular IETF discussion
because i can stand only so many dozens of dinner,
hotel-room reservation, «laptop-compatible socket in Paris?» and
whatever threads (the annual ladies steakhouse meeting!). So here
you are. These people have deserved it, and no better.

  Steven
Received on Mon Jul 16 2012 - 06:43:19 CDT
