Re: UTF-8 BOM (Re: Charset declaration in HTML)

From: Doug Ewell <doug_at_ewellic.org>
Date: Mon, 16 Jul 2012 11:51:00 -0700

Steven Atreju wrote:

> Q: Is the UTF-8 encoding scheme the same irrespective of whether
> the underlying processor is little endian or big endian?
> ...
> Where a BOM is used with UTF-8, it is only used as an ecoding
> signature to distinguish UTF-8 from other encodings — it has
> nothing to do with byte order.
> ...
> So, given that this page ranks 3 when searching for «utf-8 bom» from
> within Germany i would 1), fix the «ecoding» typo and 2) would change
> this to be less «neutral». The answer to «Q.» is simply «Yes.
> Software should be capable to strip an encoded BOM in UTF, because
> some softish Unicode processors fail to do so when converting in
> between different multioctet UTF schemes. Using BOM with UTF-8 is not
> recommended.»

That's an answer to a different question. Yes, the UTF-8 encoding scheme
is the same irrespective of whether the underlying processor is
little-endian or big-endian. The FAQ question you quoted doesn't address
whether BOM is desirable for UTF-8. This is one reason I prefer the term
"signature" or "U+FEFF" instead of "BOM" when talking about UTF-8.
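As a small illustration of why "signature" fits UTF-8 better than "byte order mark", here is a sketch assuming Python 3 and its standard codecs (the variable names are only for illustration). It encodes U+FEFF three ways: the UTF-8 form is one fixed byte sequence, while the two UTF-16 forms differ only in byte order.

```python
import codecs

# U+FEFF, the code point used both as a byte order mark and as a signature.
SIGNATURE = "\ufeff"

# UTF-8 is defined as a sequence of bytes, so there is exactly one encoding,
# whatever the endianness of the processor doing the encoding.
utf8 = SIGNATURE.encode("utf-8")

# UTF-16 is built from 16-bit code units, so the same code point has two
# possible serializations, one per byte order.
utf16_le = SIGNATURE.encode("utf-16-le")
utf16_be = SIGNATURE.encode("utf-16-be")

print("UTF-8:    ", utf8.hex(" "))
print("UTF-16LE: ", utf16_le.hex(" "))
print("UTF-16BE: ", utf16_be.hex(" "))

# The standard library exposes the same byte sequences as constants.
print(utf8 == codecs.BOM_UTF8)
print(utf16_le == codecs.BOM_UTF16_LE, utf16_be == codecs.BOM_UTF16_BE)
```

The UTF-8 bytes carry no byte-order information at all; they can only tell a consumer "this is probably UTF-8".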

> RFC 2279 doesn't note the BOM.

RFC 2279 was superseded by RFC 3629 almost nine years ago. RFC 3629 has
a whole section (6) about the U+FEFF signature.
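For readers who want to see what handling the signature looks like in practice, here is a minimal sketch, again assuming Python 3. The helper name `strip_signature` and the sample text are hypothetical and come from neither RFC; the `utf-8-sig` codec is part of Python's standard library.

```python
import codecs

def strip_signature(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

plain = "Hello".encode("utf-8")
signed = codecs.BOM_UTF8 + plain

# Stripping is a no-op on unsigned data and removes exactly three bytes
# from signed data.
print(strip_signature(plain) == plain)
print(strip_signature(signed) == plain)

# The plain "utf-8" codec keeps U+FEFF as the first character of the text,
# while "utf-8-sig" drops one leading signature and otherwise decodes the same.
kept = signed.decode("utf-8")
dropped = signed.decode("utf-8-sig")
print(repr(kept), repr(dropped))
```

Only a signature at the very start of the data is removed; a U+FEFF occurring later in the text is left untouched by both the helper and the codec.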

> Looking at my 119,90.- German Mark Unicode 3.0 book,

The Unicode 3.0 book was an excellent resource, but it was released
almost 12 years ago. Some of it may not reflect the latest information
or recommendations.

> there is indeed talk about the UTF-8 BOM. We have (2.7, page 28)
> «Conformance to the Unicode Standard does not requires the use of the
> BOM as such a signature» (typo taken plain; or is it no typo?), and
> (13.6, page 324) «..never any questions of byte order with UTF-8 text,
> this sequence can serve as signature for .. this sequence of bytes
> will be extremely rare at the beginning of text files in other
> encodings ... for example []Microsoft Windows[]».
>
> So this is fine. It seems UTF-16 and UTF-32 were never ment for data
> exchange and the BOM was really a byte order indicator for a consumer
> that was aware of the encoding but not the byte order.

The part of 13.6 you quoted doesn't make any statement at all about
UTF-16 or UTF-32. Back when Unicode was conceived, the 16-bit format was
the only one envisioned for data exchange.

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Received on Mon Jul 16 2012 - 17:53:36 CDT
