Re: (Informational only: UTF-8 BOM and the real life)

From: Steven Atreju
Date: Sat, 28 Jul 2012 14:38:50 +0200

"Leif H Silli" wrote:

 |Steven Atreju on 28/7/'12, 0:22:
 |> "Doug Ewell" wrote:
 |> |> Well, i still see a bug in the Unicode Standard here.
 |> |> Whereas for the multioctet UTFs there is «The BOM is not
 |> |> considered part of the content of the text» (Conformance, 3.10,
 |> |> D98, D101), i cannot find any such clarifying text for its usage
 |> |> as a signature.
 |> |
 |> |There really isn't as much difference between using U+FEFF "as a byte
 |> |order mark" and using it "as a signature" as this makes it seem. The
 |> |definitions you quote have to do with whether U+FEFF is treated as a
 |> |BOM/signature or as a zero-width no-break space.
 |> I really think that a clarification in equal spirit to those of
 |> D98 and D101 (but maybe with different content :) would be an
 |> improvement of the Unicode Standard.
 |I agree with Doug that there is no enormous difference between "BOM" and "encoding signature". In XML 1.0 the BOM is in fact described as a signature regardless of which Unicode encoding it is used with:

Yes, simply stated and clarified like that, and everybody
knows what they are dealing with.
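
To make the "signature" reading concrete, here is a minimal Python
sketch (not taken from any standard) that treats a leading U+FEFF
purely as an encoding signature and does not count it as content.
The name sniff_signature is hypothetical; the BOM constants come
from Python's standard codecs module.

```python
import codecs

# Longest signatures first: the UTF-32LE BOM (FF FE 00 00) starts with
# the UTF-16LE BOM (FF FE), so UTF-32 has to be tested before UTF-16.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes):
    """Return (encoding, payload) with the signature removed,
    or (None, data) if no known signature is found."""
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


encoding, payload = sniff_signature(b"\xef\xbb\xbfhello")
print(encoding, payload.decode(encoding))
```

One caveat of this sketch: the bytes FF FE 00 00 could also be a
UTF-16LE signature followed by U+0000, so the guess is a heuristic
and not a proof.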

And btw., my local copy of XML 1.1 (Second Edition, thus current)
doesn't include this paragraph (in the referenced 4.3.3):

  |If the replacement text of an external entity is to begin with
  |the character U+FEFF, and no text declaration is present, then
  |a Byte Order Mark MUST be present, whether the entity is encoded
  |in UTF-8 or UTF-16.
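
A short Python sketch of what that paragraph implies for UTF-8,
assuming a consumer that strips exactly one leading signature
(Python's "utf-8-sig" codec behaves that way; whether a given XML
processor does is not shown here): for the decoded text to begin
with U+FEFF, the byte stream has to carry a signature followed by a
second, encoded U+FEFF.

```python
import codecs

content = "\ufeffdata"  # text that is meant to start with U+FEFF itself

# With a signature: the BOM is prepended, so the stream holds U+FEFF twice.
with_bom = codecs.BOM_UTF8 + content.encode("utf-8")
# Without a signature: the content's own U+FEFF comes first in the stream.
without_bom = content.encode("utf-8")

# "utf-8-sig" removes a single leading signature, if one is present.
decoded_with = with_bom.decode("utf-8-sig")
decoded_without = without_bom.decode("utf-8-sig")

print(decoded_with == content)     # content survives the round trip
print(decoded_without == content)  # leading U+FEFF was taken for a signature
```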

But i don't see the big picture of all those markup standards; i
just keep them around in case my own work raises some questions.

 |Also, whether UTF-16 is one or two encodings is a question of definition. (Microsoft at one time defined it as two encodings.)
 |Leif Halvard Silli

Received on Sat Jul 28 2012 - 07:44:44 CDT
