Subject: Re: 32'nd bit & UTF-8

From: Martin Duerst (
Date: Mon Jan 24 2005 - 02:02:26 CST


    At 13:54 05/01/20, Peter Constable wrote:

    >As for whether plain text files can have a BOM, that is one of the few
    >unending debates that arise with certain (fortunately not too frequent)
    >regularity, each time with vociferous expressions of deeply-held beliefs
    >but never any resolution. I'll just observe that the formal grammar for
    >XML does not make reference to a BOM, yet the XML spec certainly assumes
    >that a well-formed XML document may begin with a UTF-8 BOM (or a BOM in
    >any Unicode encoding form/scheme). Rather than have a philosophical
    >debate about the definition of "plain text file", I suggest a more
    >pragmatic approach: for better or worse, plain text processes that
    >support UTF-8 are going to encounter UTF-8 data beginning with a BOM:
    >learn to live with it!

    Just for your reference, I'd like to point out the following
    historical facts:

    - The BOM isn't part of the XML grammar because the BOM was always
       required for UTF-16 (but not for things such as UTF-16BE and
       UTF-16LE, which were defined later).
    - When XML was first defined and issued as a recommendation (Feb 1998),
       nobody in the XML community as far as I know was thinking about
       a BOM for UTF-8. The first edition of the XML Recommendation didn't
       say anything about a BOM for UTF-8. Also, the early XML Parsers
       didn't accept BOMs in the case of UTF-8.
    - When Notepad started to use a BOM for UTF-8, the responsible Working
       Group went back and took the lack of any statement about a BOM for
       UTF-8 in the XML Recommendation to say that this could mean either
       that the BOM was allowed or it was not allowed, and clarified that
       the BOM was indeed allowed for UTF-8. Many parsers have in the meantime
       been upgraded.

    So the fact that XML allows a UTF-8 BOM cannot be taken as an indication
    of how 'good' the BOM for UTF-8 is, but it can certainly be taken as
    an indication of its practical occurrence.
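
    To illustrate what "living with" a UTF-8 BOM can look like in a plain
    text process, here is a minimal sketch in Python. The helper names
    strip_utf8_bom and decode_utf8_tolerant are made up for illustration
    and come from no particular parser; the sketch assumes the input is
    already known to be UTF-8 and only drops a single leading EF BB BF.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without one leading UTF-8 BOM (bytes EF BB BF), if present.

    Hypothetical helper for illustration; input without a BOM is returned as-is.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, accepting input with or without a leading BOM."""
    return strip_utf8_bom(data).decode("utf-8")


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF8 + b"<doc/>"
    without_bom = b"<doc/>"
    # Both inputs decode to the same text.
    print(decode_utf8_tolerant(with_bom) == decode_utf8_tolerant(without_bom))
```

    Python's built-in "utf-8-sig" codec does the same thing when decoding,
    so data.decode("utf-8-sig") is probably the simpler choice where it is
    available; the explicit version above only makes the byte check visible.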

    Regards, Martin.

    This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 19:27:27 CST