BOM in HTML (was Conformance (was UTF, BOM, etc))

From: Jon Hanna
Date: Sat Jan 22 2005 - 07:03:27 CST

    > As for the .htm, I have to admit I don't know what standards
    > say. Frankly, I don't care. Whatever they say, they might be
    > wrong. IMO, HTML files are plain text. Encoding issues are
    > covered by the directives. Encoding could even be switched
    > within that document. It already is. Up to the first
    > directive, the encoding is ASCII. At least I would define it
    > that way, don't know if it actually is. If the BOM is
    > allowed, it should only be valid (if at all) up until the
    > first directive. Opening a .htm file in text mode might then
    > be a pain.

    HTML files are documents whose encoding is generally stated out-of-band
    (they are after all primarily used on the web).

    HTML files can contain <meta /> elements that MAY be used to determine
    encoding in the absence of such out-of-band information. All the
    characters in a valid <meta /> element about the encoding are US-ASCII,
    and so are encoded identically in any ASCII-compatible encoding, including
    UTF-8, with which US-ASCII is forwards-compatible. (Strictly speaking, the
    <meta /> element with an http-equiv attribute gives instructions that a
    server may read to determine what HTTP headers it could send; browsers,
    editors and other user-agents MAY read these elements and act as if the
    appropriate header had been sent, but are free not to.)
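
    As a rough illustration of that precedence (out-of-band information
    first, the <meta /> element only as a fallback), here is a minimal Python
    sketch. The names sniff_meta_charset and choose_encoding, the regular
    expression and the 1024-byte limit are assumptions made for this example;
    they are not taken from any specification or browser.

```python
import re

# Hypothetical pattern: finds charset=... inside a <meta> tag. It works on raw
# bytes because the declaration is pure US-ASCII in any ASCII-compatible encoding.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(data, limit=1024):
    """Return the lower-cased charset named in a <meta> tag, or None."""
    match = META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii").lower() if match else None

def choose_encoding(http_charset, data):
    """Prefer out-of-band information; fall back to the <meta> element."""
    if http_charset:
        return http_charset.lower()
    return sniff_meta_charset(data)
```

    A user-agent that chooses to ignore the <meta /> element would simply
    skip the fallback branch.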

    The current version of HTML, and all foreseen future versions, are
    versions of XHTML, which is an XML application; hence XML declarations
    can be used in the absence of out-of-band information.

    Encoding *is not* switched in the middle of a document. However, the correct
    encoding may not be known until some of the document has been already
    processed, which may require some of it to be reprocessed. In the case of
    XML declarations, however, no characters are allowed to appear before the
    declaration in the document, so with them there will be no re-processing
    necessary, though it is possible that the declaration will trigger an
    error if previously processed characters (i.e. the start of the
    declaration itself and possibly a BOM in one of the UTF encodings) are
    invalid in the encoding in question.
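
    The error case can be sketched in Python as follows. This is a
    simplification under stated assumptions: it only recognises the UTF-8
    BOM, only handles ASCII-compatible encodings, and falls back to UTF-8
    when no declaration is present. The function name xml_encoding and the
    regular expression are hypothetical, chosen for this example.

```python
import codecs
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data (after any UTF-8 BOM has been skipped).
XML_DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def xml_encoding(data):
    """Return the normalised codec name, or raise ValueError on a conflict."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    match = XML_DECL_ENCODING.match(body)
    if match is None:
        return "utf-8"  # assumption: no declaration means UTF-8
    declared = codecs.lookup(match.group(1).decode("ascii")).name
    if has_bom and declared != "utf-8":
        # The bytes already read (the BOM) contradict the declaration.
        raise ValueError("UTF-8 BOM conflicts with declared " + declared)
    return declared
```

    Note that nothing is decoded twice here: the declaration is read as
    bytes, and the only failure mode is the conflict check.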

    XML encoded in UTF-8 may begin with a BOM (this was implied in the first
    drafts of the XML spec and later explicitly decided on). I don't know of any
    explicit mention of the BOM with regard to HTML 4.01 and earlier.
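
    For what it is worth, tolerating an optional UTF-8 BOM is cheap for a
    consumer. A minimal Python sketch, using the standard "utf-8-sig" codec
    (the function name decode_utf8_document is made up for this example):

```python
import codecs

def decode_utf8_document(data):
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # "utf-8-sig" skips the bytes EF BB BF at the start and otherwise
    # behaves like plain "utf-8".
    return data.decode("utf-8-sig")
```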

    Jon Hanna

    This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 07:08:59 CST