BOM in HTML (was Conformance (was UTF, BOM, etc))

From: Jon Hanna (jon@hackcraft.net)
Date: Sat Jan 22 2005 - 07:03:27 CST

Next message: Lars Kristan: "wchar_t (was RE: 32'nd bit & UTF-8)"

Previous message: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"
In reply to: Lars Kristan: "RE: Conformance (was UTF, BOM, etc)"
Next in thread: Jon Hanna: "RE: BOM in HTML (was Conformance (was UTF, BOM, etc))"
Maybe reply: Jon Hanna: "RE: BOM in HTML (was Conformance (was UTF, BOM, etc))"

> As for the .htm, I have to admit I don't know what standards
> say. Frankly, I don't care. Whatever they say, they might be
> wrong. IMO, HTML files are plain text. Encoding issues are
> covered by the directives. Encoding could even be switched
> within that document. It already is. Up to the first
> directive, the encoding is ASCII. At least I would define it
> that way, don't know if it actually is. If the BOM is
> allowed, it should only be valid (if at all) up until the
> first directive. Opening a .htm file in text mode might then
> be a pain.

HTML files are documents whose encoding is generally stated out-of-band
(they are after all primarily used on the web).

HTML files can contain <meta /> elements that MAY be used to determine
encoding in the absence of such out-of-band information. All the
characters in a valid <meta /> element about the encoding would be US-ASCII,
and so would be encoded identically in any ASCII-compatible encoding,
including UTF-8, of which US-ASCII is a subset. (Strictly speaking, the
<meta /> element with an http-equiv attribute gives instructions that a
server may read to determine what HTTP headers it could send; browsers,
editors and other user-agents MAY read these elements and act as if the
appropriate header had been sent, but are free not to.)
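To make the precedence concrete, here is a minimal sketch of how a user-agent might pick an encoding: the out-of-band value wins, and only in its absence are the leading bytes scanned for a <meta /> charset hint. The function name, the regular expression and the 1024-byte window are assumptions for illustration, not anything a specification mandates.

```python
import re
from typing import Optional

# Hypothetical pattern: finds charset=... inside a <meta ...> tag.
# It works on raw bytes because every character involved is US-ASCII.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE
)

def sniff_meta_charset(data: bytes,
                       out_of_band: Optional[str] = None,
                       window: int = 1024) -> Optional[str]:
    """Return an encoding label, preferring out-of-band information."""
    if out_of_band:
        # e.g. the charset parameter of an HTTP Content-Type header
        return out_of_band.lower()
    # Only look at the start of the document (assumed window size).
    match = META_CHARSET.search(data[:window])
    return match.group(1).decode("ascii").lower() if match else None
```

Called on a document whose head contains `<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />`, this returns "iso-8859-1" unless an out-of-band label is supplied. It only works when the document is in an ASCII-compatible encoding, which is exactly the assumption discussed above.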

The current version, and all foreseen future versions, of HTML is a version
of XHTML which is an XML application, and hence XML declarations can be used
in the absence of out-of-band information.

Encoding *is not* switched in the middle of a document. However, the correct
encoding may not be known until some of the document has been already
processed, which may require some of it to be reprocessed. In the case of
XML declarations, however, no characters are allowed to appear before the
declaration in the document, so with them there will be no reprocessing
necessary, though it is possible that the declaration will trigger an error
if previously processed characters (i.e. the start of the declaration itself
and possibly a BOM in one of the UTF encodings) are invalid in the encoding
in question.
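A rough sketch of that error case, assuming an ASCII-compatible document: the declaration is read from the very start of the byte stream (after an optional UTF-8 BOM), and the whole stream is then decoded with the declared encoding. The helper names and the simplified regular expression are hypothetical; a real XML parser follows the full detection procedure in the XML specification.

```python
import codecs
import re
from typing import Optional

# Simplified, assumed pattern for the encoding pseudo-attribute.
XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes) -> Optional[str]:
    """Read the encoding named in an XML declaration, if there is one."""
    # Nothing may precede the declaration except a BOM, so skip that.
    start = len(codecs.BOM_UTF8) if data.startswith(codecs.BOM_UTF8) else 0
    match = XML_DECL.match(data, start)
    return match.group(1).decode("ascii") if match else None

def decode_document(data: bytes, default: str = "utf-8") -> str:
    """Decode the whole document with the declared (or default) encoding."""
    encoding = declared_encoding(data) or default
    # Raises UnicodeDecodeError if bytes already seen (such as a BOM)
    # are invalid in the declared encoding.
    return data.decode(encoding)
```

A document that starts with a UTF-8 BOM but declares encoding="US-ASCII" fails here with UnicodeDecodeError, because the BOM bytes are not valid US-ASCII.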

XML encoded in UTF-8 may begin with a BOM (this was implied in the first
drafts of the XML spec and later explicitly decided on). I don't know of any
explicit mention of the BOM with regard to HTML 4.01 and earlier.
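For illustration, a UTF-8 BOM is just the three bytes EF BB BF at the start of the stream. The sketch below shows one way a consumer could notice and drop it before handing the rest to a parser; the function name is made up for this example, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of Python's standard library.

```python
import codecs
from typing import Tuple

def split_utf8_bom(data: bytes) -> Tuple[bool, bytes]:
    """Report whether data starts with a UTF-8 BOM and return the rest."""
    if data.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return True, data[len(codecs.BOM_UTF8):]
    return False, data

# The "utf-8-sig" codec drops a leading BOM while decoding; plain
# "utf-8" keeps it as the character U+FEFF.
with_bom = codecs.BOM_UTF8 + b"<a/>"
stripped_text = with_bom.decode("utf-8-sig")
kept_text = with_bom.decode("utf-8")
```

A consumer that decodes with plain "utf-8" and does not expect the BOM will see U+FEFF as the first character of the document, which is the usual source of trouble with BOM-prefixed HTML.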

Regards,
Jon Hanna
Work: <http://www.selkieweb.com/>
Play: <http://www.hackcraft.net/>
Chat: <irc://irc.freenode.net/selkie>

This archive was generated by hypermail 2.1.5 : Sat Jan 22 2005 - 07:08:59 CST