UTF-8 isn't the default for HTML (was: xkcd: LTR)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Wed, 28 Nov 2012 18:26:56 +0100

Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100:
> In this case, Firefox and IE should not even be able to render
> *any* XHTML page because it violates the HTML5 standard.

(1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html)
is (from a source code point of view) a pure XHTML page, and contains
no HTML-compatible methods for declaring the encoding. And therefore,
that page does indeed violate the HTML5 standard, with the result that
browsers are permitted to fall back to their built-in default encodings.
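
To make "HTML-compatible methods for declaring the encoding" concrete, here is a minimal Python sketch that looks for the three signals an HTML parser honours: a charset parameter in the HTTP Content-Type header, a byte order mark, and a meta element within the first 1024 bytes. The function name, the regular expressions and the order of the checks are illustrative assumptions, not the algorithm from the HTML5 specification. The XML prologue is deliberately ignored, just as HTML parsers ignore it.

```python
import re

# Simplified pattern: covers both <meta charset="..."> and the http-equiv
# form, since both carry "charset=" somewhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

BOMS = {
    b"\xef\xbb\xbf": "utf-8",
    b"\xff\xfe": "utf-16-le",
    b"\xfe\xff": "utf-16-be",
}


def html_declared_encoding(body, content_type=""):
    """Return the encoding declared in an HTML-compatible way, or None."""
    # 1. charset parameter of the HTTP Content-Type header
    m = re.search(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', content_type, re.IGNORECASE)
    if m:
        return m.group(1).lower()
    # 2. byte order mark at the very start of the document
    for bom, name in BOMS.items():
        if body.startswith(bom):
            return name
    # 3. <meta> declaration, which must sit within the first 1024 bytes
    m = META_CHARSET.search(body[:1024])
    if m:
        return m.group(1).decode("ascii").lower()
    # An XML prologue alone does not count for an HTML parser.
    return None
```

For a page that carries only an XML prologue and is served as plain 'text/html', the sketch returns None, which is the situation in which a browser falls back to its built-in default.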

(2) According to XML, the XML prologue can be deleted for UTF-8 encoded
pages. And when it is deleted/omitted, XML parsers assume that the page
is UTF-8 encoded. And if you try that (that is: if you *do* delete the
XML prologue from that page), then you will see that the Unicorn
validator will *continue* to stamp that Web page as error free. This is
because the Unicorn validator only considers the rules for XML - it
doesn't consider the rules of HTML.
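
The XML side of this can be observed with any conforming XML parser. The following is a small sketch using Python's standard xml.etree.ElementTree module; the sample string is an assumption chosen only because it contains non-ASCII letters.

```python
import xml.etree.ElementTree as ET

text = "<p>blåbær</p>"

# No XML prologue, UTF-8 bytes: the parser assumes UTF-8 and succeeds.
utf8_result = ET.fromstring(text.encode("utf-8")).text

# No XML prologue, ISO-8859-1 bytes: the bytes are not valid UTF-8,
# so the parser rejects the document rather than guessing.
try:
    ET.fromstring(text.encode("iso-8859-1"))
    latin1_rejected = False
except ET.ParseError:
    latin1_rejected = True

# The same ISO-8859-1 bytes are accepted once the prologue names the encoding.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + text.encode("iso-8859-1")
declared_result = ET.fromstring(declared).text
```

So for an XML parser, a missing prologue means UTF-8, and anything else is an error. An HTML parser applies neither rule.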

(4) Also, when you do delete the XML prologue, not only Firefox and IE
but even Safari will render the page in the "wrong" encoding.
However, Opera and Chrome will continue to render the page as UTF-8 due
to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera
and Chrome's behaviour is the way to go.
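
A rough idea of what UTF-8 sniffing amounts to, as a Python sketch: try to decode the bytes as strict UTF-8 and use a legacy default only if that fails. This is a hypothetical heuristic for illustration, not the actual detection code of Opera or Chrome; the function name and the windows-1252 fallback are assumptions.

```python
def sniff_encoding(body, fallback="windows-1252"):
    """Guess UTF-8 if the bytes decode as strict UTF-8, else use the fallback."""
    try:
        body.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8, so behave like a browser without sniffing
        # and use the built-in default encoding.
        return fallback
    return "utf-8"
```

The heuristic works as well as it does because text in a legacy 8-bit encoding very rarely happens to form valid UTF-8 byte sequences. Pure ASCII input is also reported as UTF-8, which is harmless since ASCII decodes identically in both encodings.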

(5) It is indeed backwards that the W3C Unicorn validator doesn't
inform its users when their pages fail to include an HTML-compatible
method for declaring the encoding. This suboptimal validation could
partly be related to libxml2, on which Unicorn is partly based: as it
turns out, the command line tool xmllint (which is part of libxml2)
behaves very much like Unicorn. It pays no respect to the fact that the
MIME type (the Content-Type:) is 'text/html' and not an XML MIME type.
In fact, when you do delete the XML prologue, Unicorn issues this
warning (you must click to make it visible): "No Character Encoding
Found! Falling back to UTF-8." That is quite a confusing message to
send, given that HTML parsers do not, as their last resort, fall back
to UTF-8.

-- 
leif halvard silli
Received on Wed Nov 28 2012 - 11:30:21 CST
