Re: UTF-8 isn't the default for HTML (was: xkcd: LTR)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 29 Nov 2012 10:11:13 +0100

So we would be in a situation where it is impossible to guarantee full
compatibility or interoperability between two concurrent standards from
the same standards body, while still promising the best interoperability
with "past" flavors of HTML (those flavors are not really in the "past",
given that two of them are not deprecated but still fully recommended,
while HTML5 itself still has "draft" status).

HTML5 would then contradict everything else, and only HTML5 would.

But I still think that the discriminating factor for HTML5 is its
exclusive (and **mandatory**) document type declaration: if it is absent
for any reason, there is absolutely no reason to continue using an HTML5
parser, and browsers must then either:
- fall back to another "legacy" parser, or
- use an HTML5 parser (if that is the only one available) working in a
more lenient mode, one that recognizes at least the XML prolog and the
legacy SGML document type declarations for HTML or XHTML, and that honors
the encoding given in the XML prolog when it is present.
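For reference, the three kinds of declarations under discussion look like
this (the legacy doctype shown is the W3C-published HTML 4.01 Strict form;
any of the other published HTML 4 or XHTML 1 doctypes would serve equally
as an example):

```html
<!-- HTML5: the only document type declaration HTML5 accepts -->
<!DOCTYPE html>

<!-- A legacy SGML document type declaration (HTML 4.01 Strict) -->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">

<!-- An XML prolog carrying an explicit encoding, as used by XHTML -->
<?xml version="1.0" encoding="UTF-8"?>
```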

This second option (more lenient parsing by the HTML5 parser) should be
documented and become part of this future standard (still not finalized).

In my opinion, the XML parser is definitely not a "legacy" parser: it is
present in all browsers and used by many services and applications. It is
even needed to support HTML5 in its XHTML serialization (which is
explicitly supported).
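As a minimal sketch, an HTML5 document in its XHTML serialization (which
would need to be served with an XML MIME type such as
application/xhtml+xml to reach the XML parser) looks like this:

```html
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>HTML5, XHTML serialization</title>
  </head>
  <body>
    <!-- Processed by the browser's XML parser, not its HTML parser -->
    <p>Well-formed XML carrying HTML5 content.</p>
  </body>
</html>
```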

For me, it is normal that the Unicorn validator does not yet integrate
HTML5, given its draft status. So there is still a separate HTML5
validator (itself running as a beta version, given the draft status of
HTML5) which cannot yet be integrated into Unicorn.

But given the huge amount of development already done on the web with
HTML5, it becomes urgent to fix these interoperability issues before the
final release of HTML5: the existing major browsers are already being
modified constantly to track the state of this draft, so it will not be
difficult for them to implement the missing interoperability rules, and
the sooner this is done, the sooner web designers will be guided. (And in
that case, the beta "nu" validator of the W3C could start being
integrated into Unicorn, which remains the best validator of them all;
"nu" cannot be trusted for now, and it does not even return a
"conformance logo" in its results, given that the conformance rules are
still not fully tested and specified in HTML5.)

HTML5 remains, for now, an important project, but it is still not a
standard in its own right.

2012/11/28 Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>

> Philippe Verdy, Wed, 28 Nov 2012 11:02:45 +0100:
> > In this case, Firefox and IE should not even be able to render
> > *any* XHTML page because it violates the HTML5 standard.
>
> (1) The page in question (http://www.xn--elqus623b.net/XKCD/1137.html)
> is (from a source code point of view) a pure XHTML page, and contains
> no HTML-compatible methods for declaring the encoding. And therefore,
> that page does indeed violate the HTML5 standard, with the result that
> browsers are permitted to fall back to their built-in default encodings.
>
> (2) According to XML, the XML prologue can be deleted for UTF-8 encoded
> pages. And when it is deleted/omitted, XML parsers assume that the page
> is UTF-8 encoded. And if you try that (that is: if you *do* delete the
> XML prologue from that page), then you will see that the Unicorn
> validator will *continue* to stamp that Web page as error free. This is
> because the Unicorn validator only considers the rules for XML - it
> doesn't consider the rules of HTML.
>
> (4) Also, when you do delete the XML prologue, then not only will
> Firefox and IE render the page in the "wrong" encoding, but so will
> Safari.
> However, Opera and Chrome will continue to render the page as UTF-8 due
> to the UTF-8 sniffing that they cleverly have built in. Clearly, Opera
> and Chrome's behaviour is the way to go.
>
> (5) It is indeed backwards that the W3C Unicorn validator doesn't
> inform its users when their pages fail to include an HTML-compatible
> method for declaring the encoding. This suboptimal validation could
> partly be related to libxml2, which Unicorn is partly based on. Because
> - as it turns out - the command line tool xmllint (which is part of
> libxml2) shows a very similar behaviour to that of Unicorn: It pays no
> respect to the fact that the MIME type (or Content-Type:) is
> 'text/html' and not an XML MIME type. In fact, when you do delete the
> XML prologue, Unicorn issues this warning (you must click to make it
> visible): "No Character Encoding Found! Falling back to UTF-8." Which
> is a quite confusing message to send, given that HTML parsers do not,
> as their last resort, fall back to UTF-8.
> --
> leif halvard silli
>
>
Received on Thu Nov 29 2012 - 03:14:04 CST

This archive was generated by hypermail 2.2.0 : Thu Nov 29 2012 - 03:14:05 CST