Re: xkcd: LTR from Philippe Verdy on 2012-11-28 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 28 Nov 2012 11:02:45 +0100

2012/11/28 Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>

> Philippe Verdy, Wed, 28 Nov 2012 04:50:06 +0100:
> >>> detects a violation of the required
> >>> extended "prolog" (sorry, the HTML5 document declaration, which is not
> a
> >>> valid "document declaration" for XHTML or for HTML4 or before or even
> for
> >>> SGML, due to the unspecified schema after the shema short name), it
> >>> should catch this exception to try another parser.
> >>
> >> There is no spec, that I am aware of, that says that it should do that.
> >
> > But this is in the scope of the HTML5 whose claimed purpose is to become
> > compatible with documents encoded in all previous flavors of HTML.
>
> I admit that understanding the meaning behind all the slogans about
> HTML5, can be be demanding. But the goal has all the time been to
> create a *single* HTML parser, and not to introduce switching between
> multiple HTML parsers. If you think otherwise, then my claim is that
> you have misunderstood.

In this case, Firefox and IE should not even be able to render *any* XHTML
page because it violates the HTML5 standard. It still attempts to recover
from it, recognizing a part of XHTML, but not an essential one : its very
basic XML prolog (and the XHTML document declaration), up to the point
where they start seeing the root element.

But then how can they claim supporting XHTML when they don't (and when the
XHTML syntax is still part of HTML5, which makes affirmations like yours –
"not to introduce switching between multiple HTML parsers" – very weak and
difficult to defend).

If the intent is to be able to parse all flavors of HTML (at least a basic
profile of them) with the same parser, then a behavior must be standardized
in HTML5 to correctly handle the possible presence of XML prologs and
standard SGML document declarations even if their contents are skipped and
ignored (notably here when it is used to specify that the document is
effectively encoded in UTF-8 and not cp1252, both encodings being supported
by HTML5 but without other compatibility problems when it is UTF-8)

But ignoring XHTML document declarations will have an impact on
compatibility, if there's an external or internal DTD and this should be
documented in HTML5 by limting the claims of compatibility (and suggesting
then another recovery mechanism for these unsupported parts of XHTML, using
a true XML parser in case of violation of the required HTML5 DOCTYPE
declaration).

For now this breaks the interoperability with the basic profile of XHTML,
more or less compatible with HTML4 including the deprecated elements (but
without the modular extension design, and without support of XML
namespaces).

Now the argument saying that "meta" elements may be used in the HTML
document header (to replace missing HTTP MIME headers), this contradicts
all what was done to deprecate this meta element usage before. And here
also, HTML5 is not clear about this change of position. And "meta" elements
won't make HTML parsers simpler to implement : they will need to reparse
the document from the beginning.

The XML prolog of XHTML is much simpler to parse than the meta element, and
can be parsed directly by the HTML5-only parser, which can as well as
accept at least the XHTML1 document declaration (without internal DTD) as
acceptable for this HTML5 parser (it should fail however if there's an
internal DTD or if the SGML catalog name is not one of those for HTML or
XHTML; it should just check the SGML catalog name partly ignoring the
flavor precision in its name, as there is no internal or external DTD
supported in HTML4 or lower; it should silently ignore the URL for an
external DTD in XHTML, including when XHTML is used as the alternate
serialization syntax for HTML5, even if this will cause some defined
entities defined in the external DTD not being replaced, but the result of
the HTML5 parser with undefined entities or entities defined differently
will be unpredictable).

If an implementation can support both parsers, the more compatible recovery
mode will be to use the XML parser, instead of using this simple heuristic.

Browsers already support multiple text-encoded document parsers, including
for HTML5 (Javascript, JSON, CSS, SVG/XML, P3P, URI...), plus binary
parsers for various media codecs (PNG, GIF, JPEG, WAV, ICO, OpenPDF...) if
they can embed them instead of using OS-supported codecs or plugins (MPEG,
Ogg...), and data codecs (compressors, encryptors,, archive formats... for
transport and security protocol layers referenced in URI schemes). What
else ?

In all popular browsers, the XML parser is still present, since long now,
to support XML requests (and lots of GUI or configuration features, such as
XUL in Firefox, VML in IE, external SVG images, local DB stores, support
library for third-party addons...), even if JSON requests are highly
preferred now, sometimes more secure, but much simpler and faster to parse
(and more compact in their serialization).
Received on Wed Nov 28 2012 - 04:06:58 CST

This archive was generated by hypermail 2.2.0 : Wed Nov 28 2012 - 04:06:59 CST