Re: pre-HTML5 and the BOM from Philippe Verdy on 2012-07-16 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 17 Jul 2012 03:40:37 +0200

2012/7/16 Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>:
> <html> element, then Chrome will sniff it as UTF-8 encoded. Whereas IE,
> Webkit, Opera, Firefox will default to ISO-8858-1/Windows-1252.

Actually ISO 885**9**-1. But we've also been told that, since the C1
controls are simply invalid in HTML, a page labelled ISO-8859-1 will
be interpreted as Windows-1252 anyway. A few byte values remain
unassigned in Windows-1252 and are therefore invalid; if they are
found, the HTML parser may try other encodings, but not UTF-8, where
these bytes would be invalid too and could just as well raise
exceptions. Most of these exceptions, however, will just be remapped
to the U+FFFD replacement character.
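
To make the relabelling concrete, here is a rough Python sketch. The
helper names are hypothetical, and Python's "cp1252" codec is only a
stand-in for what a browser does: it follows the Microsoft table, which
leaves five bytes unassigned, whereas the WHATWG Encoding Standard maps
those same bytes to the corresponding C1 controls.

```python
def unassigned_cp1252_bytes():
    """Return the bytes in 0x80-0x9F that Python's cp1252 codec cannot decode."""
    missing = []
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def decode_with_relabelling(data, label):
    """Hypothetical helper: treat an ISO-8859-1 label as Windows-1252.

    Bytes with no mapping become U+FFFD instead of raising an exception.
    """
    relabelled = {"iso-8859-1", "iso8859-1", "latin1"}
    codec = "cp1252" if label.lower() in relabelled else label
    return data.decode(codec, errors="replace")


if __name__ == "__main__":
    print([hex(b) for b in unassigned_cp1252_bytes()])
    # 0x80 is a C1 control in ISO-8859-1 but the euro sign in Windows-1252.
    print(repr(decode_with_relabelling(b"\x80", "ISO-8859-1")))
    # 0x81 is unassigned in this codec, so it is replaced by U+FFFD.
    print(repr(decode_with_relabelling(b"\x81", "ISO-8859-1")))
```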

The support of legacy encodings is now more restrictive in HTML5,
which only supports UTF-8 and Windows-1252 plus a few other encodings
(ASCII is now considered an alias of Windows-1252, also for
compatibility reasons, even though strict US-ASCII resources could be
interpreted without changes as UTF-8) and requires an explicit
encoding declaration: sniffing no longer works for anything other than
UTF-8, whose leading BOM is interpreted as a data signature and not as
a character.
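
The BOM handling described above could be sketched as follows in
Python. This is only an illustration under stated assumptions: the
function name and the "declared" parameter are hypothetical, only the
UTF-8 BOM case mentioned here is covered, and the fallback to "cp1252"
stands in for the Windows-1252 default.

```python
import codecs


def decode_html_bytes(data, declared=None):
    """Hypothetical helper returning (encoding_used, text).

    A leading UTF-8 BOM is treated as a signature: it selects UTF-8 and
    is dropped, so it never appears in the text as U+FEFF. Otherwise the
    declared encoding is used, or Windows-1252 when none is given.
    """
    if data.startswith(codecs.BOM_UTF8):
        body = data[len(codecs.BOM_UTF8):]
        return "utf-8", body.decode("utf-8", errors="replace")
    codec = declared or "cp1252"
    return codec, data.decode(codec, errors="replace")


if __name__ == "__main__":
    utf8_e_acute = "\u00e9".encode("utf-8")
    # With a BOM, the bytes are decoded as UTF-8 without any declaration.
    print(decode_html_bytes(codecs.BOM_UTF8 + utf8_e_acute))
    # Without a BOM or declaration, the same bytes are read as Windows-1252.
    print(decode_html_bytes(utf8_e_acute))
```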
Received on Mon Jul 16 2012 - 20:44:59 CDT