Re: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-13 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Fri, 13 Jul 2012 13:44:42 +0200

"Martin J. Dürst", Fri, 13 Jul 2012 18:17:05 +0900:
> On 2012/07/13 0:12, Leif Halvard Silli wrote:
>> Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600:
>>
>>> and people who want to create or modify UTF-8 files which will
>>> be consumed by a process that is intolerant of the signature
>>> should not use Notepad. That goes for HTML (pre-5) pages [snip]
>>
>> HTML5-parsers MUST support UTF-8. They do not need to support any other
>> encoding. Pre-HTML5-parsers are not required to support the UTF-8
>> encoding - or any other particular encoding.
>
> Up to here, that's indeed what the spec says, except for XHTML, which
> is XML and therefore includes UTF-8 (and UTF-16) support, but my
> guess is that you didn't include this.

Right. I meant pre-HTML5 HTML as text/html. Not pre-HTML5 HTML as XML.

>> But when they do support
>> the UTF-8 encoding, they are, however, not permitted to be 'intolerant'
>> of the BOM.
>
> Where does it say so?

What is 'it'?

HTML5 tells how UAs should use the BOM to decide the encoding. By
pre-HTML5, I meant the 'text/html' MIME space, though I gave much
weight to HTML4 ...
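
To make the BOM-based detection concrete, here is a minimal sketch in
Python of how a user agent could check the start of a byte stream for a
signature. It is not taken from any spec text or real parser; the name
sniff_bom and the table _BOMS are made up for illustration, and only
the UTF-8 and UTF-16 signatures are considered (UTF-32 is deliberately
left out).

```python
import codecs

# Known byte signatures. The table and function names are hypothetical.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would skip the first bom_length bytes and decode the rest with
the returned encoding, falling back to other means (HTTP header, meta
element) when the result is (None, 0).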

I see that HTML4 for UTF-8 points to RFC2279,[1] which was silent about
the UTF-8 BOM. Only with RFC3629, from 2003, is the UTF-8 BOM
described.[3] As for XML 1.0, revision 2 from 2000 appears to be the
first version of the XML spec to describe the UTF-8 BOM.[4] The Appendix
C 'profile' of XHTML 1.0 - which was issued in 2000 and revised in 2002
- is also part of the text/html MIME registration of June 2000.[5] The
MIME registration contains a general statement that UTF-8 is preferred,
but does not talk about the UTF-8 BOM. XHTML 1.0 itself, strangely
enough, does not reflect much on XML's default encoding(s) with regard
to serving XHTML as text/html.[6] It does, though, say in appendix C: [7]
"Remember, however, that when the XML declaration is not included in a
document, the document can only use the default character encodings
UTF-8 or UTF-16." Here it does sound as if XHTML, even when served
according to appendix C, should subject itself to XML's encoding rules.

So, given the age of the documents, neither HTML4 from 1999 nor the
'text/html' MIME registration permits anyone to be 'intolerant' of the
UTF-8 BOM, but neither do they permit anyone to be 'tolerant' of it.
They are silent on the issue.
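
As an illustration of what 'tolerant' and 'intolerant' mean in practice,
the following Python sketch decodes the same bytes in two ways. The
function names are hypothetical; the codec names "utf-8" and
"utf-8-sig" are the ones in Python's standard library. This only shows
the two behaviours and says nothing about what any given browser does.

```python
def decode_tolerant(data: bytes) -> str:
    # "utf-8-sig" drops one leading EF BB BF if present and otherwise
    # behaves like plain "utf-8".
    return data.decode("utf-8-sig")


def decode_intolerant(data: bytes) -> str:
    # Plain "utf-8" keeps the signature, so the text starts with the
    # character U+FEFF and a consumer may treat it as content.
    return data.decode("utf-8")
```

With input b"\xef\xbb\xbf<!DOCTYPE html>", the tolerant variant yields a
string starting with "<!DOCTYPE", while the intolerant one yields a
string whose first character is U+FEFF.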

RFC3629 says that protocols may restrict usage of the BOM as a
signature.[3] However, text/html does not offer any such
restrictions. If one sees HTML4 as being as tied to RFC2279 as XML, up
to and including the 4th revision, was tied to specific versions of
Unicode, then this has not changed. But would it not be natural to
consider that text/html user agents currently have to treat RFC3629 as
more normative than RFC2279? I do at least not think that user agents
that want to be conforming pre-HTML5 user agents have any justification
for ignoring the BOM.

[1] http://www.w3.org/TR/html401/appendix/notes#h-B.2.1
[2] http://tools.ietf.org/html/rfc2279
[3] http://tools.ietf.org/html/rfc3629#section-6
[4] http://www.w3.org/TR/2000/WD-xml-2e-20000814
[5] http://tools.ietf.org/html/rfc2854
[6] http://www.w3.org/TR/xhtml1/#C_9
[7] http://www.w3.org/TR/xhtml1/#C_1

>> Thus there is nothing special with regard to the UTF-8 BOM and
>> pre-HTML5 HTML.

-- 
Leif Halvard Silli

Received on Fri Jul 13 2012 - 06:46:52 CDT
