Re: pre-HTML5 and the BOM from Martin J. Dürst on 2012-07-17 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Tue, 17 Jul 2012 19:02:01 +0900

Hello Leif,

Sorry to be late with my answer.

On 2012/07/13 20:44, Leif Halvard Silli wrote:
> "Martin J. Dürst", Fri, 13 Jul 2012 18:17:05 +0900:
>> On 2012/07/13 0:12, Leif Halvard Silli wrote:
>>> Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600:
>>>
>>>> and people who want to create or modify UTF-8 files which will
>>>> be consumed by a process that is intolerant of the signature
>>>> should not use Notepad. That goes for HTML (pre-5) pages [snip]
>>>
>>> HTML5-parsers MUST support UTF-8. They do not need to support any other
>>> encoding. Pre-HTML5-parsers are not required to support the UTF-8
>>> encoding - or any other particular encoding.
>>
>> Up to here, that's indeed what the spec says, except for XHTML, which
>> is XML and therefore includes UTF-8 (and UTF-16) support, but my
>> guess is that you didn't include this.
>
> Right. I meant pre-HTML5 HTML as text/html. Not pre-HTML5 HTML as XML.
>
>>> But when they do support
>>> the UTF-8 encoding, they are, however, not permitted to be 'intolerant'
>>> of the BOM.
>>
>> Where does it say so?
>
> What is 'it'?

That pre-HTML5 (as text/html) browsers are not permitted to be
'intolerant' of the BOM.

> HTML5 tells how UAs should use BOM to decide the encoding. By
> pre-HTML5, I meant the 'text/html' MIME space, though I gave much
> weight to HTML4 ...
>
> I see that HTML4 for UTF-8 points to RFC2279,[1] which was silent about
> the UTF-8 BOM. Only with RFC3629 from 2003, is the UTF-8 BOM
> described.[3]

Yes, exactly. In the RFC 2070 and HTML4 time-frame, nobody that I know
was thinking about a BOM for UTF-8. Only later did BOMs start to turn up
at the start of HTML4 documents, and browser makers were surprised.
Roughly the same happened for XML: early XML parsers didn't handle the BOM.

When Windows Notepad started to use the BOM to distinguish between UTF-8
and "ANSI" (the local system legacy encoding), this BOM leaked into
HTML, and was difficult to stop. So the XML spec got updated, and parsers
started to get updated, too.
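
To illustrate what "handling the BOM" amounts to for a consumer of UTF-8, here is a minimal Python sketch. It is not taken from any browser or XML parser; the function name decode_utf8_tolerant is hypothetical, and the only assumption is that the input is UTF-8 that may or may not start with the signature bytes EF BB BF.

```python
import codecs

def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM (EF BB BF) if present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

# The kind of file Notepad writes: signature first, then the content.
with_bom = codecs.BOM_UTF8 + "<html>".encode("utf-8")
without_bom = "<html>".encode("utf-8")

tolerant = decode_utf8_tolerant(with_bom)   # signature removed
intolerant = with_bom.decode("utf-8")       # U+FEFF kept as first character
```

A consumer that decodes without this check sees U+FEFF in front of the first tag, which is the kind of surprise described above. Python's built-in "utf-8-sig" codec performs the same stripping on decode.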

> As for XML 1.0, then revision 2 from year 2000 appears to
> be the first time the XML spec describes the UTF-8 BOM.[4] The Appendix
> C 'profile' of XHTML 1.0 - which was issued year 2000 and revised 2002
> - is also part of the text/html MIME registration of June 2000.[5] The
> MIME contains a general quote of UTF-8 as preferred, but does not talk
> about the UTF-8 BOM. XHTML 1.0 itself, strangely enough, does not reflect
> much on whether XML's default encoding(s) apply when serving XHTML
> as text/html.[6] It does, though, actually say in appendix C: [7]
> "Remember, however, that when the XML declaration is not included in a
> document, the document can only use the default character encodings
> UTF-8 or UTF-16." Here it does sound as if XHTML, even when served
> according to appendix C, should subject itself to XML's encoding rules.
>
> So, given the age of the documents, neither HTML4 from 1999 nor the
> 'text/html' MIME registration permits anyone to be 'intolerant' of
> the UTF-8 BOM, but neither do they permit anyone to be 'tolerant' of
> it. They are silent on the issue.

You read silence as not taking sides, which makes sense from your
viewpoint. But knowing what implementations did (in a pre-1999
time-frame), I'd say the idea of a UTF-8 BOM just didn't really exist,
so nobody thought about mentioning it.

Regards, Martin.

> RFC3629 says that protocols may restrict usage of the BOM as a
> signature.[3] However, text/html does not offer any such
> restrictions. If one sees HTML4 as being as tied to RFC2279 as XML, up
> until and including the 4th revision, was tied to specific versions of
> Unicode, then this has not changed. But would it not be natural to
> consider that text/html user agents currently have to consider RFC3629
> as more normative than RFC2279? I do at least not think that user agents
> that want to be conforming pre-HTML5 user agents have any justification
> for ignoring the BOM.
>
> [1] http://www.w3.org/TR/html401/appendix/notes#h-B.2.1
> [2] http://tools.ietf.org/html/rfc2279
> [3] http://tools.ietf.org/html/rfc3629#section-6
> [4] http://www.w3.org/TR/2000/WD-xml-2e-20000814
> [5] http://tools.ietf.org/html/rfc2854
> [6] http://www.w3.org/TR/xhtml1/#C_9
> [7] http://www.w3.org/TR/xhtml1/#C_1
>
>>> Thus there is nothing special with regard to the UTF-8 BOM and
>>> pre-HTML5 HTML.
Received on Tue Jul 17 2012 - 05:03:49 CDT
