Re: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-17 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Tue, 17 Jul 2012 21:35:22 +0200

Hi Martin,

"Martin J. Dürst", Tue, 17 Jul 2012 19:02:01 +0900:
> On 2012/07/13 20:44, Leif Halvard Silli wrote:
>> "Martin J. Dürst", Fri, 13 Jul 2012 18:17:05 +0900:
>>> On 2012/07/13 0:12, Leif Halvard Silli wrote:
>>>> Doug Ewell, Wed, 11 Jul 2012 09:12:46 -0600:

>>>> HTML5-parsers MUST support UTF-8. They do not need to support
>>>> any other encoding.

Error: I should have said "MUST support UTF-8 _and Windows-1252_".

… snip …
>>>> But when they do support the UTF-8 encoding, they are,
>>>> however, not permitted to be 'intolerant' of the BOM.
>>>
>>> Where does it say so?
… snip …
>> I see that HTML4 for UTF-8 points to RFC2279,[1] which was silent about
>> the UTF-8 BOM. Only with RFC3629 from 2003, is the UTF-8 BOM
>> described.[3]
>
> Yes exactly. In the RFC 2070 and HTML4 time-frame, nobody that I know
> was thinking about a BOM for UTF-8. Only later BOMs at the start of
> HTML4 started to turn up, and browser makers were surprised. Roughly
> the same happened for XML. Early XML parsers didn't handle the BOM.

The UTF-8 BOM can be beneficial to XML parsers too, though arguably
mostly because of bugs in their XML implementations. Here are two test
files that show it. I recommend trying them in Safari/Chrome/WebKit; I
am not sure whether IE9/IE10 have similar issues.

http://malform.no/testing/html5/bom/frame8
http://malform.no/testing/html5/bom/frame9

> When Windows notepad started to use the BOM to distinguish between
> UTF-8 and "ANSI" (the local system legacy encoding), this BOM leaked
> into HTML, and was difficult to stop. So XML got updated, and parsers
> started to get updated, too.

There are still some gotchas when it comes to how the (UTF-8) BOM is
interpreted in XML parsers. The XML test suite has very few test cases
that include the BOM; maybe that is part of the reason why.

But is Windows Notepad really to blame? OK, it was leading the way.
But can we think of something that could have worked "better" in
practice? And, no, I don't mean 'better' as in 'not leaking the BOM into
HTML'. I mean 'better' as in 'spreading UTF-8 to the masses'.

… snip …
>> So, given the age of the documents, neither HTML4 from 1999 nor the
>> 'text/html' MIME registration, does not permit anyone to be
>> 'intolerant' of the UTF-8 BOM, but neither does it permit anyone to be
>> 'tolerant' of it. It is silent on the issue.
>
> You read silence as not taking sides, which makes sense from your
> viewpoint. Knowing what implementations did (in a pre-1999
> time-frame), the idea of UTF-8 BOM just didn't really exist, so
> nobody thought about mentioning it.

It is interesting to think about this history, and about the fact that
it was unrealized. Maybe _that_ is due to the fact that, back then, one
saw XML as the way forward, which meant that there was not the same
need for the UTF-8 BOM, since XML defaults to UTF-8.

However, I think there are two ways to interpret "Pre-HTML5": as
historic, meaning about 1998, or as current, about choices made today:
'this browser is fully dedicated to HTML4 but does not intend to
implement HTML5'. Pointing to HTML4 to justify a lack of BOM support
would be a very thin excuse.

-- 
Leif H Silli

Received on Tue Jul 17 2012 - 14:38:18 CDT
