Re: pre-HTML5 and the BOM from Martin J. Dürst on 2012-07-17 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Wed, 18 Jul 2012 10:05:40 +0900

Hello Leif,

On 2012/07/18 4:35, Leif Halvard Silli wrote:

> But is the Windows Notepad really to blame?

Pretty much so. There may have been other products from Microsoft that
also did it, but with respect to forcing browsers and XML parsers to
accept a UTF-8 BOM as a signature, Notepad was definitely the main
cause, by far.

> OK, it was leading the way.
> But can we think of something that could have worked "better", in
> praxis? And, no, I don't mean 'better' as in 'not leaking the BOM into
> HTML'. I mean 'better' as in 'spreading the UTF-8 to the masses'.

UTF-8 is easy and cheap to detect heuristically. It takes a bit more
work to scan the whole file than to just look at the first few bytes,
but then I don't think anybody is/was editing 1MB files in Notepad. So
the BOM/signature is definitely not the reason that UTF-8 spread on the
Web and elsewhere.
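
As a rough illustration of how cheap such a heuristic can be, here is a
minimal Python sketch. The function name classify_bytes and its three
return labels are made up for this example; it simply relies on the fact
that text in a legacy 8-bit encoding very rarely happens to form valid
UTF-8 multi-byte sequences.

```python
def classify_bytes(data: bytes) -> str:
    """Guess whether data is plain ASCII, UTF-8, or something else.

    Hypothetical helper for illustration only. It scans the whole
    buffer rather than looking for a BOM/signature at the start.
    """
    try:
        # Python's strict decoder rejects stray continuation bytes,
        # overlong forms, and truncated sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    # Valid UTF-8 with no byte >= 0x80 is indistinguishable from ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    return "utf-8"
```

For example, the bytes of "Dürst" encoded as UTF-8 come back as "utf-8",
while the same name encoded as ISO-8859-1 comes back as "other", because
a lone 0xFC byte is not a valid UTF-8 sequence.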

The spread of UTF-8 is due to its strict US-ASCII compatibility. Every
US-ASCII character/byte represents the same character, and only that
character, in UTF-8. A plain ASCII file is a UTF-8 file. If
syntax-significant characters are ASCII, then (close to) nothing may
need to change when moving from a legacy encoding to UTF-8. On top of
that, character synchronization is very easy because leading bytes and
trailing bytes have strictly separate values. From that viewpoint, the
BOM is a problem rather than a solution.
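
To make the synchronization point concrete, here is a small Python
sketch. The helper names is_trailing_byte and char_start are invented for
this example. It assumes well-formed UTF-8 input and uses only the fact
that trailing bytes always have the bit pattern 10xxxxxx (0x80 to 0xBF),
which no ASCII byte and no leading byte ever has.

```python
def is_trailing_byte(byte: int) -> bool:
    """True for UTF-8 trailing (continuation) bytes, i.e. 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Return the offset of the first byte of the character covering index.

    Hypothetical helper: starting from an arbitrary byte offset, step
    backwards over trailing bytes until an ASCII or leading byte is found.
    """
    while index > 0 and is_trailing_byte(data[index]):
        index -= 1
    return index


# "a", "ü" (2 bytes), "b", "€" (3 bytes) gives 7 bytes in total.
sample = "aüb€".encode("utf-8")
```

With this sample, char_start(sample, 6) lands on offset 4, the leading
byte of the euro sign, without any state carried over from earlier in
the buffer. No signature at the start of the file is involved.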

> … snip …
>>> So, given the age of the documents, neither HTML4 from 1999 nor the
>>> 'text/html' MIME registration permits anyone to be 'intolerant' of
>>> the UTF-8 BOM, but neither do they permit anyone to be 'tolerant' of
>>> it. They are silent on the issue.
>>
>> You read silence as not taking sides, which makes sense from your
>> viewpoint. Knowing what implementations did (in a pre-1999
>> time-frame), the idea of UTF-8 BOM just didn't really exist, so
>> nobody thought about mentioning it.
>
> It is interesting to think about this history. And the fact that it was
> unrealized. Maybe _that_ is due to the fact that, back then, one
> saw XML as the way forward - which meant that there was not the same
> need for the UTF-8 BOM due to XML's default to UTF-8.
>
> However, I think there are two ways to interpret "Pre-HTML5": Historic,
> about 1998. Or current about choices today: 'this browser is fully
> dedicated to HTML4 but does not intend to implement HTML5'. Pointing to
> HTML4 for lack of BOM implementation, would be a very thin excuse.

I think that a browser fully dedicated to HTML4 but not intending to
implement HTML5 will eventually die out. If such a browser exists today,
it would indeed be reasonable for it to accept the BOM. But that's not
because reading the spec(s) leads to that as the only conclusion, it's
because there's content out there that starts with a BOM.
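
Accepting the BOM in this sense costs very little. As a sketch only (the
function name strip_utf8_bom is made up, and this is not taken from any
particular browser or parser), a consumer can drop the three signature
bytes EF BB BF before handing the content on:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM (EF BB BF) if present; otherwise no-op.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Python's built-in "utf-8-sig" codec does the same thing while decoding,
so b"\xef\xbb\xbf<html>".decode("utf-8-sig") yields "<html>".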

Regards, Martin.
Received on Tue Jul 17 2012 - 20:13:04 CDT
