Re: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-16 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Mon, 16 Jul 2012 17:06:57 +0200

Doug Ewell, Sat, 14 Jul 2012 15:14:10 -0600:
> Philippe Verdy wrote:
>
>> It would break if the only place where to place a BOM is just the
>> start of a file. But as I propose, we allow BOMs to occur anywhere to
>> specify which encoding to use to decode what follows each one, even
>> shell scripts would work [ snip ]

> U+FEFF is specifically defined as having the BOM semantic only when
> it appears at the beginning of the file or stream. Everywhere else,
> it can have only the ZWNBSP semantic.

True. That said: of the Web browsers in current use, Chrome is the very
best (read: most aggressive) at UTF-8 sniffing. The others hardly sniff
for anything but the BOM. For example, if you make a UTF-8 encoded page
which contains nothing but ASCII - except for a U+FEFF character (or any
other non-ASCII character) inside the class="" attribute of e.g. the
<html> element - then Chrome will sniff it as UTF-8 encoded, whereas IE,
WebKit, Opera and Firefox will default to ISO-8859-1/Windows-1252.
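
To make the difference concrete, here is a minimal Python sketch of the
two strategies. It is not Chrome's (or any browser's) actual algorithm;
the function names and the sniff_content switch are made up for
illustration. The only assumption is that "content sniffing" means: the
bytes contain at least one non-ASCII byte and are valid UTF-8.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    # BOM check: the bytes EF BB BF at the very start of the stream.
    return data.startswith(codecs.BOM_UTF8)


def looks_like_utf8(data: bytes) -> bool:
    # Hypothetical content sniff: at least one non-ASCII byte is present,
    # and the whole byte sequence decodes as strict UTF-8.
    if data.isascii():
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_encoding(data: bytes, sniff_content: bool) -> str:
    # sniff_content=False models a "BOM only" browser,
    # sniff_content=True models an aggressive sniffer.
    if has_utf8_bom(data):
        return "utf-8"
    if sniff_content and looks_like_utf8(data):
        return "utf-8"
    return "windows-1252"  # the legacy default mentioned above


# ASCII-only page, except for U+FEFF inside the class attribute.
page = '<html class="\ufeff"><title>test</title></html>'.encode("utf-8")

aggressive = guess_encoding(page, sniff_content=True)
bom_only = guess_encoding(page, sniff_content=False)
```

With this test page, the aggressive variant reports utf-8 and the
BOM-only variant falls back to windows-1252, since the U+FEFF is not
at the start of the byte stream.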

So, in a way, the ZWNBSP - or any other non-ASCII character (it would
in fact be better to use U+200B, to reserve U+FEFF for its designated
BOM purpose) - could serve as a UTF-8 "sniff character" not only when
it is the first character of the document, but also elsewhere in the
document. And this already happens ...
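
A small Python sketch of what such a "sniff character" looks like at
the byte level, and of why U+FEFF is special only at the start. The
helper name with_sniff_char is made up for illustration; the codecs
used ("utf-8", "utf-8-sig", "windows-1252") are standard Python ones.

```python
ZWSP = "\u200b"    # ZERO WIDTH SPACE, the suggested sniff character
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE, doubles as the BOM


def with_sniff_char(ch: str) -> bytes:
    # Hypothetical helper: an otherwise ASCII-only page with one
    # non-ASCII character hidden in the class attribute.
    return f'<html class="{ch}"><p>plain ASCII text</p></html>'.encode("utf-8")


doc = with_sniff_char(ZWSP)

# Decoded as UTF-8, the invisible character survives intact.
as_utf8 = doc.decode("utf-8")

# Decoded with the legacy default, its three bytes (E2 80 8B) turn
# into three visible junk characters.
as_cp1252 = doc.decode("windows-1252")

# "utf-8-sig" treats U+FEFF as a BOM only at the very start of the
# data; anywhere else it is kept as an ordinary character.
leading = (ZWNBSP + "abc").encode("utf-8").decode("utf-8-sig")
inner = ("a" + ZWNBSP + "bc").encode("utf-8").decode("utf-8-sig")
```

Here leading comes out as plain "abc", while inner still contains the
U+FEFF between "a" and "bc".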

(Maybe we see here a reflection of how Chrome is colored by its
owner's role as a giant social media content producer/facilitator,
whereas the other browser vendors are too stuck in their
back-compatibility mantra.)

-- 
Leif Halvard Silli

Received on Mon Jul 16 2012 - 14:58:25 CDT