Re: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-17 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Tue, 17 Jul 2012 16:11:46 +0200

"Martin J. Dürst", Tue, 17 Jul 2012 18:49:47 +0900:
> On 2012/07/17 17:22, Leif Halvard Silli wrote:
>
>> And an argument was put forward on the WHATWG mailing list
>> earlier this year/end of previous year, that a page with strict ASCII
>> characters inside could still contain character entities/references for
>> characters outside ASCII.
>
> Of course they can. That's the whole point of using numeric character
> references. I'm rather surprised that this was even discussed in the
> context of HTML5.

And the question was whether such a page should default to be seen as
UTF-8 encoded.

>> For instance, early on in 'the Web', some
>> appeared to think that all non-ASCII had to be represented as entities.
>
> Yes indeed. There's still some such stuff around. It's mostly
> unnecessary, but it doesn't hurt.

Actually, above I described an example where it did hurt ... At least,
if the goal is that pages are interpreted as UTF-8.

I have discovered one browser where it does hurt more directly: In W3M,
the text browser, which is also included in Emacs. W3M doesn't handle
(all) entities. E.g. it renders character references for 'å' as 'aa'
instead of as 'å'.

So it seems to me that it is always advantageous to type characters
directly as doing so allows for better character encoding detection in
case the encoding labels disappear (read: easier to pick up that the
page is UTF-8 encoded) and also works better in at least one browser.
It does, as well, make authors more aware of the entire encoding issue
since it means that the page has to be properly labeled in order to
work across parsers.
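
To illustrate the detection point, here is a rough sketch in Python. It
is not what any browser actually does: the sniff() helper and the sample
strings are made up for this example, and real encoding detection is
considerably more involved. It only shows that a page using character
references is pure ASCII on the wire, so the bytes carry no hint of
UTF-8, while directly typed characters do.

```python
import html


def sniff(data: bytes) -> str:
    """Hypothetical helper: what do the raw bytes alone reveal?"""
    if data.isascii():
        # Nothing above 0x7F, so any ASCII-compatible encoding fits.
        return "ascii-only"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    # Non-ASCII bytes that form valid UTF-8 sequences are a strong hint.
    return "utf-8"


# Illustrative sample pages (assumed, not taken from any real site).
with_references = "<p>bl&aring;b&aelig;r</p>".encode("utf-8")
typed_directly = "<p>blåbær</p>".encode("utf-8")
legacy_bytes = "<p>blåbær</p>".encode("iso-8859-1")

for label, page in [
    ("references", with_references),
    ("typed as UTF-8", typed_directly),
    ("typed as ISO-8859-1", legacy_bytes),
]:
    print(label, "->", sniff(page))

# A parser that handles the references recovers the same text either way.
decoded = html.unescape(with_references.decode("ascii"))
```

If the encoding label is lost, only the second page still identifies
itself as UTF-8. The first one gives a sniffer nothing to go on.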

-- 
Leif Halvard Silli

Received on Tue Jul 17 2012 - 09:18:08 CDT
