Re: pre-HTML5 and the BOM from Martin J. Dürst on 2012-07-17 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Wed, 18 Jul 2012 11:00:42 +0900

Hello Leif,

On 2012/07/17 23:11, Leif Halvard Silli wrote:
> "Martin J. Dürst", Tue, 17 Jul 2012 18:49:47 +0900:
>> On 2012/07/17 17:22, Leif Halvard Silli wrote:
>>
>>> And an argument was put forward in the WHATWG mailing list
>>> earlier this year/end of previous year, that a page with strict ASCII
>>> characters inside could still contain character entities/references for
>>> characters outside ASCII.
>>
>> Of course they can. That's the whole point of using numeric character
>> references. I'm rather surprised that this was even discussed in the
>> context of HTML5.
>
> And the question was whether such a page should default to be seen as
> UTF-8 encoded.

If I understand correctly, whether it's "seen as UTF-8 encoded" would be
irrelevant when displaying the page, but might be relevant e.g. for form
submission and the like?
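
A minimal sketch of that distinction, in Python with only the standard library. The sample markup and the helper name decode_and_resolve are made up for illustration and come from nobody in this thread. It assumes, as I believe browsers do by default, that form data is encoded in the page's encoding: an ASCII-only page with numeric character references resolves to the same text under either label, and the label only starts to matter when a non-ASCII character has to be turned back into bytes.

```python
from html import unescape
from urllib.parse import quote

# Hypothetical ASCII-only page content using numeric character references.
PAGE_BYTES = b"<p>bl&#229;b&#230;r</p>"


def decode_and_resolve(raw: bytes, encoding: str) -> str:
    """Decode the bytes under the given label, then resolve the references."""
    return unescape(raw.decode(encoding))


as_utf8 = decode_and_resolve(PAGE_BYTES, "utf-8")
as_latin1 = decode_and_resolve(PAGE_BYTES, "iso-8859-1")

# Display: the label makes no difference for an ASCII-only byte stream.
same_display = as_utf8 == as_latin1

# Form submission: the same character percent-encodes differently
# depending on which encoding is assumed for the page.
a_ring = "\u00e5"  # LATIN SMALL LETTER A WITH RING ABOVE
submitted_utf8 = quote(a_ring, encoding="utf-8")
submitted_latin1 = quote(a_ring, encoding="iso-8859-1")

print(same_display, submitted_utf8, submitted_latin1)
```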

> I have discovered one browser where it does hurt more directly: In W3M,
> the text browser, which is also included in Emacs. W3M doesn't handle
> (all) entities. E.g. it renders &aring; and &#229; as an 'aa' instead
> of as an 'å', for instance.

In a followup mail, you write:

> To quote one W3m slogan: 'Its 8-bit support is second to none'. W3m is
> a quite modern text browser. It is regularly updated, it can be used
> with emacs, and is the text browser I would recommend.

If W3M is updated so regularly, why isn't the &aring;/&#229; -> 'aa' bug
simply fixed?

> So it seems to me that it is always advantageous to type characters
> directly as doing so allows for better character encoding detection in
> case the encoding labels disappear (read: easier to pick up that the
> page is UTF-8 encoded) and also works better in at least one browser.
> It does, as well, make authors more aware of the entire encoding issue
> since it means that the page has to be properly labeled in order to
> work cross parsers.

I agree that in general, characters should be encoded directly. There
may be exceptions such as &nbsp;, where in some editing environments,
it's very helpful to see them explicitly.
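
On the detection point quoted above, a rough sketch in Python. The function name looks_like_utf8 and the sample strings are hypothetical, and this is a deliberately crude heuristic, not a description of what any browser actually implements. It only illustrates why directly typed characters give a detector something to work with, while an ASCII-only page with character references gives it nothing.

```python
def looks_like_utf8(raw: bytes) -> str:
    """Crude classification of a byte stream: 'ascii', 'utf-8' or 'not utf-8'."""
    try:
        raw.decode("ascii")
        return "ascii"  # pure ASCII: no evidence either way
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"  # non-ASCII bytes that form valid UTF-8 sequences
    except UnicodeDecodeError:
        return "not utf-8"  # non-ASCII bytes that are invalid as UTF-8


word = "bl\u00e5b\u00e6r"  # same sample text, typed directly

direct_utf8 = looks_like_utf8(word.encode("utf-8"))
direct_latin1 = looks_like_utf8(word.encode("iso-8859-1"))
escaped = looks_like_utf8(b"bl&#229;b&#230;r")

print(direct_utf8, direct_latin1, escaped)
```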

But a bug in a minor (or even a major) browser shouldn't be the reason
for avoiding character entities and numeric character references. The
best reason is simply that nobody should be using crutches as long as
they can walk with their own legs.

Regards, Martin.
Received on Tue Jul 17 2012 - 21:03:30 CDT
