Re: pre-HTML5 and the BOM from Leif Halvard Silli on 2012-07-18 (Unicode Mail List Archive)

From: Leif Halvard Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Wed, 18 Jul 2012 09:35:06 +0200

"Martin J. Dürst", Wed, 18 Jul 2012 11:00:42 +0900:
> On 2012/07/17 23:11, Leif Halvard Silli wrote:
>> "Martin J. Dürst", Tue, 17 Jul 2012 18:49:47 +0900:
>>> On 2012/07/17 17:22, Leif Halvard Silli wrote:

>>>> that a page with strict ASCII characters inside could still
>>>> contain character entities/references for characters outside ASCII.
>>>
>>> Of course they can. … snip …

>> And the question was whether such a page should default to be seen as
>> UTF-8 encoded.
>
> If I understand correctly, whether it's "seen as UTF-8 encoded" would
> be irrelevant when displaying the page, but might be relevant e.g.
> for form submission and the like?

Yes. There might be technical problems too: HTML5 browsers are, when
sniffing, asked to scan only a small portion at the beginning of the
document. It would be a thin reason to default to UTF-8 just because
the start of the document contained no non-ASCII characters.
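
To illustrate, here is a minimal Python sketch of a sniffer that only
looks at a prefix. It is not how any particular browser works: the
1024-byte prefix length, the function name and the sample page are
assumptions made for the example.

```python
def prefix_is_ascii(document: bytes, prefix_len: int = 1024) -> bool:
    """Return True if the first prefix_len bytes contain only ASCII.

    prefix_len is an assumed scan window, chosen for illustration.
    """
    return document[:prefix_len].isascii()


# Hypothetical page: a long ASCII-only start, with the first non-ASCII
# characters appearing well beyond the scanned prefix.
page = (
    b"<!DOCTYPE html><title>x</title>"
    + b"<p>filler</p>" * 200
    + "<p>blåbær</p>".encode("utf-8")
)

# The prefix alone looks like plain ASCII ...
print(prefix_is_ascii(page))
# ... although the document as a whole is not.
print(page.isascii())
```

A sniffer limited like this cannot tell this page apart from a pure
ASCII page, so "no non-ASCII in the prefix" says little about the
encoding of the rest.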

>> … one browser where it does hurt more directly: … W3M …
>> renders &aring; and &#229; as an 'aa' instead of as an 'å' …
>
> In a followup mail, you write:
>
>> To quote one W3m slogan: 'Its 8-bit support is second to none'. W3m is
>> a quite modern text browser. It is regularly updated, it can be used
>> with emacs, and is the text browser I would recommend.
>
> If W3M is updated so regularly, why isn't the &aring;/&#229; -> 'aa'
> bug simply fixed?

Fair point. I've made the W3m mailing list aware of it.

>> So it seems to me that it is always advantageous to type characters
>> directly as doing so allows for better character encoding detection in
>> case the encoding labels disappear (read: easier to pick up that the
>> page is UTF-8 encoded) and also works better in at least one browser.
>> It does, as well, make authors more aware of the entire encoding issue
>> since it means that the page has to be properly labeled in order to
>> work cross parsers.
>
> I agree that in general, characters should be encoded directly. There
> may be exceptions such as &nbsp;, where in some editing environments,
> it's very helpful to see them explicitly.
>
> But a bug in a minor (or even a major) browser shouldn't be the
> reason for avoiding character entities and numeric character
> references.

Advising on how to code based on accidental bugs is, of course,
hopeless.

> The best reason is simply that nobody should be using
> crutches as long as they can walk with their own legs.

Crutches, in that sense, are only about authoring convenience. And, of
course, there is a difference between using named and numeric character
references for a single non-ASCII letter and using them for all
of them. Nevertheless: I, as a Web author, would perhaps skip that
convenience if I knew that doing so could improve e.g. an HTML5 browser's
ability to sniff the encoding correctly when all other encoding info is
lost. If such sniffing can be an alternative to the BOM, and the BOM is
questionable, then why not mention it as a reason to avoid the crutches?
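
A rough Python sketch of why directly typed characters help: character
references are pure ASCII and so carry no encoding signal, whereas
directly typed letters either form valid UTF-8 sequences or they do
not. The function name and its three labels are made up for the
example, and real sniffers are more elaborate than a strict decode.

```python
def utf8_evidence(data: bytes) -> str:
    """Classify bytes by what they reveal about a UTF-8 encoding.

    The three labels are illustrative, not taken from any specification.
    """
    if data.isascii():
        return "ascii-only"        # nothing to sniff on
    try:
        data.decode("utf-8")       # strict decode: fails on invalid sequences
    except UnicodeDecodeError:
        return "not-utf-8"
    return "looks-like-utf-8"


# The same word written with character references and typed directly.
with_refs = "<p>bl&aring;b&aelig;r</p>".encode("utf-8")
direct_utf8 = "<p>blåbær</p>".encode("utf-8")
direct_latin1 = "<p>blåbær</p>".encode("latin-1")

for sample in (with_refs, direct_utf8, direct_latin1):
    print(utf8_evidence(sample))
```

Only the two directly typed variants give a sniffer anything to work
with once the encoding labels are gone.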

-- 
Leif Halvard Silli

Received on Wed Jul 18 2012 - 02:39:27 CDT
