Re: cp1252 decoder implementation

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Wed, 21 Nov 2012 18:25:54 +0900

On 2012/11/21 16:23, Peter Krefting wrote:
> Doug Ewell <doug_at_ewellic.org>:
>
>> Somewhat off-topic, I find it amusing that tolerance of "poorly
>> encoded" input is considered justification for changing the underlying
>> standards,

The encoding work at W3C, at least as far as I see it, is not an attempt
to redefine e.g. iso-8859-1 itself. To be blunt, it's just to make clear
that lots of Web pages out there are lying, and help browsers detect
this in a uniform way.

This does not mean that all other software has to do the same. Real
ISO-8859-1 will still be treated correctly by browsers. When you create
a Web page, if it's really iso-8859-1, then label it as such, but when
it's actually windows-1252, then label it as such. And make sure it
doesn't contain any undefined (or C1) codepoints. That way, it will
interoperate not only with browsers, but also with other software.
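
As a rough illustration of that check, here is a minimal Python sketch that
reports the byte offsets that make a given label questionable. The function
name label_problems and the set CP1252_UNDEFINED are hypothetical names chosen
for this example. The five undefined bytes are the ones that Python's own
cp1252 codec refuses to decode; other tables may treat them differently.

```python
# Bytes that have no assignment in windows-1252 (as implemented by
# Python's "cp1252" codec). Assumption: other tables may differ.
CP1252_UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})

# The C1 control range, which real iso-8859-1 Web pages should not contain.
C1_RANGE = range(0x80, 0xA0)


def label_problems(data, label):
    """Return the offsets of bytes that do not fit the declared label.

    Hypothetical helper: an empty list means the bytes are consistent
    with the label, not that the label is necessarily the right one.
    """
    if label == "iso-8859-1":
        bad = C1_RANGE
    elif label == "windows-1252":
        bad = CP1252_UNDEFINED
    else:
        raise ValueError("unsupported label: %r" % (label,))
    return [offset for offset, byte in enumerate(data) if byte in bad]


if __name__ == "__main__":
    page = b"\x93quoted\x94"  # curly quotes as windows-1252 bytes
    print(label_problems(page, "iso-8859-1"))
    print(label_problems(page, "windows-1252"))
```

If the first call reports offsets and the second does not, the page is most
likely windows-1252 and should be labelled as such.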

Also, if you write any kind of tool, feel free to use the narrower
(real) definition, and to throw errors. Very few tools have to accept as
wide a range of data as browsers do without throwing an error.
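
To make the contrast concrete, here is a minimal Python sketch of the two
attitudes. Both function names are hypothetical. The strict one takes the
label at its word and raises on C1 or undefined bytes. The tolerant one is
only an approximation of browser behaviour, under the assumption that the
label is ignored, everything is read as windows-1252, and the five bytes
undefined there are passed through as the corresponding C1 controls.

```python
# Map the two labels discussed here to Python codec names.
CODECS = {"iso-8859-1": "latin-1", "windows-1252": "cp1252"}


def decode_strict(data, label):
    """Decode according to the narrow definition and raise on anything else."""
    # For cp1252 this raises UnicodeDecodeError (a ValueError subclass)
    # on the bytes that have no assignment.
    text = data.decode(CODECS[label])
    for offset, ch in enumerate(text):
        # Only latin-1 can produce C1 controls here; reject them.
        if 0x80 <= ord(ch) <= 0x9F:
            raise ValueError(
                "C1 control U+%04X at offset %d; is this really %s?"
                % (ord(ch), offset, label)
            )
    return text


def decode_tolerant(data):
    """Approximate browser-style decoding: never raises (assumption, see above)."""
    out = []
    for byte in data:
        try:
            out.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in cp1252: fall back to the same-numbered code point.
            out.append(chr(byte))
    return "".join(out)


if __name__ == "__main__":
    mislabelled = b"\x93quoted\x94"
    print(repr(decode_tolerant(mislabelled)))
    try:
        decode_strict(mislabelled, "iso-8859-1")
    except ValueError as exc:
        print("strict tool refuses:", exc)
```

A validator or conversion tool would normally want the first function; the
second only makes sense where refusing the input is not an option.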

>> when Internet Explorer has been flamed for years and years
>> for tolerating bad input.
>
> It's called adapting to reality, unfortunately. There are *a lot* of
> documents on the web labelled as being "iso-8859-1" and/or not labelled
> at all, which are using characters from the 1252 codepage. And since
> using the 1252 codepage to decode "proper" iso-8859-1 HTML documents
> does not hurt anyone (as HTML up to version 4 explicitly forbids the use
> of the control codes in the 0x80-0x9F range), that is what everyone does.
>
>>> One browser started to accept data in a form that it shouldn't have
>>> accepted. Sloppy content producers started to rely on this. Because
>>> the browser in question was the dominant browser, other browsers had
>>> to try and re-engineer and follow that browser, or just be ignored.
>> Evidently it's OK if W3C or Python does it, but not if Microsoft does it.
>
> Don't blame Microsoft here, it was Netscape (on Windows) that started
> it, by just mapping the iso-8859-1 input data to a windows-1252 encoded
> font output. The same pages that would work "fine" on Windows would show
> garbage on Unix, until it was patched to also display it as codepage
> 1252. Internet Explorer wasn't even published when this happened, and I
> can't remember now whether the first versions of it actually did this,
> or if it was bolted on later.

Thanks for this correction. Because it was windows-1252, I had assumed
it was Microsoft.

Regards, Martin.
Received on Wed Nov 21 2012 - 03:30:10 CST