Re: cp1252 decoder implementation

From: Peter Krefting <peter_at_opera.com>
Date: Wed, 21 Nov 2012 08:23:12 +0100

Doug Ewell <doug_at_ewellic.org>:

> Somewhat off-topic, I find it amusing that tolerance of "poorly encoded"
> input is considered justification for changing the underlying standards,
> when Internet Explorer has been flamed for years and years for
> tolerating bad input.

It's called adapting to reality, unfortunately. There are *a lot* of
documents on the web labelled as "iso-8859-1", or not labelled at
all, that use characters from the 1252 codepage. And since decoding
"proper" iso-8859-1 HTML documents as codepage 1252 does not hurt
anyone (HTML up to version 4 explicitly forbids the control codes in
the 0x80-0x9F range, which is the only range where the two encodings
differ), that is what everyone does.
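
To make it concrete, here is a minimal sketch in Python (chosen only
because it happens to ship both codecs; no browser works this way
internally, and the byte strings are made-up examples). The two
mappings agree everywhere except 0x80-0x9F, so decoding mislabelled
input as codepage 1252 recovers the intended characters while leaving
conforming documents untouched:

    # "Smart quotes" as produced by Windows authoring tools, in a
    # document (mis)labelled as iso-8859-1:
    data = b"\x93quoted\x94"

    # The declared charset yields C1 control characters, which HTML
    # up to version 4 forbids anyway:
    print(data.decode("iso-8859-1"))  # unprintable U+0093/U+0094

    # Decoding as codepage 1252 recovers the intended curly quotes:
    print(data.decode("cp1252"))      # “quoted”

    # For bytes that are valid printable iso-8859-1 (0xA0-0xFF), the
    # two decoders agree, so "proper" documents are unaffected:
    assert b"caf\xe9".decode("iso-8859-1") == b"caf\xe9".decode("cp1252")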

>> One browser started to accept data in a form that it shouldn't have
>> accepted. Sloppy content producers started to rely on this. Because the
>> browser in question was the dominant browser, other browsers had to try
>> and re-engineer and follow that browser, or just be ignored.
> Evidently it's OK if W3C or Python does it, but not if Microsoft does it.

Don't blame Microsoft here; it was Netscape (on Windows) that started
it, by simply mapping iso-8859-1 input data onto a windows-1252
encoded font for output. The same pages that looked "fine" on Windows
showed garbage on Unix, until the Unix version was patched to display
the data as codepage 1252 as well. Internet Explorer hadn't even been
released when this happened, and I can't remember now whether its
first versions did the same, or whether that was bolted on later.

-- 
\\// Peter Krefting - Core Technology Developer, Opera Software ASA
