Re: cp1252 decoder implementation

From: Doug Ewell
Date: Tue, 20 Nov 2012 09:49:56 -0700

Buck Golemon <buck at yelp dot com> wrote:

> What effort has been spent? This is not an either/or type of
> proposition. If we can agree that it's an improvement (albeit small),
> let's update the mapping.
> Is it much harder than I believe it is?

ISO/IEC 8859-1 is, uh, an ISO/IEC standard. CP1252 is a Microsoft
corporate standard. One does not simply "update" someone else's
standard, the WHATWG document and mapping tables notwithstanding.

> This is not an app question, but an infrastructure question.
> Internally the app is fully utf8, but must accept (poorly encoded)
> input from all over the web. cp1252 is the right thing to use for
> those inputs, but (as currently specified) it is not *guaranteed* to
> succeed (given that we're already talking about questionable input),
> as the old latin1 is.

Somewhat off-topic, I find it amusing that tolerance of "poorly encoded"
input is considered justification for changing the underlying standards,
when Internet Explorer has been flamed for years and years for
tolerating bad input. Martin's quote below is hardly unique:

| One browser started to accept data in a form that it shouldn't have
| accepted. Sloppy content producers started to rely on this. Because
| the browser in question was the dominant browser, other browsers had
| to try and re-engineer and follow that browser, or just be ignored.

Evidently it's OK if W3C or Python does it, but not if Microsoft does.

> My essential point is that the latin1 mapping file specifies an
> encoding that will succeed with arbitrary binary input.

If you have arbitrary binary input, you have no way of knowing that
CP1252 is the right way to map it. It could be CP1251 or ISO 8859-2 or
KOI8-R or KPS 9566 or ISO-2022-something, or a JPEG file, and in ANY of
those cases, your assumption of CP1252 will fail. And if it really is
CP1252, then you won't see those five unassigned code points anyway, as
Shawn said.
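
For anyone who wants to see the five code points in question, here is a
minimal Python sketch. It assumes Python's built-in "cp1252" and
"latin-1" codecs in their default strict mode; the helper name
undecodable_bytes is made up for this illustration. It only shows which
single bytes each codec rejects, and says nothing about whether a given
input really is CP1252.

```python
def undecodable_bytes(encoding):
    """Return the single-byte values that `encoding` refuses to decode.

    Hypothetical helper for illustration; relies on the codec's
    default strict error handling.
    """
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# Bytes with no assignment in Python's cp1252 codec.
cp1252_gaps = undecodable_bytes("cp1252")
# latin-1 maps every byte 0x00..0xFF to a code point, so this is empty.
latin1_gaps = undecodable_bytes("latin-1")

print([hex(b) for b in cp1252_gaps])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
print(latin1_gaps)                    # []
```

If one must accept such bytes anyway, passing errors="replace" to
decode() is one existing option that avoids touching the mapping table.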

Doug Ewell | Thornton, Colorado, USA | @DougEwell
Received on Tue Nov 20 2012 - 10:54:50 CST
