Re: cp1252 decoder implementation

From: Buck Golemon <>
Date: Fri, 16 Nov 2012 19:36:21 -0800

I did this and was criticized for inventing my own "frankensteined"
encoding, although I believe it's conceptually consistent with the idea
that cp1252 is to be used as a superset of latin1.
It's true that what I wrote is not consistent with the unicode.org definition:

The surrogateescape error handler doesn't seem exactly like what I want,
and we're not on python3 yet, but "PEP 293 -- Codec Error Handling
Callbacks" seems usable.
I can define for myself a "c0c1" error handler that replaces bytes in those
ranges with the numerically equal Unicode character, per the standard.

> There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> codes and the Unicode control codes: every 7-bit (or 8-bit) control code
> is numerically equal to its corresponding Unicode code point.
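
A rough sketch of what I mean by a "c0c1" handler. Assumptions: this is
written against Python 3 (where an encode error handler may return bytes),
the handler name and the UNDEFINED_CP1252 set are my own, and it only
covers the five bytes the stock cp1252 codec rejects. It is an
illustration, not a finished codec.

```python
import codecs

# The five bytes the stock cp1252 codec leaves undefined (assumed set).
UNDEFINED_CP1252 = frozenset([0x81, 0x8D, 0x8F, 0x90, 0x9D])


def c0c1_errors(exc):
    """Hypothetical handler: map undefined bytes to the equal code point."""
    bad = exc.object[exc.start:exc.end]
    if isinstance(exc, UnicodeDecodeError):
        # bad is a bytes slice; iterating it yields ints.
        if all(b in UNDEFINED_CP1252 for b in bad):
            return "".join(chr(b) for b in bad), exc.end
    elif isinstance(exc, UnicodeEncodeError):
        # bad is a str slice; hand back the raw bytes so the data round-trips.
        if all(ord(ch) in UNDEFINED_CP1252 for ch in bad):
            return bytes(ord(ch) for ch in bad), exc.end
    # Anything else (e.g. a character cp1252 truly cannot encode) stays an error.
    raise exc


codecs.register_error("c0c1", c0c1_errors)

raw = b"caf\xe9 \x81\x8d \x93ok\x94"
text = raw.decode("cp1252", "c0c1")
round_tripped = text.encode("cp1252", "c0c1")

# For comparison: surrogateescape also round-trips, but via lone surrogates.
escaped = b"\x81".decode("cp1252", "surrogateescape")
```

With this, byte 0x81 decodes to U+0081, whereas surrogateescape gives
U+DC81, which is why the latter isn't quite what I'm after.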

I'm hoping, though, that we can come to agreement that defining the five
undefined cp1252 bytes as control characters, above, is beneficial, or at
least not harmful, in all cases.
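
For reference, the five bytes in question can be found by probing the
codec. A small sketch, assuming Python 3 and the cp1252 table that ships
with CPython; the helper name is_defined is mine.

```python
def is_defined(byte_value):
    """Return True if the stock cp1252 codec can decode this single byte."""
    try:
        bytes([byte_value]).decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


# 0x80-0x9F is where cp1252 and latin1 differ (the C1 control range).
c1_range = range(0x80, 0xA0)
undefined = [b for b in c1_range if not is_defined(b)]
defined = [b for b in c1_range if is_defined(b)]

print([hex(b) for b in undefined])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
print(len(defined))                 # 27
```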

On Fri, Nov 16, 2012 at 4:13 PM, <> wrote:

> Quoting Buck Golemon <>:
>> cp1252 (aka windows-1252) defines 27 characters which iso-8859-1 does not.
>> This leaves five bytes with undefined semantics.
>> Currently the python cp1252 decoder allows us to ignore/replace/error on
>> these bytes, but there's no facility for allowing these unknown bytes to
>> round-trip through the codec, as the latin1 codec does.
> That's not true: there are actually *two* facilities that allow exactly
> that.
> 1. you can write a new codec which round-trips these bytes through some
> characters,
> or
> 2. you can write an error handler that does such round-tripping. The
> surrogate-escape error handler was specifically designed to allow such
> round-tripping, see PEP 383 (peps/pep-0383/)
> (not just for this codec, but for any codec).
> Regards,
> Martin
Received on Fri Nov 16 2012 - 21:41:47 CST
