Re: latin1 decoder implementation

From: Buck Golemon <buck_at_yelp.com>
Date: Fri, 16 Nov 2012 13:48:18 -0800

When decoding bytes to unicode using the "latin1" scheme, there are three
options for bytes not defined in the ISO-8859-1 standard.

1) Throw an error.
2) Insert the replacement character (U+FFFD), indicating an unknown character.
3) Insert the Unicode character whose code point equals the byte value.
This means that completely random bytes will always decode successfully.
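
To make the three options concrete, here is a rough sketch in Python 3.
The function decode_8859_1 and the "passthrough" handler name are
hypothetical, not part of Python's codecs module. The UNDEFINED set is
also an assumption: it takes "not defined" to mean the control ranges
0x00-0x1F and 0x7F-0x9F, to which ISO/IEC 8859-1 assigns no graphic
characters.

```python
# Hypothetical helper for illustration only; not Python's latin1 codec.
# Assumption: the undefined bytes are the control ranges 0x00-0x1F and
# 0x7F-0x9F, which ISO/IEC 8859-1 leaves without graphic characters.
UNDEFINED = set(range(0x00, 0x20)) | set(range(0x7F, 0xA0))


def decode_8859_1(data, errors="strict"):
    """Decode bytes, handling undefined bytes per the chosen option."""
    out = []
    for pos, byte in enumerate(data):
        if byte in UNDEFINED and errors == "strict":
            # Option 1: throw an error.
            raise UnicodeDecodeError(
                "iso8859-1-sketch", data, pos, pos + 1,
                "byte not defined in ISO-8859-1",
            )
        if byte in UNDEFINED and errors == "replace":
            # Option 2: insert the replacement character U+FFFD.
            out.append("\ufffd")
        else:
            # Option 3 ("passthrough"), and every defined byte:
            # use the code point with the same numeric value.
            out.append(chr(byte))
    return "".join(out)
```

Under that assumption even a newline byte would be rejected by the
strict variant, which shows how much existing data depends on option
three in practice.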

Python currently implements option three. Is this correct?
Codecs accept an error-handling option that produces errors or
replacements for undefined bytes, but as implemented, latin1 defines
characters for all 256 bytes, so the option has no effect.
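
The behavior described above can be observed with a short check. This is
a sketch assuming Python 3 and its built-in "latin1" and "ascii" codecs;
the variable names are only illustrative.

```python
data = bytes(range(256))  # every possible byte value, once each

strict = data.decode("latin1", errors="strict")
replaced = data.decode("latin1", errors="replace")

# Neither error handler ever fires, so both results are identical.
assert strict == replaced
assert len(strict) == 256

# Each byte decodes to the code point with the same numeric value.
assert [ord(ch) for ch in strict] == list(range(256))

# For contrast, the handler does take effect in a codec that has
# undefined bytes: ascii rejects 0x80, so "replace" yields U+FFFD.
ascii_replaced = b"\x80".decode("ascii", errors="replace")
assert ascii_replaced == "\ufffd"
```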

Restated, are the first 256 code points of Unicode intended to be exactly
compatible with a latin1 codec?
This would imply that Unicode has inserted character definitions into the
ISO-8859-1 standard.
Received on Fri Nov 16 2012 - 16:00:18 CST
