RE: latin1 decoder implementation

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Fri, 16 Nov 2012 22:28:39 +0000

The first 256 characters of the Unicode Standard *are* compatible with ISO/IEC 8859-1 (Latin-1), but you need to distinguish what happens for the graphic characters from what happens for the control codes.

ISO 8859-1 defines *graphic* characters in the ranges 0x20..0x7E, 0xA0..0xFF. Those are exactly identical to the Unicode characters at the respective code points.

ISO 8859-1 does *not* define control code usage, in the ranges 0x00..0x1F, 0x7F..0x9F. What that standard says is:

"The shaded positions in the code table [i.e. 0x00..0x1F, 0x7F..0x9F] correspond to bit combinations that do not represent graphic characters. Their use is outside the scope of ISO/IEC 8859; it is specified in other International Standards, for example ISO/IEC 6429."

What [almost] all character set conversions for ISO 8859 character encodings currently assume is that control code usage for the C0 set (0x00..0x1F, 0x7F) and the C1 set (0x80..0x9F) corresponds to the control functions defined by ISO/IEC 6429. That is also what the Unicode Standard implicitly assumes for U+0000..U+001F, U+007F..U+009F. So one-to-one conversion of the control codes is the correct thing to do. Even in the occasional cases where data using control function conventions other than ISO 6429 is converted, the control code values are preserved through conversion to Unicode this way.
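The graphic/control split described above can be checked against the Unicode character database. The following is only a minimal sketch: the names GRAPHIC, CONTROL, and unicode_controls are made up for illustration, and it assumes that General_Category "Cc" (as reported by Python's unicodedata module) is a reasonable stand-in for "control code".

```python
import unicodedata

# Byte ranges as given in this message (illustrative names, not from any standard).
GRAPHIC = set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))   # 0x20..0x7E, 0xA0..0xFF
CONTROL = set(range(0x00, 0x20)) | set(range(0x7F, 0xA0))    # 0x00..0x1F, 0x7F..0x9F

# Code points in U+0000..U+00FF that Unicode classifies as controls
# (General_Category "Cc").
unicode_controls = {
    cp for cp in range(0x100) if unicodedata.category(chr(cp)) == "Cc"
}

# The two sets partition all 256 byte values.
covers_all = (GRAPHIC | CONTROL) == set(range(0x100))
disjoint = not (GRAPHIC & CONTROL)
```

Under these assumptions, unicode_controls should come out equal to CONTROL, i.e. Unicode treats exactly the positions that ISO 8859-1 leaves undefined as control codes.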

So, yes, Python is correct in converting all 256 values 0x00..0xFF in Latin-1 data to U+0000..U+00FF in Unicode.
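For what it is worth, the behavior is easy to confirm from Python itself. This is a small sketch using only the built-in "latin-1" codec; the variable names are illustrative.

```python
# Every possible byte value, 0x00..0xFF, in order.
all_bytes = bytes(range(256))

# Decode as Latin-1; each byte should become the code point of the same value.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

# Encoding back should reproduce the original bytes unchanged.
round_trip = decoded.encode("latin-1")

identity_mapping = code_points == list(range(256))
lossless = round_trip == all_bytes
```

Both identity_mapping and lossless should be True, including for the C0 and C1 control positions.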

But no, this does *not* imply that the Unicode Standard has inserted character definitions into ISO/IEC 8859-1.

--Ken

Restated, are the first 256 characters of unicode intended to be exactly compatible with a latin1 codec?
This would imply that unicode has inserted character definitions into the ISO-8859-1 standard.
Received on Fri Nov 16 2012 - 16:29:46 CST
