Re: cp1252 decoder implementation

From: Philippe Verdy <>
Date: Sat, 17 Nov 2012 06:48:03 +0100

If you are thinking about "byte values", you are working at the encoding
scheme level (or in fact at an even lower level that defines a protocol
presentation layer, e.g. the "transport syntaxes" in MIME). Unicode code
points are conceptually not an encoding scheme, just a coded character set
(independent of the encoding scheme).

Separate the levels of abstraction and you will be much better off. Forget
the apparent homonymies that exist between distinct layers of abstraction,
and use each standard for what it is designed for (including the Unicode
"character/glyph model", which does not define an encoding scheme).

So don't say that there are one-for-one equivalences. This is wrong: an
adaptation layer must exist between abstraction levels and between separate
standards, but the Unicode standard does not specify these layers completely
(the only exception being the standard UTF encoding schemes, which are just
one possible adaptation across some abstraction levels, and are not designed
to adapt, on their own, to standards other than the Unicode standard itself).
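
To illustrate the separation of levels, here is a minimal Python sketch. It
relies on Python's built-in codecs (not on anything the Unicode standard
mandates) and the variable names are only illustrative. The same code point,
U+0081, becomes different byte sequences under different encoding schemes, and
Python's cp1252 codec cannot encode it at all:

```python
# One coded character (a code point) seen through several encoding schemes.
ch = chr(0x81)  # code point U+0081, a C1 control character

as_latin1 = ch.encode("latin-1")     # a single byte with the same value
as_utf8 = ch.encode("utf-8")         # two bytes
as_utf16be = ch.encode("utf-16-be")  # two bytes, big-endian

# Python's cp1252 codec leaves byte 0x81 unassigned, so U+0081 has no
# representation there and encoding fails.
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

print(as_latin1, as_utf8, as_utf16be, cp1252_ok)
```

The byte 0x81 and the code point U+0081 only coincide because Latin-1 happens
to define that mapping. It is a property of that one adaptation layer, not of
the code point itself.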

2012/11/17 Buck Golemon <>

> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <> wrote:
>> Buck Golemon wrote:
>>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>>> to map it to the equally-non-semantic U+81 ?
>>> This would allow systems that follow the html5 standard and use cp1252
>>> in place of latin1 to continue to be binary-faithful and reversible.
>> This isn't quite as black-and-white as the question about Latin-1. If you
>> are targeting HTML5, you are probably safe in treating an incoming 0x81
>> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
> Why do you make this conditional on targeting html5?
> To me, replacement and error is out because it means the system loses data
> or completely fails where it used to succeed.
> Currently there's no reasonable way for me to implement the U+0081 option
> other than inventing a new "cp1252+latin1" codec, which seems undesirable.
>> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer
>> matters what the byte is in 8859-1.
> I feel like you skipped a step. The byte is 0x81 full stop. I agree that
> it doesn't matter how it's defined in latin1 (also it's not defined in
> latin1).
> The section of the unicode standard that says control codes are equal to
> their unicode characters doesn't mention latin1. Should it?
> I was under the impression that it meant any single-byte encoding, since
> it goes out of its way to talk about "8-bit" control codes.
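
For reference, the "cp1252+latin1" behaviour discussed in the quoted text
could be sketched in Python as below. This is an assumption-laden
illustration, not a standard codec: the handler name "c1fallback" and both
helper functions are hypothetical, and the sketch relies on Python's cp1252
codec treating bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D as undefined.

```python
import codecs

# Bytes that Python's cp1252 codec treats as undefined.
UNASSIGNED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def c1_fallback(err):
    """Hypothetical handler: map each undecodable byte to the code point
    with the same numeric value (i.e. decode it as Latin-1 would)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(b) for b in bad), err.end


codecs.register_error("c1fallback", c1_fallback)


def decode_cp1252_lossless(data: bytes) -> str:
    # Defined bytes decode as usual; undefined ones fall back to U+0081 etc.
    return data.decode("cp1252", errors="c1fallback")


def encode_cp1252_lossless(text: str) -> bytes:
    # Inverse of the decoder above, so that every byte string round-trips.
    out = bytearray()
    for ch in text:
        if ord(ch) in UNASSIGNED:
            out.append(ord(ch))
        else:
            out += ch.encode("cp1252")
    return bytes(out)


all_bytes = bytes(range(256))
round_trip = encode_cp1252_lossless(decode_cp1252_lossless(all_bytes))
print(round_trip == all_bytes)
```

Whether such a mapping is appropriate is exactly the question under debate.
The sketch only shows that it can be made binary-faithful and reversible.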
Received on Fri Nov 16 2012 - 23:53:23 CST
