Re: cp1252 decoder implementation from Buck Golemon on 2012-11-16 (Unicode Mail List Archive)

From: Buck Golemon <buck_at_yelp.com>
Date: Fri, 16 Nov 2012 19:54:23 -0800

On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:

> Buck Golemon wrote:
>
>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>> to map it to the equally-non-semantic U+0081?
>>
>> This would allow systems that follow the html5 standard and use cp1252
>> in place of latin1 to continue to be binary-faithful and reversible.
>>
>
> This isn't quite as black-and-white as the question about Latin-1. If you
> are targeting HTML5, you are probably safe in treating an incoming 0x81
> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.

Why do you make this conditional on targeting HTML5?

To me, replacement and errors are both out, because either one means the
system loses data or fails outright where it used to succeed.
Currently there's no reasonable way for me to implement the U+0081 option
other than inventing a new "cp1252+latin1" codec, which seems undesirable.
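
For illustration, here is a minimal Python 3 sketch of the U+0081 option
that stops short of defining a whole new codec. It assumes CPython's
built-in cp1252 codec, which leaves the bytes 0x81, 0x8D, 0x8F, 0x90 and
0x9D undefined, and it registers an error handler that passes just those
values through unchanged. The handler name "c1_passthrough" and the two
helper functions are made up for this example; nothing here is a standard
or library-provided codec.

```python
import codecs

# Byte values that CPython's cp1252 codec leaves undefined.
UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def c1_passthrough(err):
    """Hypothetical error handler: map an undefined cp1252 byte to the
    code point with the same value (0x81 <-> U+0081), in both directions."""
    bad = err.object[err.start:err.end]
    if isinstance(err, UnicodeDecodeError):
        # bad is a bytes slice; iterating it yields ints.
        if all(b in UNDEFINED for b in bad):
            return "".join(map(chr, bad)), err.end
    elif isinstance(err, UnicodeEncodeError):
        # bad is a str slice of characters cp1252 could not encode.
        if all(ord(c) in UNDEFINED for c in bad):
            return bytes(map(ord, bad)), err.end
    # Anything else is a genuine error; re-raise it untouched.
    raise err


codecs.register_error("c1_passthrough", c1_passthrough)


def decode_cp1252(data):
    return data.decode("cp1252", "c1_passthrough")


def encode_cp1252(text):
    return text.encode("cp1252", "c1_passthrough")
```

With this in place, decode_cp1252(b"\x81") gives "\u0081" instead of
raising, ordinary cp1252 bytes such as 0x80 still decode to their usual
characters, and encoding the result gives back the original bytes.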

> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer
> matters what the byte is in 8859-1.

I feel like you skipped a step. The byte is 0x81, full stop. I agree that it
doesn't matter how it's defined in latin1 (and it isn't defined in latin1
anyway).
The section of the Unicode Standard that says control codes are equal to
their Unicode characters doesn't mention latin1. Should it?
I was under the impression that it applied to any single-byte encoding,
since it goes out of its way to talk about "8-bit" control codes.
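
As a concrete reference point, the following small Python 3 check shows how
one implementation behaves. It assumes CPython's built-in "latin-1" and
"cp1252" codecs; the function name undefined_in is made up for this
example. It says nothing about what the Unicode Standard itself requires,
only what these two codecs do with each byte value.

```python
def undefined_in(codec):
    """Return the byte values that the named codec refuses to decode."""
    missing = []
    for b in range(256):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


# CPython's latin-1 maps every byte to the code point of the same value,
# including the 8-bit control range 0x80-0x9F.
c1_bytes = bytes(range(0x80, 0xA0))
latin1_is_identity = [ord(ch) for ch in c1_bytes.decode("latin-1")] == list(c1_bytes)

# CPython's cp1252 rejects exactly these byte values under strict decoding.
cp1252_gaps = undefined_in("cp1252")
```

Here latin1_is_identity is True, undefined_in("latin-1") is empty, and
cp1252_gaps is [0x81, 0x8D, 0x8F, 0x90, 0x9D].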
Received on Fri Nov 16 2012 - 21:56:46 CST