Re: cp1252 decoder implementation from Buck Golemon on 2012-11-17 (Unicode Mail List Archive)

From: Buck Golemon <buck_at_yelp.com>
Date: Sat, 17 Nov 2012 08:34:48 -0800

> So don't say that there are one-for-one equivalences.

I was just quoting this section of the standard:
http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

> There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is
> numerically equal to its corresponding Unicode code point.

A one-to-one equivalence between byte values and Unicode code points is
exactly what is specified here, limited to the domain of "8-bit control codes".
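
To make the point concrete, here is a rough Python sketch. It is only an
illustration, not anything the standard or HTML5 specifies. It assumes
Python's built-in "latin-1" and "cp1252" codecs; the helper names
decode_cp1252_with_c1_fallback and encode_cp1252_with_c1_fallback are
hypothetical. Bytes that Python's cp1252 codec leaves undefined (such as
0x81) are mapped to the numerically equal code point, so decoding stays
reversible.

```python
# Bytes that Python's built-in cp1252 codec leaves undefined.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def decode_cp1252_with_c1_fallback(data: bytes) -> str:
    """Hypothetical helper: decode as cp1252, but map undefined bytes
    to the numerically equal code point (e.g. 0x81 -> U+0081)."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Assumption: treat the byte as an 8-bit control code.
            out.append(chr(b))
    return "".join(out)


def encode_cp1252_with_c1_fallback(text: str) -> bytes:
    """Hypothetical inverse of the decoder above."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) in UNDEFINED_IN_CP1252:
                out.append(ord(ch))
            else:
                raise
    return bytes(out)


# In latin-1, every byte in the C1 range equals its code point.
c1_is_numeric = all(
    bytes([b]).decode("latin-1") == chr(b) for b in range(0x80, 0xA0)
)

# Every possible byte value survives a decode/encode round trip.
all_bytes = bytes(range(256))
round_trip_ok = (
    encode_cp1252_with_c1_fallback(decode_cp1252_with_c1_fallback(all_bytes))
    == all_bytes
)
```

Decoding byte by byte is slow; it just keeps the sketch short. A real
implementation would more likely register a custom codec or error handler.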

On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> If you are thinking about "byte values" you are working at the encoding
> scheme level (in fact another lower level which defines a protocol
> presentation layer, e.g. "transport syntaxes" in MIME). Unicode codepoints
> are conceptually not an encoding scheme, just a coded character set
> (independent of the encoding scheme).
>
> Separate the levels of abstraction and you'll be much better off. Forget
> the apparent homonymies that exist between distinct layers of abstraction
> and use each standard in what it is designed for (including the Unicode
> "character/glyph model" which is not defining an encoding scheme).
>
> So don't say that there are one-for-one equivalences. This is wrong: the
> adaptation layer must exist between abstraction levels and between separate
> standards, but the Unicode standard does not specify them completely (with
> the only exception of standard UTF encoding schemes, which is just one
> possible adaptation across some abstraction levels, but is not made to
> adapt alone to other standards than what is in the Unicode standard itself).
>
>
>
> 2012/11/17 Buck Golemon <buck_at_yelp.com>
>
>> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:
>>
>>> Buck Golemon wrote:
>>>
>>>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>>>> to map it to the equally-non-semantic U+0081?
>>>>
>>>> This would allow systems that follow the html5 standard and use cp1252
>>>> in place of latin1 to continue to be binary-faithful and reversible.
>>>>
>>>
>>> This isn't quite as black-and-white as the question about Latin-1. If
>>> you are targeting HTML5, you are probably safe in treating an incoming 0x81
>>> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
>>
>>
>> Why do you make this conditional on targeting html5?
>>
>> To me, replacement and error are out because they mean the system loses
>> data or completely fails where it used to succeed.
>> Currently there's no reasonable way for me to implement the U+0081 option
>> other than inventing a new "cp1252+latin1" codec, which seems undesirable.
>>
>>
>>> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no
>>> longer matters what the byte is in 8859-1.
>>
>>
>> I feel like you skipped a step. The byte is 0x81 full stop. I agree that
>> it doesn't matter how it's defined in latin1 (also it's not defined in
>> latin1).
>> The section of the Unicode standard that says control codes are equal to
>> their Unicode code points doesn't mention latin1. Should it?
>> I was under the impression that it meant any single-byte encoding, since
>> it goes out of its way to talk about "8-bit" control codes.
>>
>
>
Received on Sat Nov 17 2012 - 10:38:21 CST
