Re: cp1252 decoder implementation

From: Buck Golemon <buck_at_yelp.com>
Date: Sun, 18 Nov 2012 21:51:30 -0800

I find these to be true statements, but I don't see how they support or
refute what came before.

On Sun, Nov 18, 2012 at 3:58 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> The same chapter makes a normative reference to ISO/IEC 2022 for C0
> controls; it does not say that this concerns ISO/IEC 8859 (which does not
> itself reference ISO/IEC 2022 as normative, but only informatively, just
> to say that it is compatible with it, as well as with ISO 6429 and a wide
> range of other international or national norms and various private
> standards, but not all of them: e.g. the VISCII national standard is not
> compatible with ISO/IEC 2022).
>
>
>
> 2012/11/17 Buck Golemon <buck_at_yelp.com>
>
>> > So don't say that there are one-for-one equivalences.
>>
>> I was just quoting this section of the standard:
>> http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>>
>> > There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
>> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is
>> numerically equal to its corresponding Unicode code point.
>>
>> A one-to-one equivalence between bytes and Unicode code points is exactly
>> what is specified here, limited to the domain of "8-bit control codes".
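
For concreteness, here is a minimal Python sketch of the numeric equality
that passage describes, using latin-1 as the 8-bit example since it maps
every byte to the code point with the same value:

    # The "numerically equal" claim, illustrated with latin-1: every C0 or
    # C1 control byte decodes to the code point with the same numeric value.
    for byte in list(range(0x00, 0x20)) + list(range(0x80, 0xA0)):
        assert ord(bytes([byte]).decode('latin-1')) == byte
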
>>
>>
>> On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>>
>>> If you are thinking about "byte values" you are working at the encoding
>>> scheme level (in fact at an even lower level, one that defines a protocol
>>> presentation layer, e.g. the "transport syntaxes" in MIME). Unicode code
>>> points are conceptually not an encoding scheme, just a coded character
>>> set (independent of any encoding scheme).
>>>
>>> Separate the levels of abstraction and you'll be much better off. Forget
>>> the apparent homonymies that exist between distinct layers of abstraction
>>> and use each standard for what it is designed for (including the Unicode
>>> "character/glyph model", which does not define an encoding scheme).
>>>
>>> So don't say that there are one-for-one equivalences. This is wrong: an
>>> adaptation layer must exist between abstraction levels and between
>>> separate standards, but the Unicode standard does not specify these
>>> layers completely (with the only exception of the standard UTF encoding
>>> schemes, which are just one possible adaptation across some abstraction
>>> levels, and are not made to adapt on their own to standards beyond the
>>> Unicode standard itself).
>>>
>>>
>>>
>>> 2012/11/17 Buck Golemon <buck_at_yelp.com>
>>>
>>>> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:
>>>>
>>>>> Buck Golemon wrote:
>>>>>
>>>>>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>>>>>> to map it to the equally non-semantic U+0081?
>>>>>>
>>>>>> This would allow systems that follow the html5 standard and use cp1252
>>>>>> in place of latin1 to continue to be binary-faithful and reversible.
>>>>>>
>>>>>
>>>>> This isn't quite as black-and-white as the question about Latin-1. If
>>>>> you are targeting HTML5, you are probably safe in treating an incoming 0x81
>>>>> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
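
For illustration, with Python's stock cp1252 codec (which leaves 0x81
unmapped, so strict decoding raises) those three treatments look roughly
like this:

    data = b'smart \x93quotes\x94 and a stray \x81'

    # Treat it as an error: cp1252 has no mapping for 0x81, so strict
    # decoding raises UnicodeDecodeError.
    try:
        data.decode('cp1252')
    except UnicodeDecodeError as e:
        print('strict:', e.reason)

    # Substitute U+FFFD for the unmapped byte.
    print('replace:', data.decode('cp1252', errors='replace'))

    # Passing the byte through as U+0081 has no built-in error handler;
    # a custom one is sketched further down.
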
>>>>
>>>>
>>>> Why do you make this conditional on targeting html5?
>>>>
>>>> To me, replacement and error are out, because they mean the system loses
>>>> data or fails outright where it used to succeed.
>>>> Currently there's no reasonable way for me to implement the U+0081
>>>> option other than inventing a new "cp1252+latin1" codec, which seems
>>>> undesirable.
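
For what such a "cp1252+latin1" fallback might look like in practice, here
is a minimal Python 3 sketch that uses a custom error handler rather than a
whole new codec (the handler name 'latin1_fallback' is invented for
illustration):

    import codecs

    def latin1_fallback(exc):
        # Fall back to latin-1 for whatever cp1252 could not map, so the
        # undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) round-trip as the
        # numerically equal C1 controls.
        chunk = exc.object[exc.start:exc.end]
        if isinstance(exc, UnicodeDecodeError):
            return chunk.decode('latin-1'), exc.end
        if isinstance(exc, UnicodeEncodeError):
            return chunk.encode('latin-1'), exc.end
        raise exc

    codecs.register_error('latin1_fallback', latin1_fallback)

    raw = b'smart \x93quotes\x94 and a stray \x81'
    text = raw.decode('cp1252', errors='latin1_fallback')
    assert text.encode('cp1252', errors='latin1_fallback') == raw

The decode branch alone makes cp1252 total over all 256 byte values; the
encode branch is only there so the text round-trips back to the original
bytes.
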
>>>>
>>>>
>>>>> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no
>>>>> longer matters what the byte is in 8859-1.
>>>>
>>>>
>>>> I feel like you skipped a step. The byte is 0x81 full stop. I agree
>>>> that it doesn't matter how it's defined in latin1 (also it's not defined in
>>>> latin1).
>>>> The section of the Unicode standard that says control codes are equal
>>>> to their Unicode characters doesn't mention latin1. Should it?
>>>> I was under the impression that it meant any single-byte encoding,
>>>> since it goes out of its way to talk about "8-bit" control codes.
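
As a concrete check on that reading, here is a small Python sketch, assuming
the stock cp1252 codec (which follows the Unicode mapping file and leaves
five C1-range bytes with no mapping at all):

    # Which bytes in the C1 range (0x80-0x9F) does cp1252 leave undefined?
    undefined = [hex(b) for b in range(0x80, 0xA0)
                 if bytes([b]).decode('cp1252', errors='replace') == '\ufffd']
    print(undefined)  # expected: ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

Those five bytes are exactly the ones at issue here; every other C1-range
byte is assigned a graphic character by cp1252.
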
>>>>
>>>
>>>
>>
>
Received on Sun Nov 18 2012 - 23:54:36 CST
