RE: cp1252 decoder implementation from Shawn Steele on 2012-11-17 (Unicode Mail List Archive)

From: Shawn Steele <Shawn.Steele_at_microsoft.com>
Date: Sat, 17 Nov 2012 18:52:50 +0000

IMO this isn't worth the effort being spent on it. MOST encodings have all sorts of interesting quirks, variations, OEM- or app-specific behavior, etc. These are a few code points that haven't really caused much confusion, and other code pages are much more confusing (the CJK ones in particular).

I'd be much happier spending effort on getting apps to UTF-8 than trying to resolve esoteric quirks of legacy encodings. Even if you get that CP perfect, someone's gonna enter any of a bajillion characters on that page's HTML 5 web form that'll turn into ? at best.

-Shawn

From: unicode-bounce_at_unicode.org [mailto:unicode-bounce_at_unicode.org] On Behalf Of Buck Golemon
Sent: Saturday, November 17, 2012 8:35 AM
To: verdy_p_at_wanadoo.fr
Cc: Doug Ewell; unicode
Subject: Re: cp1252 decoder implementation

> So don't say that there are one-for-one equivalences.

I was just quoting this section of the standard: http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf

> There is a simple, one-to-one mapping between 7-bit (and 8-bit) control codes and the Unicode control codes: every 7-bit (or 8-bit) control code is numerically equal to its corresponding Unicode code point.

A one-to-one equivalency between bytes and unicode-points is exactly what is specified here, limited to the domain of "8-bit control codes".
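As a rough illustration of the quoted sentence (a sketch only, not something posted in the thread): for the C0, DEL, and C1 control ranges, the byte value and the Unicode code point are the same number, which is also what Python's built-in `latin-1` codec produces. The names `CONTROL_BYTES` and `control_code_point` are made up for this example.

```python
# Byte values that ISO/IEC 6429 treats as 7-bit or 8-bit control codes:
# C0 (0x00-0x1F), DEL (0x7F), and C1 (0x80-0x9F).
CONTROL_BYTES = [*range(0x00, 0x20), 0x7F, *range(0x80, 0xA0)]


def control_code_point(byte_value: int) -> str:
    """Return the Unicode control character numerically equal to byte_value.

    Hypothetical helper: it only covers the control ranges, which is the
    domain the quoted sentence is limited to.
    """
    if byte_value not in CONTROL_BYTES:
        raise ValueError(f"0x{byte_value:02X} is not a control code")
    return chr(byte_value)


# Python's latin-1 codec agrees with this mapping for every control byte.
identity_holds = all(
    bytes([b]).decode("latin-1") == control_code_point(b) for b in CONTROL_BYTES
)
```

This says nothing about which single-byte encodings the sentence is meant to apply to; that is the question under discussion.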

On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
If you are thinking about "byte values" you are working at the encoding scheme level (in fact another lower level which defines a protocol presentation layer, e.g. "transport syntaxes" in MIME). Unicode codepoints are conceptually not an encoding scheme, just a coded character set (independent of the encoding scheme).

Separate the levels of abstraction and you'll be much better off. Forget the apparent homonymies that exist between distinct layers of abstraction and use each standard for what it is designed for (including the Unicode "character/glyph model", which does not define an encoding scheme).

So don't say that there are one-for-one equivalences. This is wrong: an adaptation layer must exist between abstraction levels and between separate standards, but the Unicode standard does not specify them completely (with the only exception of the standard UTF encoding schemes, which are just one possible adaptation across some abstraction levels, and are not made to adapt alone to standards other than what is in the Unicode standard itself).

2012/11/17 Buck Golemon <buck_at_yelp.com>
On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:
Buck Golemon wrote:
Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
to map it to the equally-non-semantic U+81 ?

This would allow systems that follow the html5 standard and use cp1252
in place of latin1 to continue to be binary-faithful and reversible.

This isn't quite as black-and-white as the question about Latin-1. If you are targeting HTML5, you are probably safe in treating an incoming 0x81 (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
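For concreteness, the three outcomes mentioned here can be reproduced with Python's stock codecs. This is only a sketch of how one implementation behaves, not a claim about what HTML5 or any browser does; the variable names are illustrative.

```python
raw = b"\x81"  # a byte that Python's cp1252 codec leaves unassigned

# Option 1: treat it as the same-numbered C1 control, U+0081.
# Decoding as latin-1 gives exactly that result.
as_control = raw.decode("latin-1")

# Option 2: substitute U+FFFD REPLACEMENT CHARACTER.
as_replacement = raw.decode("cp1252", errors="replace")

# Option 3: fail outright, which is the default "strict" behavior.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

Only the first option keeps enough information to recover the original byte afterwards.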

Why do you make this conditional on targeting html5?

To me, replacement and error are both out, because either one means the system loses data or completely fails where it used to succeed.
Currently there's no reasonable way for me to implement the U+0081 option other than inventing a new "cp1252+latin1" codec, which seems undesirable.
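One way such a hybrid could look in Python is sketched below, assuming the goal is a byte-for-byte reversible decode. The handler name `c1-passthrough` and both function names are invented for this example; nothing here is an existing codec or a standard API beyond `codecs.register_error` and the built-in `cp1252` codec.

```python
import codecs

# The five byte values Python's cp1252 codec leaves unassigned.
UNASSIGNED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def _c1_passthrough(exc):
    """Decode error handler: map each undecodable byte to the same-numbered code point."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(b) for b in bad), exc.end
    raise exc


codecs.register_error("c1-passthrough", _c1_passthrough)


def decode_cp1252_lossless(data: bytes) -> str:
    """Decode as cp1252, falling back to U+0081 etc. for unassigned bytes."""
    return data.decode("cp1252", errors="c1-passthrough")


def encode_cp1252_lossless(text: str) -> bytes:
    """Inverse of decode_cp1252_lossless."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Only the five passthrough controls are accepted here; allowing
            # all of U+0080..U+009F would collide with assigned bytes such as
            # 0x80, which already means U+20AC.
            if ord(ch) in UNASSIGNED:
                out.append(ord(ch))
            else:
                raise
    return bytes(out)
```

With this sketch, `encode_cp1252_lossless(decode_cp1252_lossless(data)) == data` holds for every possible byte string, which is the "binary-faithful and reversible" property asked for above. Whether this mapping is sanctioned by any standard is exactly what the thread is debating.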

HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer matters what the byte is in 8859-1.

I feel like you skipped a step. The byte is 0x81 full stop. I agree that it doesn't matter how it's defined in latin1 (also it's not defined in latin1).
The section of the unicode standard that says control codes are equal to their unicode characters doesn't mention latin1. Should it?
I was under the impression that it meant any single-byte encoding, since it goes out of its way to talk about "8-bit" control codes.
Received on Sat Nov 17 2012 - 12:57:52 CST
