Re: data for cp1252

From: Buck Golemon <buck_at_yelp.com>
Date: Fri, 7 Dec 2012 17:48:12 -0800

> If you already have existing data in 1252 or a variation (and can’t tell
> them apart), then nothing’s gained by making NEW requirements for 1252
> which the old data won’t conform to.

Old latin1 documents can contain 0x81 and still be valid.
All browsers decode latin1 documents with cp1252.
In all cases, such a document would decode with a U+0081 character, with no
error.
It's impossible to implement such behavior with a unicode.org-compliant
decoder.
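
To make the difference concrete, here is a minimal sketch in Python. It assumes that Python's built-in "cp1252" codec follows the unicode.org mapping table (which leaves 0x81 undefined) and uses "latin-1" to stand in for what a latin1 reader would expect. The sample bytes are made up for illustration.

```python
# Hypothetical sample: "café " followed by the byte 0x81.
data = b"caf\xe9 \x81"

# latin-1 maps every byte to the code point of the same value,
# so 0x81 becomes U+0081 and nothing is lost.
as_latin1 = data.decode("latin-1")

# The stock cp1252 codec has no mapping for 0x81 and refuses the input.
try:
    data.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

print(repr(as_latin1), strict_ok)
```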

The added value lies in the fact that compliant decoders will be able to
decode such documents without error or loss of data, as all versions of
MSIE do.
This added value is admittedly small, as such odd documents will be very
rare and of questionable value, but the value is not nothing.
It removes an edge case from all code paths which must deal with these
legacy encodings: no byte value is an error.
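
A sketch of what "no byte is an error" could look like in Python follows. The function name decode_cp1252_lenient is hypothetical and not part of any library. It assumes the browser-style rule described above: use the cp1252 mapping where one exists, and otherwise map the byte to the code point of the same value.

```python
# Build a 256-entry table once: cp1252 where defined, otherwise the
# code point with the same value as the byte (assumed browser behavior).
_TABLE = []
for _b in range(256):
    try:
        _TABLE.append(bytes([_b]).decode("cp1252"))
    except UnicodeDecodeError:
        _TABLE.append(chr(_b))


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode bytes with a total cp1252-style mapping; never raises."""
    return "".join(_TABLE[b] for b in data)


# 0x80 is the euro sign in cp1252; 0x81 falls through to U+0081.
print(repr(decode_cp1252_lenient(b"\x80\x81")))
```

Because the table covers all 256 byte values, callers need no error-handling branch for this decoder.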
It's also correct. *All* browsers have this behavior. The W3C has found
this behavior to be correct. Opera at one point in time implemented
the current unicode.org cp1252 spec, but was forced to change to the W3C
spec by real-world requirements.

Is there *any* cp1252 decoder that you find to be canonical and that
implements the specified behavior?

It feels like you're more interested in maintaining the status quo than in
correctness.
If this standard is unmaintained, please mark it as such, and we can all
move forward with the W3C spec.

On Fri, Dec 7, 2012 at 5:01 PM, Shawn Steele <Shawn.Steele_at_microsoft.com> wrote:

> > In contrast, bringing the cp1252 definition into line with real
> > implementations and recommending UTF-8 for new developments are *not*
> > mutually exclusive.
>
> Exactly?
>
> If you already have existing data in 1252 or a variation (and can’t tell
> them apart), then nothing’s gained by making NEW requirements for 1252
> which the old data won’t conform to. Changing standards or behavior will
> only break things that already work.
>
> If you’re creating new data, it should be using UTF-8 to avoid these kinds
> of ambiguity.
>
> -Shawn
>
> On Fri, Dec 7, 2012 at 4:41 PM, Shawn Steele <Shawn.Steele_at_microsoft.com>
> wrote:
>
> It’s a variation. The undefined codepoints in 1252 probably shouldn’t
> be used, and I can’t imagine that adding a code page helps anything, nor
> that changing an existing behavior helps anything. People really should be
> using UTF-8.
>
> -Shawn
>
> *From:* Buck Golemon [mailto:buck_at_yelp.com]
> *Sent:* Friday, December 7, 2012 4:34 PM
> *To:* Shawn Steele
> *Cc:* unicode
> *Subject:* Re: data for cp1252
>
> I've been told that bestfit1252 wasn't meant to redefine the cp1252
> mapping, although its first line declares "CODEPAGE 1252".
>
> Is it a separate encoding or not?
>
> If so, I'll submit a new "bestfit1252" to the python stdlib.
>
> If not, I believe the cp1252 mapping needs to be brought into line.
>
> On Fri, Dec 7, 2012 at 4:27 PM, Shawn Steele <Shawn.Steele_at_microsoft.com>
> wrote:
>
> :-)
>
Received on Fri Dec 07 2012 - 19:49:48 CST
