RE: cp1252 decoder implementation

From: Shawn Steele
Date: Mon, 19 Nov 2012 06:10:14 +0000

> What effort has been spent? This is not an either/or type of proposition.
> If we can agree that it's an improvement (albeit small), let's update the mapping.
> Is it much harder than I believe it is?

What if some application is treating it as undefined, and the code page then gets updated to say that it's a real mapping? Then someone uses the code point and causes the application to break, and then they point to the updated standard and say the application isn't compliant.

> Internally the app is fully utf8, but must accept (poorly encoded) input from all over the web.

IMO, it's better to get that poorly encoded input to be correctly encoded.

A) If it really means CP 1252, it shouldn't really be using these code points, so defining these differently doesn't really solve anything.

B) If the input isn't working right because of this, then something's wrong with the input, so they need to fix that.

I don't think it's worth the app developer's time, or this list's time, trying to fix something that's such a severe edge case.

> cp1252 is one of the two encodings that a browser *must* implement, according to the html5 spec, so this is a very important encoding, second only to utf8.

If HTML 5 requires it because it's so common, then changing the definition of the behavior doesn't seem like a great idea.

> My essential point is that the latin1 mapping file specifies an encoding that will succeed with arbitrary binary input.

Ah, but this is all about text, not arbitrary binary input. Those 5 code points provide no value for text. They aren't used, shouldn't be used, and aren't very useful even if they were used. Expecting binary input to conform to a text encoding isn't a good idea.
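For anyone who wants to see which bytes are at issue, here is a minimal sketch in Python. It assumes Python's built-in "cp1252" codec follows the published mapping and leaves the unassigned bytes undefined, and that its "latin-1" codec maps every byte value; the names undefined_bytes and UNDEFINED_IN_CP1252 are just illustrative.

```python
def undefined_bytes(encoding: str) -> list[int]:
    """Return every single-byte value that the given codec refuses to decode."""
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


# Assumption: Python's codec tables match the published mapping files.
UNDEFINED_IN_CP1252 = undefined_bytes("cp1252")
UNDEFINED_IN_LATIN1 = undefined_bytes("latin-1")

print([hex(b) for b in UNDEFINED_IN_CP1252])  # the bytes with no cp1252 mapping
print(UNDEFINED_IN_LATIN1)                    # empty if latin-1 accepts every byte
```

Under those assumptions the first list holds the five unassigned bytes and the second is empty, which is the difference between the two mappings being argued about here.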

By that logic, one would expect UTF-8 to accept arbitrary binary input. However, the byte sequence 0x80 0x80 needs to fail according to the standards, so even UTF-8 can't accept arbitrary binary input. If you need to transmit binary data, then send it in some non-text or appropriately encoded form.
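The UTF-8 point is easy to check. The sketch below is only an illustration using Python's strict "utf-8" decoder; is_valid_utf8 is a made-up helper name, and base64 is just one example of an "appropriately encoded form" for binary data.

```python
import base64


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")  # strict is the default error handler
        return True
    except UnicodeDecodeError:
        return False


binary_blob = b"\x80\x80"  # two continuation bytes with no lead byte

print(is_valid_utf8(binary_blob))            # False: not well-formed UTF-8
print(is_valid_utf8("é".encode("utf-8")))    # True: a lead byte plus one continuation byte

# One way to carry binary data through a text channel: encode it as base64 first.
as_text = base64.b64encode(binary_blob).decode("ascii")
print(is_valid_utf8(as_text.encode("ascii")))        # True
print(base64.b64decode(as_text) == binary_blob)      # True: the round trip is lossless
```

The same bytes that fail as text pass once they are wrapped in a text-safe encoding, which is the distinction between text and arbitrary binary input.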

Received on Mon Nov 19 2012 - 00:12:53 CST
