Re: cp1252 decoder implementation

From: Doug Ewell <doug_at_ewellic.org>
Date: Fri, 16 Nov 2012 21:57:48 -0700

Buck Golemon wrote:

>> This isn't quite as black-and-white as the question about Latin-1. If
>> you are targeting HTML5, you are probably safe in treating an
>> incoming 0x81 (for example) as either U+0081 or U+FFFD, or throwing
>> some kind of error.
>
> Why do you make this conditional on targeting html5?

Because WHATWG has seen fit to redefine "ISO-8859-1" as an alias for
"Windows-1252", and to create its own mapping tables and rules for
decoding, superseding all existing tables and documents created over the
years by vendors and SDOs:

http://dvcs.w3.org/hg/encoding/raw-file/tip/Overview.html

If you are targeting HTML5, you will probably be considered
nonconformant if you don't follow this document and associated tables.

If you are not targeting HTML5, then use the tables for ISO 8859-1 or
CP1252 (as appropriate) from the Unicode Standard.
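
To make the choice concrete, here is a minimal Python sketch of the
options for an incoming 0x81. It assumes Python's built-in "latin-1" and
"cp1252" codecs follow the Unicode mapping tables. The function names are
hypothetical, and decode_whatwg_1252 only approximates the WHATWG rule as
I understand it (the five bytes CP1252 leaves unassigned pass through as
the C1 controls with the same number); it is not taken from the WHATWG
document.

```python
# The five bytes that the Unicode CP1252 table leaves unassigned.
UNASSIGNED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_strict_1252(data: bytes) -> str:
    """Unicode-table behaviour: an unassigned byte raises UnicodeDecodeError."""
    return data.decode("cp1252")


def decode_replace_1252(data: bytes) -> str:
    """Unicode-table behaviour, but unassigned bytes become U+FFFD."""
    return data.decode("cp1252", errors="replace")


def decode_whatwg_1252(data: bytes) -> str:
    """Approximation of the WHATWG rule (an assumption, see above):
    unassigned bytes map to the code point with the same number."""
    out = []
    for byte in data:
        if byte in UNASSIGNED_CP1252:
            out.append(chr(byte))  # e.g. 0x81 -> U+0081
        else:
            out.append(bytes([byte]).decode("cp1252"))
    return "".join(out)


if __name__ == "__main__":
    sample = b"\x80\x81"  # Euro sign in CP1252, then the unassigned 0x81
    print(repr(sample.decode("latin-1")))      # ISO 8859-1 table
    print(repr(decode_replace_1252(sample)))   # CP1252 table, U+FFFD
    print(repr(decode_whatwg_1252(sample)))    # WHATWG-style pass-through
```

Which of the three a decoder should use depends on the target, as
described above; the sketch only shows that they differ.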

>> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no
>> longer matters what the byte is in 8859-1.
>
> I feel like you skipped a step. The byte is 0x81 full stop. I agree
> that it doesn't matter how it's defined in latin1 (also it's not
> defined in latin1).

Are you concerned about the mapping between Latin-1 and Unicode, or
about the control semantic of the character? The former is defined by
Unicode; the latter is defined by ISO 6429.

> The section of the unicode standard that says control codes are equal
> to their unicode characters doesn't mention latin1. Should it?

It applies to all ISO 8859-x parts.
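
One rough way to check this for a given decoder, sketched in Python: the
bytes 0x80..0x9F should come back as U+0080..U+009F. The helper name
c1_is_identity is hypothetical, and the sketch assumes Python's codec
tables were built from the Unicode mapping files.

```python
# Bytes 0x80..0x9F, the C1 control range in the ISO 8859 parts.
C1_BYTES = bytes(range(0x80, 0xA0))


def c1_is_identity(codec_name: str) -> bool:
    """Return True if the codec maps each C1 byte to the code point
    with the same number, False if it maps any of them elsewhere
    or cannot decode one of them."""
    try:
        decoded = C1_BYTES.decode(codec_name)
    except UnicodeDecodeError:
        return False
    return decoded == "".join(chr(b) for b in C1_BYTES)


if __name__ == "__main__":
    for name in ("latin-1", "iso8859_15", "cp1252"):
        print(name, c1_is_identity(name))
```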

--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell
Received on Fri Nov 16 2012 - 23:02:27 CST