Re: cp1252 decoder implementation

From: Buck Golemon <buck_at_yelp.com>
Date: Sun, 18 Nov 2012 21:49:41 -0800

On Sat, Nov 17, 2012 at 10:52 AM, Shawn Steele
<Shawn.Steele_at_microsoft.com> wrote:

> IMO this isn’t worth the effort being spent on it. MOST encodings have
> all sorts of interesting quirks, variations, OEM or App specific behavior,
> etc. These are a few code points that haven’t really caused much
> confusion, and other code pages are much more confusing (like the CJK ones
> in particular).
>

What effort has been spent? This is not an either/or type of proposition.
If we can agree that it's an improvement (albeit small), let's update the
mapping.
Is it much harder than I believe it is?

> I’d be much happier spending effort on getting apps to UTF-8 than trying
> to resolve esoteric quirks of legacy encodings.
>

This is not an app question but an infrastructure question. Internally the
app is fully utf8, but it must accept (poorly encoded) input from all over
the web. cp1252 is the right thing to use for those inputs, but (as
currently specified) it is not *guaranteed* to succeed (given that we're
already talking about questionable input) the way the old latin1 is.
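To make the difference concrete, here is a minimal sketch in Python
(assuming CPython's stock codecs, which follow the current cp1252
mapping): latin1 decodes and round-trips arbitrary bytes, while cp1252
raises on the bytes it leaves undefined.

    data = bytes(range(256))  # arbitrary binary input

    # latin1 maps every byte 0x00-0xFF to U+0000-U+00FF, so decoding
    # always succeeds and round-trips exactly.
    assert data.decode('latin1').encode('latin1') == data

    # cp1252 leaves five bytes undefined, so decoding arbitrary input
    # can fail outright.
    try:
        data.decode('cp1252')
    except UnicodeDecodeError as err:
        print(err)  # 'charmap' codec can't decode byte 0x81 ...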

> Even if you get that CP perfect, someone’s gonna enter any of a bajillion
> characters on that page’s HTML 5 web form that’ll turn into ? at best.
>

Note that the inputs in question are chiefly URLs and POSTs from naively
coded API clients and crawlers (cough: bingbot). The value of "bajillion"
is then only 256, but I find this to be off-topic.

cp1252 is one of the two encodings that a browser *must* implement,
according to the html5 spec, so this is a very important encoding, second
only to utf8. This is not *yet* a legacy encoding, given the current state
of the web.

My essential point is that the latin1 mapping file specifies an encoding
that will succeed with arbitrary binary input.
If cp1252 is to be used as a replacement, it is desirable that it have this
same property.
This only necessitates defining the five unassigned bytes (0x81, 0x8D,
0x8F, 0x90, 0x9D) as control codes, as the w3c spec and the "bestfit"
mapping already do.
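Short of a new mapping file, this behavior can be approximated today with
a custom error handler. A minimal sketch in Python (Python 3 syntax; the
handler name 'cp1252_c1' is my own placeholder, not a registered
standard):

    import codecs

    def c1_fallback(exc):
        # Decode: map each undecodable byte (0x81, 0x8D, 0x8F, 0x90,
        # 0x9D) to the numerically equal code point, as latin1 and the
        # w3c/"bestfit" mappings do.
        if isinstance(exc, UnicodeDecodeError):
            bad = exc.object[exc.start:exc.end]
            return ''.join(chr(b) for b in bad), exc.end
        # Encode: map those C1 control characters back to their bytes,
        # so the round trip restores the original input.
        if isinstance(exc, UnicodeEncodeError):
            bad = exc.object[exc.start:exc.end]
            return bytes(ord(c) for c in bad), exc.end
        raise exc

    codecs.register_error('cp1252_c1', c1_fallback)

    # Decoding is now total and reversible over arbitrary bytes:
    data = bytes(range(256))
    text = data.decode('cp1252', errors='cp1252_c1')
    assert text.encode('cp1252', errors='cp1252_c1') == data

This is essentially the "cp1252+latin1" codec mentioned below, minus the
need to define a whole new encoding.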

> There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> codes and the Unicode control codes: every 7-bit (or 8-bit) control code is
> numerically equal to its corresponding Unicode code point.
>
> A one-to-one equivalency between bytes and Unicode code points is exactly
> what is specified here, limited to the domain of "8-bit control codes".
>
> On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <verdy_p_at_wanadoo.fr>
> wrote:
>
> If you are thinking about "byte values" you are working at the encoding
> scheme level (in fact another, lower level which defines a protocol
> presentation layer, e.g. "transport syntaxes" in MIME). Unicode code points
> are conceptually not an encoding scheme, just a coded character set
> (independent of the encoding scheme).
>
> Separate the levels of abstraction and you'll be much better off. Forget
> the apparent homonymies that exist between distinct layers of abstraction
> and use each standard for what it is designed for (including the Unicode
> "character/glyph model", which does not define an encoding scheme).
>
> So don't say that there are one-for-one equivalences. This is wrong: an
> adaptation layer must exist between abstraction levels and between separate
> standards, but the Unicode standard does not specify these adaptations
> completely (the only exception being the standard UTF encoding schemes,
> which are just one possible adaptation across some abstraction levels, and
> are not made to adapt alone to standards other than the Unicode standard
> itself).
>
> 2012/11/17 Buck Golemon <buck_at_yelp.com>
>
> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:
>
> Buck Golemon wrote:
>
> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
> to map it to the equally-non-semantic U+0081?
>
> This would allow systems that follow the html5 standard and use cp1252
> in place of latin1 to continue to be binary-faithful and reversible.
>
>
> This isn't quite as black-and-white as the question about Latin-1. If you
> are targeting HTML5, you are probably safe in treating an incoming 0x81
> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
>
> Why do you make this conditional on targeting html5?
>
> To me, replacement and error are out because they mean the system loses
> data or completely fails where it used to succeed.
>
> Currently there's no reasonable way for me to implement the U+0081 option
> other than inventing a new "cp1252+latin1" codec, which seems undesirable.
>
> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no longer
> matters what the byte is in 8859-1.
>
> I feel like you skipped a step. The byte is 0x81, full stop. I agree that
> it doesn't matter how it's defined in latin1 (also, it's not defined in
> latin1).
>
> The section of the Unicode standard that says control codes are equal to
> their Unicode characters doesn't mention latin1. Should it?
>
> I was under the impression that it meant any single-byte encoding, since
> it goes out of its way to talk about "8-bit" control codes.
>
Received on Sun Nov 18 2012 - 23:54:37 CST
