Re: Re: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice" from Buck Golemon on 2014-01-29 (Unicode Mail List Archive)

From: Buck Golemon <buck_at_yelp.com>
Date: Wed, 29 Jan 2014 10:32:05 -0800

Jörg:

In case you want to see the previous discussions on the subject, here they
are:

* "data for cp1252"
http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0233.html
* "cp1252 decoder implementation"
http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0167.html
* tangential "latin1 decoder implementation"
http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0146.html

On Wed, Jan 29, 2014 at 10:21 AM, Buck Golemon <buck_at_yelp.com> wrote:

> Jörg:
>
> This is the definition of cp1252 used by the whatwg and all current
> browser implementations.
> I've appealed to the cp1252 maintainer to update the definition so that we
> don't have two competing standards, but I was rejected.
> I've been considering naming it cp1252-whatwg.
>
>
> On Wed, Jan 29, 2014 at 6:59 AM, "Jörg Knappen" <jknappen_at_web.de> wrote:
>
>> A little postscript to this old thread:
>>
>> On PyPI, there is now a codec available that handles the peculiar
>> definition of "latin1" inside MySQL.
>> The package is called mysql-latin1-codec and features an encoding
>> consisting of cp1252 plus
>> 0x81, 0x8D, 0x8F, 0x90, 0x9D (these five bytes are undefined in
>> the Python codec for cp1252).
>>
>> https://pypi.python.org/pypi/mysql-latin1-codec/1.0
>>
>> --Jörg Knappen
>>
>> *Sent:* Wednesday, 30 October 2013, 19:14
>> *From:* "Buck Golemon" <buck_at_yelp.com>
>> *To:* "Frédéric Grosshans" <frederic.grosshans_at_gmail.com>
>> *Cc:* "Jörg Knappen" <jknappen_at_web.de>, unicode <unicode_at_unicode.org>
>> *Subject:* Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8
>> twice"
>>
>>
>> On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans <
>> frederic.grosshans_at_gmail.com> wrote:
>>>
>>> On 30/10/2013 17:32, "Jörg Knappen" wrote:
>>>
>>>>
>>>> The data did not only contain latin-1 type mangling for the
>>>> non-existent Windows characters, but also sequences with the raw
>>>> C1 control characters for all of latin-1. So I had to do them, too.
>>>> The data weren't consistent at all, not even in their errors.
>>>> --Jörg Knappen
>>>
>>> Your question helped me dust off and repair a non-working Python
>>> snippet I wrote for a similar problem. I was stuck on the mixing of
>>> windows-1252 and latin1 controls (in data involving Chinese characters).
>>> I include it below for reference.
>>>
>>> The Python snippet below does not need sed; it defines a function
>>> (unscramble(S)) which works on strings. The extension to files should be
>>> easy.
>>>
>>> Frédéric Grosshans
>>>
>>>
>>> def Step1Filter(S):
>>>     for c in S:
>>>         # Works character by character because of the cp1252/latin1 ambiguity
>>>         try:
>>>             yield c.encode('cp1252')
>>>         except UnicodeEncodeError:
>>>             # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
>>>             yield c.encode('latin1')
>>>
>>> def unscramble(S):
>>>     return b''.join(Step1Filter(S)).decode('utf8')
>>>
>>> PS: If anyone is interested in a licence, I consider this simple enough
>>> to be in the public domain and uncopyrightable.
>>>
>>
>> This encoding you've implemented above is known as windows-1252 by the
>> whatwg and all browsers [1][2].
>> The implementation of cp1252 in python is instead a direct consequence of
>> the unicode.org definition [3].
>>
>> [1] http://encoding.spec.whatwg.org/index-windows-1252.txt
>> [2] http://bukzor.github.io/encodings/cp1252.html
>> [3]
>> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>>
>
>
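
To see the difference between the two definitions concretely, here is a
small sketch (Python 3 assumed) that asks a codec which byte values it
refuses to decode. The function name undefined_bytes is my own, not
part of any library.

```python
def undefined_bytes(codec):
    """Return the byte values that the named codec cannot decode."""
    missing = []
    for value in range(0x100):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


if __name__ == "__main__":
    # Python's cp1252 follows the unicode.org mapping and leaves gaps.
    print([hex(v) for v in undefined_bytes("cp1252")])
    # latin1 maps every byte, so nothing is reported.
    print([hex(v) for v in undefined_bytes("latin1")])
```

In the WHATWG index, those same positions map to the same-numbered C1
control characters instead of being left undefined.
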

Received on Wed Jan 29 2014 - 12:33:02 CST