Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8 twice" from Buck Golemon on 2013-10-30 (Unicode Mail List Archive)

From: Buck Golemon <buck_at_yelp.com>
Date: Wed, 30 Oct 2013 11:14:20 -0700

On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans <
frederic.grosshans_at_gmail.com> wrote:

> On 30/10/2013 17:32, "Jörg Knappen" wrote:
>
>> The data did not only contain latin-1 type mangling for the non-existent
>> Windows characters, but also sequences with the raw
>> C1 control characters for all of latin-1. So I had to do them, too.
>> The data weren't consistent at all, not even in their errors.
>> --Jörg Knappen
>>
> Your question helped me dust off and repair a non-working Python snippet I
> wrote for a similar problem. I was stuck on the mixing of windows-1252
> and latin1 controls (linked with Chinese characters). I include it below
> for reference.
>
> The Python snippet below does not need sed; it defines a function,
> unscramble(S), which works on strings. Extending it to files should be
> easy.
>
> Frédéric Grosshans
>
>
> def Step1Filter(S):
>     for c in S:
>         # Works character by character because of the cp1252/latin1 ambiguity
>         try:
>             yield c.encode('cp1252')
>         except UnicodeEncodeError:
>             # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
>             yield c.encode('latin1')
>
> def unscramble(S):
>     return b''.join(Step1Filter(S)).decode('utf8')
>
> PS: If anyone is interested in a licence, I consider this simple enough to
> be in the public domain and uncopyrightable.
>
>
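For reference, a minimal sketch of the round trip the quoted function is
meant to undo, assuming Python 3. The lowercase function names and the sample
strings are illustrative assumptions, not taken from the original data. Text
is encoded as UTF-8, the bytes are misread as cp1252 (or as latin1 where
cp1252 is undefined), and unscramble() reverses that misreading.

```python
def step1_filter(s):
    # Same logic as the quoted Step1Filter: try cp1252 first, then latin1.
    for c in s:
        try:
            yield c.encode('cp1252')
        except UnicodeEncodeError:
            yield c.encode('latin1')

def unscramble(s):
    return b''.join(step1_filter(s)).decode('utf8')

# U+00E9 is C3 A9 in UTF-8; both bytes are defined in cp1252.
mangled_e = '\u00e9'.encode('utf8').decode('cp1252')

# U+20AC is E2 82 AC in UTF-8; byte 82 maps to U+201A in cp1252 only.
mangled_euro = '\u20ac'.encode('utf8').decode('cp1252')

# U+00C1 is C3 81 in UTF-8; byte 81 is undefined in Python's cp1252,
# so the mangled form carries the raw C1 control U+0081 instead.
mangled_a = '\u00c3\u0081'

for mangled in (mangled_e, mangled_euro, mangled_a):
    print(repr(mangled), '->', repr(unscramble(mangled)))
```

A character that is in neither cp1252 nor latin1 still raises
UnicodeEncodeError, so the sketch only handles text that is mangled throughout.
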
The encoding you've implemented above, cp1252 with a latin1 fallback for
the five undefined bytes, is what the WHATWG and all browsers call
windows-1252 [1][2].
Python's cp1252 codec is instead a direct consequence of the unicode.org
definition [3], which leaves those bytes undefined.
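
To make the difference concrete, here is a hedged sketch, assuming Python 3.
The helper name decode_whatwg_1252 is hypothetical and is not part of Python's
standard library or of either specification. It decodes byte by byte with
Python's cp1252 codec and, for the five bytes that codec rejects, falls back
to the C1 control character with the same code point, as the WHATWG index
does.

```python
# The five bytes that unicode.org's CP1252.TXT leaves undefined.
UNDEFINED_IN_CP1252 = (0x81, 0x8D, 0x8F, 0x90, 0x9D)

def decode_whatwg_1252(data):
    """Decode bytes the way the WHATWG windows-1252 index maps them
    (hypothetical helper, for illustration only)."""
    chars = []
    for byte in data:
        try:
            chars.append(bytes([byte]).decode('cp1252'))
        except UnicodeDecodeError:
            # Fall back to the C1 control with the same code point.
            chars.append(chr(byte))
    return ''.join(chars)

for byte in UNDEFINED_IN_CP1252:
    try:
        bytes([byte]).decode('cp1252')
        strict = 'decodes'
    except UnicodeDecodeError:
        strict = 'raises UnicodeDecodeError'
    fallback = ord(decode_whatwg_1252(bytes([byte])))
    print('0x%02X: cp1252 %s, fallback gives U+%04X' % (byte, strict, fallback))
```
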

[1] http://encoding.spec.whatwg.org/index-windows-1252.txt
[2] http://bukzor.github.io/encodings/cp1252.html
[3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
Received on Wed Oct 30 2013 - 13:16:51 CDT
