Aw: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

From: Jörg Knappen <>
Date: Wed, 30 Oct 2013 16:13:53 +0100 (CET)
Thanks again!
My updated sed pattern generator now looks like:
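# Python 2: write a sed script that undoes "UTF-8 twice" for U+00A0..U+016F
r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
  # UTF-8 bytes misread as latin-1, then encoded as UTF-8 again
  pat1 = "s/"+unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") +"/g"
  print >>file, pat1
  try:
    # same, but misread as windows-1252
    pat2 = "s/"+unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") +"/g"
  except UnicodeDecodeError:
    # windows-1252 leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined
    pat2 = pat1
  if (pat1 != pat2):
    print >>file, pat2
file.close()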
It now handles double UTF-8 mangled through either Latin-1 or Windows-1252. This is probably enough for now: the remaining
error rate is low enough for practical purposes (i.e., lower than the natural error rate introduced by typos).
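For anyone on Python 3, a rough equivalent of the generator above might look like the sketch below. The function name `sed_patterns` and its default range are my own choices for illustration, and I am assuming the sed script is read as UTF-8.

```python
def sed_patterns(start=0xA0, stop=0x170):
    """Return sed substitution lines that undo double UTF-8 encoding."""
    lines = []
    for cp in range(start, stop):
        good = chr(cp)
        raw = good.encode("utf-8")
        # UTF-8 bytes misread as Latin-1 (every byte value is defined there)
        pat1 = "s/" + raw.decode("latin-1") + "/" + good + "/g"
        lines.append(pat1)
        try:
            # UTF-8 bytes misread as Windows-1252
            pat2 = "s/" + raw.decode("windows-1252") + "/" + good + "/g"
        except UnicodeDecodeError:
            # Windows-1252 has no mapping for 0x81, 0x8D, 0x8F, 0x90, 0x9D
            pat2 = pat1
        if pat1 != pat2:
            lines.append(pat2)
    return lines
```

Writing `"\n".join(sed_patterns())` to a UTF-8 file such as fixu8.sed should give the same kind of script as the Python 2 version.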
--Jörg Knappen
Sent: Wednesday, 30 October 2013 at 15:34
From: "Frédéric Grosshans" <>
Subject: Re: Aw: Re: Re: Do you know a tool to decode "UTF-8 twice"
On 29/10/2013 17:15, "Jörg Knappen" wrote:
> After running this script, a few more things were left over:
> non-normalised accents and some really strange
> encodings whose meanings I could not really explain, only guess, like
> s/Ãœ/Ü/g
> s/Ã‰/É/g
> s/AÌ€/À/g
> s/aÌ€/à/g
> s/EÌ€/È/g
> s/eÌ€/è/g
> s/â€ž/„/g
> s/â€œ/“/g
> s/ÃŸ/ß/g
> s/â€™/’/g
> s/Ã„/Æ/g

It was probably not UTF-8 read as Latin-1 and re-encoded as UTF-8, but
UTF-8 read as Windows-1252 and re-encoded as UTF-8. Each
of the combinations above contains a character absent from Latin-1
(œ‰€žŸ™„), and some of these are present only in Windows-1252 (‰™„) and
not in ISO 8859-15 (Latin-9), the other possible mistake.
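The reasoning above can be sketched in Python 3. The helper names (`mangle`, `unmangle`, `encodable`) are made up for illustration, and the sketch relies on Python's own codec tables for windows-1252 and iso8859-15 rather than on any particular tool.

```python
def mangle(text, wrong="windows-1252"):
    """Encode as UTF-8, then misread the bytes in the wrong charset."""
    return text.encode("utf-8").decode(wrong)

def unmangle(text, wrong="windows-1252"):
    """Reverse mangle(): recover the bytes, then decode them as UTF-8."""
    return text.encode(wrong).decode("utf-8")

def encodable(ch, codec):
    """True if the character exists in the given 8-bit charset."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

suspects = "œ‰€žŸ™„"
# Characters that cannot have come from a Latin-1 misreading
not_latin1 = [c for c in suspects if not encodable(c, "latin-1")]
# Characters found in Windows-1252 but not in ISO 8859-15
only_cp1252 = [c for c in suspects
               if encodable(c, "windows-1252") and not encodable(c, "iso8859-15")]

for ch in "ÜÉß„“’":
    print(ch, "->", mangle(ch))
```

If the Windows-1252 guess is right, `unmangle` applied to the left-hand side of each pattern should give back the right-hand side.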

I've checked that this is consistent with Ü, É and ß, but not with your Æ.
This double encoding would give Ä:
Ã„ = Win1252(C3 84) = 110.00011 10.000100 = UTF8(00011 000100) = Unicode 00C4
= Ä (and not Æ)
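The bit arithmetic above can be checked with a few lines of Python 3. This is only a sketch: `decode_two_byte_utf8` is a hypothetical helper that handles just the two-byte form 110xxxxx 10yyyyyy.

```python
def decode_two_byte_utf8(b1, b2):
    """Decode a two-byte UTF-8 sequence 110xxxxx 10yyyyyy to a code point."""
    if b1 & 0b11100000 != 0b11000000:
        raise ValueError("not a two-byte lead byte")
    if b2 & 0b11000000 != 0b10000000:
        raise ValueError("not a continuation byte")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation
    return ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)

cp = decode_two_byte_utf8(0xC3, 0x84)
print(hex(cp), chr(cp))

# For comparison: how Æ (U+00C6) would look after the same mangling
mangled_ae = "\u00c6".encode("utf-8").decode("windows-1252")
print(mangled_ae)
```

By the same arithmetic, Æ is C3 86 in UTF-8, so its mangled form would end in the Windows-1252 character for 0x86 (†) rather than „.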


Received on Wed Oct 30 2013 - 10:15:49 CDT
