Re: Aw: Re: Re: Re: Do you know a tool to decode "UTF-8 twice" from Frédéric Grosshans on 2013-10-30 (Unicode Mail List Archive)

From: Frédéric Grosshans <frederic.grosshans_at_gmail.com>
Date: Wed, 30 Oct 2013 16:58:12 +0100

On 30/10/2013 16:13, "Jörg Knappen" wrote:
> Thanks again!
> My updated sed pattern generator now looks like:
> r = range(0xa0, 0x170)
> file = open("fixu8.sed", "w")
> for i in r:
>     pat1 = ("s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8")
>             + "/" + unichr(i).encode("utf-8") + "/g")
>     print >>file, pat1
>     try:
>         pat2 = ("s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8")
>                 + "/" + unichr(i).encode("utf-8") + "/g")
>     except:
>         pat2 = pat1
>     if (pat1 != pat2):
>         print >>file, pat2
> which handles double UTF-8 mangled via both latin-1 and windows-1252. This is
> probably enough for now; the error rate is low
> enough for practical purposes (i.e., lower than the natural error rate
> introduced by typing errors).
>
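For readers on Python 3, here is a minimal sketch of the same generator. It is an assumption-laden port, not the original script: the function name sed_rules and its parameters are made up for illustration, and it returns the rules as strings instead of writing fixu8.sed directly.

```python
def sed_rules(start=0xA0, stop=0x170):
    """Build sed substitutions that undo 'UTF-8 encoded twice' text.

    Hypothetical helper (not from the original post). For each code point,
    the UTF-8 bytes are misread as latin-1 and, where possible, as
    windows-1252; each distinct misreading yields one rule.
    """
    rules = []
    for cp in range(start, stop):
        ch = chr(cp)
        raw = ch.encode("utf-8")
        # latin-1 maps every byte, so this decode never fails.
        mangled = [raw.decode("latin-1")]
        try:
            alt = raw.decode("cp1252")
        except UnicodeDecodeError:
            # Byte is undefined in windows-1252; only the latin-1 rule applies.
            alt = None
        if alt is not None and alt != mangled[0]:
            mangled.append(alt)
        for m in mangled:
            rules.append("s/{}/{}/g".format(m, ch))
    return rules
```

Writing the result with open("fixu8.sed", "w", encoding="utf-8") should give a file comparable to the one produced by the Python 2 script, since encoding the mangled string as UTF-8 reproduces the double-encoded bytes.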
Why do you do both latin1 and windows-1252? Windows-1252 is supposed to
be a superset of latin1, so it should be enough. Or is there a problem
with the few undefined bytes of windows-1252 (81, 8D, 8F, 90, 9D)?
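
One way to check the question about undefined bytes is to ask the codec directly. The sketch below assumes Python 3 and its built-in "cp1252" codec (other implementations of windows-1252 may map these bytes differently); the function names are hypothetical. It lists the bytes the codec rejects and the code points in the 0xA0 to 0x16F range whose UTF-8 form contains one of them.

```python
def undefined_cp1252_bytes():
    """Return the byte values that Python's cp1252 codec refuses to decode."""
    bad = []
    for b in range(0x100):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            bad.append(b)
    return bad


def affected_code_points(start=0xA0, stop=0x170):
    """Code points whose UTF-8 bytes include a byte undefined in cp1252."""
    bad = set(undefined_cp1252_bytes())
    return [cp for cp in range(start, stop)
            if bad & set(chr(cp).encode("utf-8"))]
```

For those code points a windows-1252 round trip raises an error, so only a latin-1 based pattern can be generated for them.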

Frédéric
Received on Wed Oct 30 2013 - 11:00:06 CDT