Re: Re: Re: Do you know a tool to decode "UTF-8 twice" from Philippe Verdy on 2013-10-29 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Tue, 29 Oct 2013 20:00:36 +0100

2013/10/29 "Jörg Knappen" <jknappen_at_web.de>

> What really did the job for me was a generated sed script; for the
> generation I used essentially the
> following python snippet and selected the ranges I suspected to be in my
> data:
>
> file = open("fixutf8.sed", "w")
> r = range(0xa0, 0x176)
> for i in r:
>     print >>file, ("s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8")
>                    + "/" + unichr(i).encode("utf-8") + "/g")
>

you should also retry with:

    print >>file, ("s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8")
                   + "/" + unichr(i).encode("utf-8") + "/g")

but you'll need to catch exceptions from decode("windows-1252"), because some
byte values (generated by UTF-8 encoding, though only as trailing bytes, never
as leading bytes) are not allocated in windows-1252: 0x81, 0x8D, 0x8F, 0x90,
0x9D; you may also add 0xAD (valid SOFT HYPHEN in windows-1252, but suspect in
most cases).
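
The exception handling described above can be sketched as follows. This is a
Python 3 adaptation (the quoted snippet is Python 2), and the function names
undecodable_bytes and sed_rules are made up for this illustration. Rather than
hard-coding the unallocated byte values, the sketch asks the codec which single
bytes it rejects, and it skips any code point whose UTF-8 form cannot be
decoded.

```python
def undecodable_bytes(codec="windows-1252"):
    """Return the byte values in 0x80-0xFF that the codec refuses to decode."""
    bad = []
    for b in range(0x80, 0x100):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            bad.append(b)
    return bad


def sed_rules(first=0xA0, last=0x175, codec="windows-1252"):
    """Build sed substitutions for code points first..last (inclusive).

    Returns (rules, skipped): skipped holds the code points whose UTF-8
    bytes could not be decoded with the given codec.
    """
    rules, skipped = [], []
    for i in range(first, last + 1):
        target = chr(i)
        try:
            # Simulate the mistake: UTF-8 bytes misread as a single-byte code page.
            garbled = target.encode("utf-8").decode(codec)
        except UnicodeDecodeError:
            skipped.append(i)
            continue
        rules.append("s/" + garbled + "/" + target + "/g")
    return rules, skipped


if __name__ == "__main__":
    rules, skipped = sed_rules()
    print(len(rules), "rules;", len(skipped), "code points skipped")
    print("rejected bytes:", [hex(b) for b in undecodable_bytes()])
```

Writing the rules to a file opened with encoding="utf-8" reproduces the
generated script; the skipped list shows which characters still need separate
handling, for example through the latin-1 rules.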

Even with these re-encoded bytes (minus the exceptions listed above when
retrying with windows-1252), the UTF-8 leading byte that precedes each of them,
which may itself have been doubly encoded by UTF-8, still cannot be validated
on its own once it is decoded from the same double UTF-8 encoding. So this
modified script will not be enough to perform safe substitutions independently
of their leading context (which may be one doubly-encoded UTF-8 leading byte
and up to 2 other doubly-encoded UTF-8 trailing bytes before the sequence you
detect).
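
One way to take the leading context into account is to stop substituting pairs
independently and to accept a candidate only when it forms a complete sequence:
one leading byte followed by exactly the number of trailing bytes it announces.
The Python 3 sketch below does that. It is a different technique from the
generated sed script, not something proposed in this thread, and it assumes the
text was misread as latin-1 (not windows-1252). The name fix_double_utf8 is
hypothetical.

```python
import re

# Characters U+0080-U+00FF stand for the original bytes 0x80-0xFF when UTF-8
# was misread as latin-1. Each alternative is a leading byte plus the exact
# number of trailing bytes (0x80-0xBF) that the leading byte requires.
CANDIDATE = re.compile(
    r"[\xc2-\xdf][\x80-\xbf]"        # 2-byte sequence
    r"|[\xe0-\xef][\x80-\xbf]{2}"    # 3-byte sequence
    r"|[\xf0-\xf4][\x80-\xbf]{3}"    # 4-byte sequence
)


def fix_double_utf8(text):
    """Undo one layer of 'UTF-8 read as latin-1' where the result is valid."""
    def repair(match):
        run = match.group(0)
        try:
            # Recover the original bytes, then decode them strictly as UTF-8.
            return run.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            # Overlong forms, surrogates, etc.: leave the text untouched.
            return run
    return CANDIDATE.sub(repair, text)


if __name__ == "__main__":
    original = "caf\u00e9 10\u20ac"
    garbled = original.encode("utf-8").decode("latin-1")
    print(fix_double_utf8(garbled) == original)
```

A leading character that is not followed by enough trailing characters is left
alone, so an isolated "Ã" in otherwise correct text is not rewritten.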
Received on Tue Oct 29 2013 - 14:03:03 CDT
