Re: Aw: Re: Re: Do you know a tool to decode "UTF-8 twice"

From: Steffen <sdaoden_at_gmail.com>
Date: Tue, 29 Oct 2013 21:07:53 +0100

"Jörg Knappen" <jknappen_at_web.de> wrote:
 | In my case, the iconv solution does not work. iconv throws an error
 | when confronted with both
 | "UTF-8 twice" and "UTF-8" in one string (and this exactly happened in
 | the database dump in question).

Painful condition; I'm not gonna ask how that could happen.
(I'm a groff(1) user now!)
Anyway, even the nice little C thing wouldn't have helped in this
scenario...

--steffen

attached mail follows:


Many thanks to all who answered my question.
 
In my case, the iconv solution does not work. iconv throws an error when confronted with both
"UTF-8 twice" and "UTF-8" in one string (and this exactly happened in the database dump in question).
 
What really did the job for me was a generated sed script; to generate it I used essentially the
following Python snippet, selecting the ranges I suspected to be in my data:

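 # Python 3: emit one sed rule per code point, mapping its doubly
 # encoded form (UTF-8 bytes misread as Latin-1) back to the character.
 with open("fixutf8.sed", "w", encoding="utf-8") as out:
     for i in range(0xa0, 0x176):
         good = chr(i)
         twice = good.encode("utf-8").decode("latin-1")
         out.write("s/" + twice + "/" + good + "/g\n")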
 
After running this script, a few more things remained: non-normalised accents and some really strange
encodings whose meanings I could not really explain, only guess, such as
 
s/Ãœ/Ü/g
s/É/É/g
s/AÌ€/À/g
s/aÌ€/à/g
s/EÌ€/È/g
s/eÌ€/è/g
s/„/„/g
s/“/“/g
s/ß/ß/g
s/’/’/g
s/Ä/Æ/g
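
A guess at where some of the strange-looking ones come from: sequences such as "AÌ€" and "Ãœ" are what one gets when the UTF-8 bytes were misread as Windows-1252 rather than Latin-1 (the range-based generator above assumes Latin-1), and the A/E rules additionally involve a combining grave accent, U+0300. A minimal Python 3 sketch under that assumption; the function name repair is made up for illustration:

```python
import unicodedata


def repair(mojibake):
    """Undo one UTF-8 layer that was misread as Windows-1252, then compose."""
    # Assumption: the wrong decoding step used cp1252, not ISO-8859-1.
    once = mojibake.encode("cp1252").decode("utf-8")
    # Fold base letter + combining accent into one precomposed character.
    return unicodedata.normalize("NFC", once)


# "A", U+00CC (I with grave), U+20AC (euro sign): the left side of the A rule.
print(repair("A\u00cc\u20ac") == "\u00c0")   # U+00C0, A with grave
# U+00C3 (A with tilde), U+0153 (oe ligature): the left side of the U rule.
print(repair("\u00c3\u0153") == "\u00dc")    # U+00DC, U with diaeresis
```

This only suits text that is doubly encoded throughout. Python's cp1252 codec also rejects the five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D), so some sequences will raise an error instead of converting.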
 
Greetings,
 
Jörg Knappen
 
Sent: Tuesday, 29 October 2013, 11:09
From: "Steffen \"Daode\" Nurpmeso" <sdaoden@gmail.com>
To: "Markus Scherer" <markus.icu@gmail.com>
Cc: Jörg <jknappen@web.de>, "Unicode Mailing List" <unicode@unicode.org>
Subject: Re: Aw: Re: Do you know a tool to decode "UTF-8 twice"
Markus Scherer <markus.icu@gmail.com> wrote:
|Does "iconv -f utf8 -t latin1 < ${i} | iconv -f utf8 -t utf8" not work? It
|decodes one layer of UTF-8 and tests if the result is still in UTF-8, that
|seems right, and should work for all of Unicode.

It does work for

ÄEIÖÜ

𐇐
𝄢
🀂
𐂂

but the error channel should possibly be suppressed all along the
way, as in

FILE=some-file.txt
(set +e;
iconv -f utf8 -t latin1 < "${FILE}" 2>/dev/null |
iconv -f utf8 -t utf8 >/dev/null 2>&1 &&
echo "It is likely that the file ${FILE} is encoded twice")
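
The same two-step test can be sketched in Python 3, which makes it easy to add a guard for plain-ASCII input (ASCII passes both conversions unchanged and would otherwise be reported as encoded twice). The function name looks_encoded_twice is hypothetical; the two codec steps are meant to mirror the two iconv calls.

```python
def looks_encoded_twice(data):
    """Return True if the bytes decode as UTF-8 twice over (via Latin-1)."""
    try:
        once = data.decode("utf-8")       # strip the outer UTF-8 layer
        inner = once.encode("latin-1")    # like: iconv -f utf8 -t latin1
        inner.decode("utf-8")             # like: iconv -f utf8 -t utf8
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    # Pure ASCII proves nothing, so require at least one non-ASCII byte.
    return any(b > 0x7F for b in inner)


sample = "\u00c4EI\u00d6\u00dc"                     # the test string "ÄEIÖÜ"
twice = sample.encode("utf-8").decode("latin-1").encode("utf-8")
print(looks_encoded_twice(twice))                   # doubly encoded input
print(looks_encoded_twice(sample.encode("utf-8")))  # singly encoded input
```

As with the pipeline, input that mixes both encodings fails the second decoding step and is reported as not encoded twice.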

I mean, a nice plain little C tool would also be nice: one that
simply iterates over the data, checks for the two-octet sequences
that encoding UTF-8 into UTF-8 produces, checks that the decoded
result is itself valid, and replaces the original input with the
decoded output only if the file contained at least one such
sequence.
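
A rough Python 3 sketch of such a tool, for illustration only (it is not the C program described here). It assumes the inner layer was misread as Latin-1, and it differs in one respect: it decodes each candidate sequence on its own, so strings that mix "UTF-8" and "UTF-8 twice" can be handled too. The names undo_double_utf8 and _CANDIDATE are made up.

```python
import re

# One UTF-8 sequence seen through Latin-1: a lead byte followed by the
# number of continuation bytes (0x80-0xBF) that the lead byte calls for.
_CANDIDATE = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def undo_double_utf8(text):
    """Return (repaired_text, number_of_sequences_decoded)."""
    count = 0

    def decode_one(match):
        nonlocal count
        chunk = match.group(0)
        try:
            fixed = chunk.encode("latin-1").decode("utf-8")
        except UnicodeDecodeError:
            return chunk  # not valid UTF-8 after all; leave untouched
        count += 1
        return fixed

    return _CANDIDATE.sub(decode_one, text), count


# A correct "Jörg" followed by a doubly encoded one.
mixed = "J\u00f6rg " + "J\u00f6rg".encode("utf-8").decode("latin-1")
print(undo_double_utf8(mixed))  # ('Jörg Jörg', 1)
```

A count of zero means the text comes back unchanged. Genuine text that happens to look like a doubly encoded sequence (a real "Ã" followed by "©", say) gets converted as well; that ambiguity cannot be resolved from the data alone.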

(At least it would integrate better into my workflow than
some graphical JAVA ©® written by assembler-aware beauties :))

|markus

--steffen

 
Received on Tue Oct 29 2013 - 15:10:42 CDT
