Re: Re: Do you know a tool to decode "UTF-8 twice"

From: Buck Golemon <buck_at_yelp.com>
Date: Mon, 28 Oct 2013 09:48:27 -0700

On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen" <jknappen_at_web.de> wrote:

> Hi Steffen,
>
> The data aren't that easy. There are non-Latin-1 characters encoded in the
> UTF-8 part. I expect, among others, typographic apostrophes, Polish
> characters, and some mediaevalist characters like ũ (u with tilde). Maybe
> there is also some Greek inside, but I am not sure about that.
>
> --Jörg Knappen
>
> *Sent:* Monday, 28 October 2013 at 12:34
> *From:* "Steffen \"Daode\" Nurpmeso" <sdaoden_at_gmail.com>
> *To:* "Jörg Knappen" <jknappen_at_web.de>
> *Cc:* unicode_at_unicode.org
> *Subject:* Re: Do you know a tool to decode "UTF-8 twice"
> "Jörg Knappen" <jknappen_at_web.de> wrote:
> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
> | UTF-8 proper in place?
>
> Isn't a shell script with a truly validating iconv(1) enough?
> This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run
>
> ?0[steffen_at_sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>
> As in
>
> for i in utf8.1 utf8.2; do
>     if iconv -f utf8 -t latin1 < "${i}" |
>             iconv -f utf8 -t utf8 >/dev/null 2>&1; then
>         echo "${i}: bummer, going home by one"
>         iconv -f utf8 -t latin1 < "${i}" > "${i}.new" 2>&1
>     else
>         echo "${i}: valid UTF-8"
>     fi
> done
>
> i'll end up as
>
> ?0[steffen_at_sherwood tmp]$ sh utf8dec.sh
> utf8.1: valid UTF-8
> utf8.2: bummer, going home by one
> ?0[steffen_at_sherwood tmp]$
>
> Ciao,
>
> | --Jörg Knappen
>
> --steffen
>

Jörg: There's no ready-made tool, but it's easy to write in Python.
I'll provide you a well-tested function in a few minutes.
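
In the meantime, here is a rough sketch of the idea, for illustration only
(it is not the tested function mentioned above). It assumes the damage came
from UTF-8 bytes being misread as Latin-1 and then encoded to UTF-8 again;
a Windows-1252 misreading would need a different byte mapping. The name
decode_utf8_twice is made up for this example. Characters outside
U+0080..U+00FF, such as Polish or Greek letters that were encoded only once,
are never touched.

```python
import re

# Runs of Latin-1 characters whose code points look like one UTF-8
# byte sequence (lead byte followed by 1 to 3 continuation bytes).
_DOUBLE_ENCODED = re.compile(
    "[\u00c2-\u00df][\u0080-\u00bf]"
    "|[\u00e0-\u00ef][\u0080-\u00bf]{2}"
    "|[\u00f0-\u00f4][\u0080-\u00bf]{3}"
)


def _undo_one_layer(match):
    """Turn the matched characters back into bytes and decode them as UTF-8."""
    raw = match.group(0).encode("latin-1")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not a valid UTF-8 sequence after all (overlong form, surrogate,
        # out of range): leave the original characters alone.
        return match.group(0)


def decode_utf8_twice(data: bytes) -> str:
    """Decode UTF-8 bytes, undoing a second UTF-8 layer where one is found."""
    text = data.decode("utf-8")
    return _DOUBLE_ENCODED.sub(_undo_one_layer, text)
```

One caveat: text that legitimately contains a pair such as "Ã" followed by
"¶" cannot be told apart from double-encoded "ö", so it would be converted
as well.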
Received on Mon Oct 28 2013 - 11:50:29 CDT
