Re: Aw: Re: Do you know a tool to decode "UTF-8 twice" from Steffen on 2013-10-28 (Unicode Mail List Archive)

From: Steffen <sdaoden_at_gmail.com>
Date: Mon, 28 Oct 2013 17:23:37 +0100

"Jörg Knappen" <jknappen_at_web.de> wrote:
| Hi Steffen,
|
| data aren't that easy. There are non-latin1-characters encoded in the
| UTF8 part. I expect

I see. Fantastic, now I feel responsible for hacking something
together, unless someone relieves me by tomorrow afternoon.
Sigh.

--steffen

attached mail follows:

Hi Steffen,

the data aren't that easy. There are non-Latin-1 characters encoded in
the UTF-8 part. I expect, among others, typographic apostrophes, Polish
characters, and some mediaevalist characters like ũ (u with tilde).
Maybe there is also some Greek inside, but I am not sure about that.
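
To illustrate why such characters matter: a whole-file conversion to
Latin-1 can only succeed if every character exists in Latin-1. The
following is a minimal sketch, assuming an iconv(1) that exits non-zero
on characters it cannot convert; the function name fits_latin1 is made
up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool mentioned in this thread):
# succeeds only if the UTF-8 text on standard input can be converted
# completely to ISO-8859-1.
fits_latin1() {
    iconv -f UTF-8 -t ISO-8859-1 >/dev/null 2>&1
}

# 'Ä' (bytes C3 84) exists in Latin-1, 'ũ' (bytes C5 A9) does not.
if printf '\303\204' | fits_latin1; then
    echo "A-umlaut: fits Latin-1"
fi
if ! printf '\305\251' | fits_latin1; then
    echo "u-tilde: does not fit Latin-1"
fi
```

A file that mixes doubly encoded text with proper UTF-8 of the second
kind would therefore fail the conversion as a whole.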

--Jörg Knappen

Sent: Monday, 28 October 2013, 12:34
From: "Steffen \"Daode\" Nurpmeso" <sdaoden@gmail.com>
To: "Jörg Knappen" <jknappen@web.de>
Cc: unicode@unicode.org
Subject: Re: Do you know a tool to decode "UTF-8 twice"

"Jörg Knappen" <jknappen@web.de> wrote:
| Is there a ready made tool that decodes "UTF-8 twice" while keeping
| UTF-8 proper in place?

Isn't a shell script with a truly validating iconv(1) enough?
This works for me if utf8.1 contains 'ÄEIÖÜ' in UTF-8 and I run

?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2

Then, with this script (utf8dec.sh):

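for i in utf8.1 utf8.2; do
  # A doubly encoded file decodes to Latin-1 bytes that are themselves
  # valid UTF-8. Both steps are checked separately so that a failure of
  # the first iconv is not masked by the exit status of the second.
  if iconv -f utf8 -t latin1 < "${i}" > "${i}.new" 2>/dev/null &&
      iconv -f utf8 -t utf8 < "${i}.new" >/dev/null 2>&1; then
    echo "${i}: bummer, going home by one"
  else
    # Not doubly encoded: discard the partial output.
    rm -f "${i}.new"
    echo "${i}: valid UTF-8"
  fi
done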

I end up with

?0[steffen@sherwood tmp]$ sh utf8dec.sh
utf8.1: valid UTF-8
utf8.2: bummer, going home by one
?0[steffen@sherwood tmp]$
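
For data that mixes doubly encoded text with proper UTF-8, the same
check could be applied line by line instead of per file. What follows is
only a rough sketch under stated assumptions: iconv(1) exits non-zero on
invalid or unconvertible input, each line is either entirely doubly
encoded or entirely proper, and the names fix_line and fix_stream are
invented for this example. Pure ASCII lines pass through unchanged.
Proper text that merely looks doubly encoded would be converted
by mistake.

```shell
#!/bin/sh
# Hypothetical helper: print one line, undoing one layer of UTF-8 if
# (and only if) the line looks doubly encoded.
fix_line() {
    # Step 1: the line must map completely onto ISO-8859-1.
    # Step 2: the resulting bytes must again be valid UTF-8.
    if fixed=$(printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) &&
        printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
        printf '%s\n' "$fixed"
    else
        # Keep proper UTF-8 (or anything that cannot be decoded) as it is.
        printf '%s\n' "$1"
    fi
}

# Hypothetical filter: apply fix_line to every line on standard input.
fix_stream() {
    while IFS= read -r line || [ -n "$line" ]; do
        fix_line "$line"
    done
}
```

It would be used as a filter, for example "fix_stream < utf8.2 >
utf8.2.new" after the two functions have been defined in the script.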

Ciao,

| --Jörg Knappen

--steffen

Received on Mon Oct 28 2013 - 11:26:27 CDT
