Re: Do you know a tool to decode "UTF-8 twice"

From: Steffen <sdaoden_at_gmail.com>
Date: Mon, 28 Oct 2013 12:34:10 +0100

"Jörg Knappen" <jknappen_at_web.de> wrote:
 | Is there a ready made tool that decodes "UTF-8 twice" while keeping
 | UTF-8 proper in place?

Isn't a shell script with a truly validating iconv(1) enough?
This works for me if utf8.1 contains 'ÄEIÖÜ' in UTF-8 and I run

  ?0[steffen_at_sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2

With this script (utf8dec.sh)

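  for i in utf8.1 utf8.2; do
    # Strip one UTF-8 layer; if the result is still valid UTF-8,
    # the file was encoded twice.
    if iconv -f utf8 -t latin1 < "${i}" |
        iconv -f utf8 -t utf8 >/dev/null 2>&1; then
      echo "${i}: bummer, going home by one"
      # Keep error messages out of the decoded file.
      iconv -f utf8 -t latin1 < "${i}" > "${i}.new"
    else
      echo "${i}: valid UTF-8"
    fi
  done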

I end up with

  ?0[steffen_at_sherwood tmp]$ sh utf8dec.sh
  utf8.1: valid UTF-8
  utf8.2: bummer, going home by one
  ?0[steffen_at_sherwood tmp]$
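
Since the database mixes ASCII, proper UTF-8 and "UTF-8 twice", the same
test can also be applied line by line instead of per file. The following
is only a sketch under assumptions: fix_line and fix_stream are made-up
helper names, one record per line is assumed, and iconv(1) must reject
invalid input with a non-zero exit status. Unlike the loop above, it also
checks the exit status of the first iconv, so text that cannot be
represented in Latin-1 is left alone.

```shell
#!/bin/sh
# fix_line (hypothetical helper): print one line with a single layer of
# "UTF-8 twice" removed, or unchanged if it does not look double-encoded.
fix_line() {
  # Step 1: strip one UTF-8 layer. This fails if the line contains
  # characters outside Latin-1, which double-encoded text never does.
  # Step 2: the stripped bytes must themselves be valid UTF-8.
  if decoded=$(printf '%s' "$1" | iconv -f UTF-8 -t ISO-8859-1 2>/dev/null) &&
     printf '%s' "$decoded" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    printf '%s\n' "$decoded"
  else
    printf '%s\n' "$1"
  fi
}

# fix_stream (hypothetical helper): apply fix_line to every line on stdin.
fix_stream() {
  while IFS= read -r line || [ -n "$line" ]; do
    fix_line "$line"
  done
}
```

Usage would be something like "fix_stream < dump.txt > dump.fixed", with
dump.txt standing in for whatever export the database produces. Proper
UTF-8 text that happens to look like a double encoding (a literal 'Ã'
followed by a fitting Latin-1 character) cannot be told apart and would be
decoded as well.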

Ciao,

 | --Jörg Knappen

--steffen

attached mail follows:


I have a database with broken encoding: besides ASCII and UTF-8 proper, it
contains a lot of "UTF-8 twice" (that infamous encoding that arises when
UTF-8 is interpreted as latin-1 and converted to UTF-8 again).
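
At the byte level the effect can be sketched as follows. This is an
illustration only: double_encode and to_hex are made-up helper names, and
the misreading is assumed to be strict ISO-8859-1 rather than
Windows-1252.

```shell
#!/bin/sh
# double_encode (hypothetical helper): misread UTF-8 bytes on stdin as
# Latin-1 and re-encode them as UTF-8, producing "UTF-8 twice".
double_encode() {
  iconv -f ISO-8859-1 -t UTF-8
}

# to_hex (hypothetical helper): dump stdin as one contiguous hex string.
to_hex() {
  od -An -tx1 | tr -d ' \n'
}

# 'Ä' is U+00C4, which is the two bytes C3 84 in UTF-8.
printf '\303\204' | to_hex; echo                   # c384
# Read as Latin-1, those bytes are U+00C3 and U+0084; re-encoding each
# one as UTF-8 yields four bytes.
printf '\303\204' | double_encode | to_hex; echo   # c383c284
```

ASCII passes through unchanged, which is why only the non-ASCII part of
the data is damaged.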
 
Is there a ready made tool that decodes "UTF-8 twice" while keeping UTF-8 proper in place?
 
--Jörg Knappen
Received on Mon Oct 28 2013 - 06:36:36 CDT
