Re: Aw: Re: Do you know a tool to decode "UTF-8 twice" from Steffen on 2013-10-29 (Unicode Mail List Archive)

From: Steffen <sdaoden_at_gmail.com>
Date: Tue, 29 Oct 2013 11:09:50 +0100

Markus Scherer <markus.icu_at_gmail.com> wrote:
|Does "iconv -f utf8 -t latin1 < ${i} | iconv -f utf8 -t utf8" not work? It
|decodes one layer of UTF-8 and tests if the result is still in UTF-8, that
|seems right, and should work for all of Unicode.

It does work for

  ÄEIÖÜ
  ①
  𐇐
  𝄢
  🀂
  𐂂

but the error output should probably be suppressed at every stage,
as in

  FILE=some-file.txt
  (set +e
  # Discard iconv's diagnostics instead of piping them into the next stage.
  iconv -f utf8 -t latin1 < "${FILE}" 2>/dev/null |
  iconv -f utf8 -t utf8 >/dev/null 2>&1 &&
  echo "It is likely that the file ${FILE} is encoded twice")

What would also be nice is a plain little C tool that simply
iterates over the data and checks for the two-octet sequences that
encoding UTF-8 into UTF-8 produces. It would check the decoded
sequences, too, and replace the original input with the decoded
output only if the file turned out to contain at least one such
sequence.
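
To make the idea concrete, here is a rough sketch of that loop, written
in Python rather than C only to keep it short. The function name
undo_double_utf8 is made up for this mail, and the sketch assumes the
second encoding step treated every octet as a Latin-1 character, so
each octet from 0x80 to 0xFF shows up as a C2 xx or C3 xx pair.

```python
def undo_double_utf8(data: bytes) -> bytes:
    """Strip one layer of UTF-8 if data looks encoded twice, else return it unchanged."""
    out = bytearray()
    found = False  # saw at least one two-octet sequence
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII is identical in both layers.
            out.append(b)
            i += 1
        elif b in (0xC2, 0xC3) and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # C2 xx maps back to 0x80..0xBF, C3 xx to 0xC0..0xFF.
            out.append(((b & 0x03) << 6) | (data[i + 1] & 0x3F))
            found = True
            i += 2
        else:
            # No other octet can result from encoding a Latin-1 character.
            return data
    if not found:
        return data
    try:
        # Check the resulting sequences, too: they must be valid UTF-8.
        out.decode("utf-8")
    except UnicodeDecodeError:
        return data
    return bytes(out)
```

Input that consists only of ASCII, or that fails either check, comes
back untouched, so it should be safe to run over files that were only
encoded once. Text that legitimately contains sequences such as "Ã¤"
would still be misread as doubly encoded, so the result is a likely
guess, not a proof.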

|markus

--steffen
Received on Tue Oct 29 2013 - 05:12:22 CDT
