From: Jim Monty (jim.monty@yahoo.com)
Date: Wed Nov 03 2010 - 22:42:20 CST
Björn Höhrmann wrote:
> The simple solution to that is a small state machine that you
> put each byte through...
Thank you very much for your suggestions, Björn.
From your reply, as well as from your Web page titled "Flexible and Economical
UTF-8 Decoder" (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/), it's obvious
you're exactly the right C programmer to write the utility I'm looking for: one
that reports and repairs corrupted UTF-16 text. Its purpose would be to fix
UTF-16 text that is mostly viable but nonetheless broken by one or more
noncharacters or unpaired surrogate code units. The rationale is to make text
that iconv, Perl and other software choke on viable and usable again.
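To make the request concrete, here is a rough sketch in Python of the behavior
described above. It is not an existing tool. It only illustrates the idea. The
function names (repair_utf16_units, repair_utf16le_bytes), the choice of U+FFFD
as the replacement, and the little-endian byte order are all assumptions made
for this example.

```python
import struct

REPLACEMENT = 0xFFFD  # assumption: substitute U+FFFD for every bad code unit


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def repair_utf16_units(units):
    """Return (repaired_units, problems) for a list of 16-bit code units.

    problems holds one (index, description) tuple per reported defect,
    where index is the position in the original input.
    """
    out, problems = [], []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (lead) surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                # Well-formed pair: combine it to check for noncharacters.
                cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
                if is_noncharacter(cp):
                    problems.append((i, "noncharacter U+%04X" % cp))
                    out.append(REPLACEMENT)
                else:
                    out.extend((u, units[i + 1]))
                i += 2
                continue
            problems.append((i, "unpaired high surrogate 0x%04X" % u))
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:  # low (trail) surrogate with no lead
            problems.append((i, "unpaired low surrogate 0x%04X" % u))
            out.append(REPLACEMENT)
        elif is_noncharacter(u):
            problems.append((i, "noncharacter U+%04X" % u))
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out, problems


def repair_utf16le_bytes(data):
    """Repair little-endian UTF-16 bytes; a trailing odd byte is dropped."""
    count = len(data) // 2
    units = list(struct.unpack("<%dH" % count, data[:count * 2]))
    fixed, problems = repair_utf16_units(units)
    return struct.pack("<%dH" % len(fixed), *fixed), problems
```

Printing the problems list gives the reporting half of the utility, and the
returned bytes should then decode cleanly as UTF-16LE. A real tool would also
need to handle the byte order mark, big-endian input and files too large to
hold in memory.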
Unfortunately, I'm not a good enough programmer to write such a utility in C or
even Perl, the language I know best. Is this a project that interests you, by
chance?
I'm surprised I'm having difficulty finding an existing utility to repair broken
UTF-16 text. I thought this was something many programmers would need,
especially Web developers.
Thank you again for your thoughtful reply.
Jim Monty