Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Jim Monty (jim.monty@yahoo.com)
Date: Wed Nov 03 2010 - 22:42:20 CST

Next message: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Bjoern Hoehrmann: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Björn Höhrmann wrote:
> The simple solution to that is a small state machine that you
> put each byte through...

Thank you very much for your suggestions, Björn.

From your reply, as well as from your Web page titled "Flexible and Economical
UTF-8 Decoder" (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/), it's obvious
you're exactly the right C programmer to write just the utility I'm looking
for: one that reports and repairs corrupted UTF-16 text. Its purpose would be
to fix UTF-16 text that is mostly viable but nonetheless broken due to one or
more noncharacters or invalid surrogate-pair code units. The rationale is to
make viable and usable the UTF-16 text that iconv, Perl and other software
choke on.
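
To make the idea concrete, here is a rough sketch in Python of what the
surrogate part of such a utility might look like, using a tiny state machine
over 16-bit code units in the spirit of Björn's suggestion. Everything in it is
an assumption for illustration: the function names repair_utf16_units and
repair_utf16le are hypothetical, the input is assumed to be little-endian with
no BOM handling, and replacing each unpaired surrogate with U+FFFD is only one
possible repair policy. Noncharacters are not handled here.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed repair policy)


def repair_utf16_units(units):
    """Return (repaired_units, problems) for a sequence of 16-bit code units.

    problems is a list of (index, code_unit, description) tuples, where
    index is the position of the offending unit in the original input.
    """
    repaired = []
    problems = []
    pending = None  # index of a high surrogate still waiting for its low half

    for i, u in enumerate(units):
        is_high = 0xD800 <= u <= 0xDBFF
        is_low = 0xDC00 <= u <= 0xDFFF

        if pending is not None:
            if is_low:
                # Well-formed pair: keep both halves unchanged.
                repaired.extend((units[pending], u))
                pending = None
                continue
            # The earlier high surrogate was never completed.
            problems.append((pending, units[pending], "unpaired high surrogate"))
            repaired.append(REPLACEMENT)
            pending = None

        if is_high:
            pending = i
        elif is_low:
            problems.append((i, u, "unpaired low surrogate"))
            repaired.append(REPLACEMENT)
        else:
            repaired.append(u)

    if pending is not None:
        problems.append(
            (pending, units[pending], "unpaired high surrogate at end of input")
        )
        repaired.append(REPLACEMENT)

    return repaired, problems


def repair_utf16le(data):
    """Byte-level wrapper; assumes little-endian input, ignores a trailing odd byte."""
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data[: count * 2])
    repaired, problems = repair_utf16_units(units)
    return struct.pack("<%dH" % len(repaired), *repaired), problems
```

A report could then be printed from the returned list, for example by looping
over the result of repair_utf16le(data) and printing each index and code unit
in hexadecimal. Because every replacement is itself a single code unit, the
repaired text has the same length as the input and the reported indexes apply
to both.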

Unfortunately, I'm not a good enough programmer to write such a utility in C or
even Perl, the language I know best. Is this a project that interests you, by
chance?

I'm surprised I'm having difficulty finding an existing utility to repair broken
UTF-16 text. I thought this was something many programmers would need,
especially Web developers.

Thank you again for your thoughtful reply.

Jim Monty
