Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Jim Monty (jim.monty@yahoo.com)
Date: Wed Nov 03 2010 - 22:42:20 CST


    Björn Höhrmann wrote:
    > The simple solution to that is a small state machine that you
    > put each byte through...

    Thank you very much for your suggestions, Björn.

    From your reply, as well as from your Web page titled "Flexible and Economical
    UTF-8 Decoder" (http://bjoern.hoehrmann.de/utf-8/decoder/dfa/), it's obvious
    you're exactly the right C programmer to write the utility I'm looking for: one
    that reports and repairs corrupted UTF-16 text. Its purpose would be to fix
    UTF-16 text that is mostly viable but nonetheless broken because it contains one
    or more noncharacters or invalid surrogate-pair code units. The rationale is to
    make viable and usable the UTF-16 text that iconv, Perl and other software
    choke on.
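
For illustration only, here is a minimal sketch of the kind of report-and-repair pass described above. It is written in Python rather than C, and it is not Björn's state machine. It assumes the input is raw UTF-16 with a known byte order and no BOM, handles only unpaired surrogates (noncharacters are left untouched), and replaces each bad code unit with U+FFFD. The function name repair_utf16 and its return shape are hypothetical.

```python
import struct

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def repair_utf16(data: bytes, little_endian: bool = True):
    """Return (repaired_bytes, problems) for raw UTF-16 data without a BOM.

    problems is a list of (code_unit_index, description) tuples.
    Hypothetical helper: only unpaired surrogates are detected.
    """
    fmt = "<H" if little_endian else ">H"
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    units = [u for (u,) in struct.iter_unpack(fmt, data[:usable])]

    out = []
    problems = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                out.extend((u, units[i + 1]))
                i += 2
                continue
            problems.append((i, "unpaired high surrogate 0x%04X" % u))
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate reached here has no preceding high surrogate.
            problems.append((i, "unpaired low surrogate 0x%04X" % u))
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1

    if usable != len(data):
        problems.append((len(units), "trailing odd byte dropped"))

    repaired = b"".join(struct.pack(fmt, u) for u in out)
    return repaired, problems
```

Calling repair_utf16(raw_bytes) on little-endian input returns the repaired bytes together with a list of the code-unit positions that were replaced, so the same pass can serve as both the report and the repair. Substituting U+FFFD is one design choice among several; dropping the bad code unit instead would be an equally small change.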

    Unfortunately, I'm not a good enough programmer to write such a utility in C or
    even Perl, the language I know best. Is this a project that interests you, by
    chance?

    I'm surprised I'm having difficulty finding an existing utility to repair broken
    UTF-16 text. I thought this was something many programmers would need,
    especially Web developers.

    Thank you again for your thoughtful reply.

    Jim Monty


