Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Markus Scherer (markus.icu@gmail.com)
Date: Thu Nov 04 2010 - 17:30:56 CST

    On Thu, Nov 4, 2010 at 2:52 PM, Jim Monty <jim.monty@yahoo.com> wrote:

    > > In other words, when you process 16-bit Unicode text it takes no effort
    > > to handle unpaired surrogates, other than making sure that you only
    > > assemble a supplementary code point when a lead surrogate is really
    > > followed by a trail surrogate. Hence little need for cleanup functions --
    > > but if you need one, it's trivial to write one for UTF-16.
    >
    > Thank you! This is what I've always understood about the design of the
    > UTFs: they're generally quite robust. One errant character doesn't make
    > the whole text unusable. And in the case of transcoding from, say, UTF-16
    > to UTF-8, it's reasonably straightforward to handle anomalies.
    >
    > So imagine my dismay when I wrote a trivial Perl script to convert a UTF-16
    > file to a UTF-8 file and it died immediately on the first text file I
    > tested it on. I got this error message:
    >
    > UTF-16:Malformed LO surrogate db82 at utf16-to-utf8.pl line 24,
    > <$utf16_dat_fh> line 119.
    >

    There is a difference between processing "16-bit Unicode text" and
    converting it to UTF-8 or UTF-32, or even to well-formed UTF-16.

    While processing 16-bit Unicode text which is not assumed to be well-formed
    UTF-16, you can treat (*de*code) an unpaired surrogate as a mostly-inert
    surrogate code point. However, you cannot *unambiguously* *en*code a
    surrogate code point in 16-bit text (because you could not distinguish a
    sequence of lead+trail surrogate code points from one supplementary code
    point), and therefore it is not allowed to encode surrogate code points in
    any *well-formed UTF*-8/16/32. [All of this is discussed in The Unicode
    Standard, Chapter 3.]
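
    The decoding rule and the encoding ambiguity described above can be
    sketched in a few lines of Python. This is only an illustration under
    stated assumptions: the function names decode_units and encode_units are
    hypothetical, and the input is taken to be a plain list of 16-bit code
    units rather than any particular library's string type.

```python
def decode_units(units):
    """Yield code points from 16-bit code units (hypothetical helper).

    A supplementary code point is assembled only when a lead surrogate is
    really followed by a trail surrogate; any other surrogate is passed
    through as a surrogate code point.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u  # BMP code point or unpaired surrogate
            i += 1


def encode_units(code_points):
    """Encode code points as 16-bit code units without validation
    (hypothetical helper, used here only to show the ambiguity)."""
    out = []
    for cp in code_points:
        if cp >= 0x10000:
            cp -= 0x10000
            out += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            out.append(cp)
    return out


# Decoding tolerates the unpaired lead surrogate 0xDB82.
decoded = list(decode_units([0x41, 0xD800, 0xDC00, 0xDB82, 0x42]))

# Encoding is ambiguous: two surrogate code points and one supplementary
# code point produce the same code units.
ambiguous = encode_units([0xD800, 0xDC00]) == encode_units([0x10000])
```

    Here decoded keeps 0xDB82 as its own code point, and ambiguous is true
    because both inputs encode to the code units D800 DC00.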

    So a converter is correct in treating an unpaired surrogate as an error. On
    the other hand...

    > I guess I should appeal to the maintainer of the Perl core Encode module to
    > loosen the shackles a bit, eh?
    >

    Any conversion library should offer options for *how to deal with* errors.
    One way is to return an error, throw an exception, or equivalent. Another is
    to replace the offending sequence with some substitution character (usually
    U+FFFD when the target is a form of Unicode) and continue converting after
    that.
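
    As an illustration of these two options, here is a minimal sketch using
    Python's built-in codecs, which take an errors argument. It is one
    library's take on the idea, not a statement about how Perl's Encode
    behaves. The helper name convert and the sample bytes are assumptions
    made up for this example.

```python
# 'A', an unpaired lead surrogate 0xDB82, then 'B', as UTF-16LE bytes.
raw = bytes.fromhex("410082db4200")


def convert(data, errors):
    """Convert UTF-16LE bytes to UTF-8 bytes (hypothetical helper).

    errors="strict" raises on malformed input; errors="replace"
    substitutes U+FFFD and continues converting.
    """
    return data.decode("utf-16-le", errors=errors).encode("utf-8")


# Option 1: report the error and stop.
try:
    convert(raw, "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Option 2: substitute U+FFFD (EF BF BD in UTF-8) and keep going.
replaced = convert(raw, "replace")
```

    With the strict handler the conversion stops at the unpaired surrogate;
    with the replace handler the surrounding 'A' and 'B' survive and only the
    offending code unit becomes U+FFFD.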

    If the conversion libraries you are using do not support this (I don't
    know), then you could ask for such options. Or use conversion libraries that
    do support such options (like ICU and Java).

    Best regards,
    markus


