Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Asmus Freytag (
Date: Thu Nov 04 2010 - 23:47:30 CST


    On 11/4/2010 5:46 PM, Doug Ewell wrote:
    > Markus Scherer wrote:
    >> While processing 16-bit Unicode text which is not assumed to be
    >> well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
    >> mostly-inert surrogate code point. However, you cannot unambiguously
    >> encode a surrogate code point in 16-bit text (because you could not
    >> distinguish a sequence of lead+trail surrogate code points from one
    >> supplementary code point), and therefore it is not allowed to encode
    >> surrogate code points in any well-formed UTF-8/16/32. [All of this is
    >> discussed in The Unicode Standard, Chapter 3.]
    > I'm probably missing something here, but I don't agree that it's OK
    > for a consumer of UTF-16 to accept an unpaired surrogate without
    > throwing an error, or converting it to U+FFFD, or otherwise raising a
    > fuss. Unpaired surrogates are ill-formed, and have to be caught and
    > dealt with.

    The question is whether you want every library that handles strings to
    perform the equivalent of a citizen's arrest, or whether you architect
    things so that the gatekeepers (border control) police the data stream.

    During development, early and widespread error detection is helpful in
    debugging. After that, it's probably better to concentrate the handling
    of these errors, because that tends to improve your options for
    implementing successful error recovery.

    Malformed data shouldn't get in and shouldn't get perpetuated, but in
    the general case, there should be a facility for "repairing" faulty
    data, wherever that is reasonably possible.
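
    As an illustration only, a gatekeeper-style "report and repair" pass
    over 16-bit code units might look like the sketch below. The function
    name repair_utf16 and the choice of U+FFFD as the replacement are
    assumptions made for this example, not something prescribed in this
    thread; other repair policies are possible.

```python
REPLACEMENT = 0xFFFD  # assumed repair policy: substitute U+FFFD


def is_lead(unit: int) -> bool:
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit: int) -> bool:
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16(units):
    """Return (repaired_units, offsets_of_unpaired_surrogates).

    `units` is a sequence of 16-bit code units that is not assumed to be
    well-formed UTF-16. Hypothetical helper, for illustration only.
    """
    repaired = []
    broken = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            # Well-formed pair: copy both code units unchanged.
            repaired.extend((unit, units[i + 1]))
            i += 2
            continue
        if is_lead(unit) or is_trail(unit):
            # Unpaired surrogate: record its offset and replace it.
            broken.append(i)
            repaired.append(REPLACEMENT)
        else:
            repaired.append(unit)
        i += 1
    return repaired, broken
```

    Run once where data enters the system, a pass like this both reports
    the offsets of the damage and hands well-formed text to everything
    downstream.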

    In the context of uppercasing a string, for example, repair is not a
    reasonable option, and neither is rejecting the string at that point;
    it should have been rejected or repaired much earlier.
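
    A rough sketch of that division of labour, using Python strings, which
    can hold lone surrogate code points: the inner routine does no policing
    and lets a stray surrogate pass through untouched, while a separate
    check sits at the boundary. The names to_upper and is_well_formed are
    hypothetical and exist only for this example.

```python
def to_upper(text: str) -> str:
    """Inner library routine: no validation, no repair.

    A lone surrogate has no case mapping, so it passes through unchanged.
    """
    return text.upper()


def is_well_formed(text: str) -> bool:
    """Boundary check: a Python str holds code points, so any surrogate
    code point found in it is an unpaired surrogate."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


damaged = "a\ud800b"          # contains an unpaired lead surrogate
result = to_upper(damaged)    # the surrogate is carried along, not altered
```

    Here to_upper neither hides nor worsens the damage; deciding whether
    to reject or repair is left to whatever code calls is_well_formed at
    the edge of the system.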

