Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Doug Ewell (doug@ewellic.org)
Date: Fri Nov 05 2010 - 08:02:34 CST

    Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

    >> I'm probably missing something here, but I don't agree that it's OK
    >> for a consumer of UTF-16 to accept an unpaired surrogate without
    >> throwing an error, or converting it to U+FFFD, or otherwise raising a
    >> fuss. Unpaired surrogates are ill-formed, and have to be caught and
    >> dealt with.
    >
    > The question is whether you want every library that handles strings
    > to perform the equivalent of a citizen's arrest, or whether you architect
    > things so that the gatekeepers (border control) police the data stream.

    If you can have upstream libraries check for unpaired surrogates at the
    time they convert UTF-16 to Unicode code points, then your point is well
    taken, because then the downstream libraries are no longer dealing with
    UTF-16, but with code points. Doing conversion and validation at
    different stages isn't a great idea; that's how character encodings get
    involved with security problems.
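
    As a rough illustration of validating in the same pass as conversion,
    here is a minimal Python sketch. The function name decode_utf16_units
    and its strict flag are made up for this example and are not taken from
    any library; it assumes the input is already a sequence of 16-bit code
    units.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def decode_utf16_units(units, strict=True):
    """Convert UTF-16 code units to code points, validating as we go.

    Hypothetical helper for illustration. An unpaired surrogate either
    raises ValueError (strict=True) or becomes U+FFFD (strict=False);
    it is never passed through to the caller.
    """
    units = list(units)
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            if strict:
                raise ValueError(f"unpaired high surrogate at index {i}")
            out.append(REPLACEMENT)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            if strict:
                raise ValueError(f"unpaired low surrogate at index {i}")
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out
```

    Under these assumptions, anything downstream of this function sees only
    Unicode scalar values and never has to ask whether a surrogate is paired.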

    Corrigendum #1 closed the door on interpretation of invalid UTF-8
    sequences. I'm not sure why the approach to handling UTF-16 should be
    any different.
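
    For comparison, a short sketch of how one existing decoder treats the
    two cases alike. It uses only Python's built-in bytes.decode; the helper
    name is_well_formed is invented for this example. The UTF-8 input is the
    non-shortest form C0 AF, the kind of sequence Corrigendum #1 ruled out,
    and the UTF-16 input is a lone high surrogate followed by "A".

```python
def is_well_formed(data, encoding):
    """Return True if Python's strict decoder accepts the bytes.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode(encoding)  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


overlong_utf8 = b"\xc0\xaf"            # non-shortest form of U+002F
lone_surrogate = b"\x00\xd8\x41\x00"   # D800 0041 in UTF-16LE

print(is_well_formed(overlong_utf8, "utf-8"))
print(is_well_formed(lone_surrogate, "utf-16-le"))
print(is_well_formed(b"\x3d\xd8\x00\xde", "utf-16-le"))  # D83D DE00, a valid pair
```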

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s
    

