Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Asmus Freytag
Date: Fri Nov 05 2010 - 12:54:45 CST

    On 11/5/2010 7:02 AM, Doug Ewell wrote:
    > Asmus Freytag<asmusf at ix dot netcom dot com> wrote:
    >>> I'm probably missing something here, but I don't agree that it's OK
    >>> for a consumer of UTF-16 to accept an unpaired surrogate without
    >>> throwing an error, or converting it to U+FFFD, or otherwise raising a
    >>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
    >>> dealt with.
    >> The question is whether you want every library that handles strings
    >> to perform the equivalent of a citizen's arrest, or whether you
    >> architect things so that the gatekeepers (border control) police the
    >> data stream.
    > If you can have upstream libraries check for unpaired surrogates at the
    > time they convert UTF-16 to Unicode code points, then your point is well
    > taken, because then the downstream libraries are no longer dealing with
    > UTF-16, but with code points. Doing conversion and validation at
    > different stages isn't a great idea; that's how character encodings get
    > involved with security problems.

    Note that I am careful not to suggest that (and I'm sure Markus isn't
    either). "Handling" includes much more than code conversion. It includes
    uppercasing, spell checking, sorting, searching, the whole lot.
    Burdening every single one of those tasks with policing the integrity of
    the encoding seems wasteful, and, as I tried to explain, puts the error
    detection in a place where you'll be most likely prevented from doing
    something useful in recovery.

    Data import or code conversion routines are in a much better place,
    architecturally, to offer the user meaningful options for dealing with
    corrupted data, from rejecting it outright to attempting a repair.
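
    As a rough illustration of what such a gatekeeper might look like, here
    is a minimal sketch in Python. It assumes the input arrives as a list of
    16-bit code units; the names find_unpaired_surrogates and import_utf16,
    and the "reject" / "replace" policy strings, are hypothetical and not
    taken from any existing library.

```python
from typing import List, Tuple

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def find_unpaired_surrogates(units: List[int]) -> List[int]:
    """Return the indices of code units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # skip the whole pair
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


def import_utf16(units: List[int], policy: str = "reject") -> Tuple[List[int], List[int]]:
    """Hypothetical gatekeeper: report, then reject or repair.

    Returns (code units, indices that were ill-formed in the input).
    """
    bad = find_unpaired_surrogates(units)
    if not bad:
        return list(units), bad
    if policy == "reject":
        raise ValueError(f"unpaired surrogate(s) at code unit index {bad}")
    if policy == "replace":
        repaired = list(units)
        for i in bad:
            repaired[i] = REPLACEMENT
        return repaired, bad
    raise ValueError(f"unknown policy: {policy!r}")
```

    The point of the sketch is only that the choice between rejecting and
    repairing is made once, at the border, where the caller still knows
    where the data came from and can report the offending positions.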

    However, some tasks, such as network identifier matching, are
    security-sensitive and must re-validate their input, even if the data
    has already passed a gatekeeper such as a validating code conversion
    routine.
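
    A sketch of that second kind of check, again in Python and again with
    hypothetical names (is_well_formed_utf16, identifiers_match). It assumes
    identifiers are lists of 16-bit code units and compares them code unit
    by code unit; a real matcher would also apply whatever normalization and
    case folding its protocol requires, which is left out here.

```python
from typing import Sequence


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """True if every surrogate code unit is part of a high+low pair."""
    expect_low = False  # True right after a high surrogate
    for u in units:
        is_low = 0xDC00 <= u <= 0xDFFF
        # A low surrogate is valid exactly when one is expected.
        if expect_low != is_low:
            return False
        expect_low = 0xD800 <= u <= 0xDBFF
    # A trailing high surrogate is also ill-formed.
    return not expect_low


def identifiers_match(a: Sequence[int], b: Sequence[int]) -> bool:
    """Security-sensitive comparison that does not trust upstream checks."""
    if not (is_well_formed_utf16(a) and is_well_formed_utf16(b)):
        raise ValueError("ill-formed UTF-16 in identifier")
    return list(a) == list(b)
```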

    > Corrigendum #1 closed the door on interpretation of invalid UTF-8
    > sequences. I'm not sure why the approach to handling UTF-16 should be
    > any different.
