Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Fri Nov 05 2010 - 12:54:45 CST

Next message: Mark Davis ☕: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Mark Davis ☕: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Reply: Mark Davis ☕: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

On 11/5/2010 7:02 AM, Doug Ewell wrote:
> Asmus Freytag<asmusf at ix dot netcom dot com> wrote:
>
>>> I'm probably missing something here, but I don't agree that it's OK
>>> for a consumer of UTF-16 to accept an unpaired surrogate without
>>> throwing an error, or converting it to U+FFFD, or otherwise raising a
>>> fuss. Unpaired surrogates are ill-formed, and have to be caught and
>>> dealt with.
>> The question is whether you want every library that handles strings
>> to perform the equivalent of a citizen's arrest, or whether you
>> architect things so that the gatekeepers (border control) police the
>> data stream.
> If you can have upstream libraries check for unpaired surrogates at the
> time they convert UTF-16 to Unicode code points, then your point is well
> taken, because then the downstream libraries are no longer dealing with
> UTF-16, but with code points. Doing conversion and validation at
> different stages isn't a great idea; that's how character encodings get
> involved with security problems.

Note that I am careful not to suggest that (and I'm sure Markus isn't
either). "Handling" includes much more than code conversion. It includes
uppercasing, spell checking, sorting, searching, the whole lot.
Burdening every single one of those tasks with policing the integrity of
the encoding seems wasteful and, as I tried to explain, puts error
detection in the place where you are least likely to be able to do
anything useful to recover.
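
To make that concrete, here is a rough sketch (Python, purely
illustrative; the function name is made up for this example) of a
downstream routine that walks UTF-16 code units and simply passes an
unpaired surrogate through as a single value instead of raising an
error. It assumes the input is a sequence of 16-bit integers.

```python
from typing import Iterator, Sequence


def iter_code_points(units: Sequence[int]) -> Iterator[int]:
    """Yield code points from UTF-16 code units (hypothetical helper).

    Well-formed surrogate pairs are combined. An unpaired surrogate is
    yielded unchanged; this routine does no policing of its own.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: combine them.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP code unit or unpaired surrogate: pass it along as-is.
            yield u
            i += 1
```

A caller that sorts or searches can consume this without a separate
error path; whether that is acceptable depends on a gatekeeper having
checked the data earlier.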

Data import or code conversion routines are in a much better place,
architecturally, to offer the user meaningful options for dealing with
corrupted data, from rejecting it to attempting a repair.
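
As an illustration only, an import routine of that kind might look
something like the sketch below. The name import_utf16 and the
on_error parameter are assumptions made for this example, not any
existing API. It reports the offset of each unpaired surrogate and
either rejects the data or repairs it by substituting U+FFFD.

```python
from typing import List, Sequence, Tuple

REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def _is_high(u: int) -> bool:
    return 0xD800 <= u <= 0xDBFF


def _is_low(u: int) -> bool:
    return 0xDC00 <= u <= 0xDFFF


def import_utf16(units: Sequence[int],
                 on_error: str = "reject") -> Tuple[List[int], List[int]]:
    """Validate UTF-16 code units at the point of import (hypothetical).

    on_error="reject"  raises ValueError at the first unpaired surrogate.
    on_error="replace" substitutes U+FFFD for each unpaired surrogate.
    Returns (cleaned code units, offsets of unpaired surrogates).
    """
    if on_error not in ("reject", "replace"):
        raise ValueError("on_error must be 'reject' or 'replace'")
    out: List[int] = []
    bad: List[int] = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if _is_high(u) and i + 1 < n and _is_low(units[i + 1]):
            # Well-formed pair: copy both code units through.
            out.extend((u, units[i + 1]))
            i += 2
            continue
        if _is_high(u) or _is_low(u):
            # Unpaired surrogate: apply the caller's chosen policy.
            if on_error == "reject":
                raise ValueError(
                    f"unpaired surrogate 0x{u:04X} at offset {i}")
            bad.append(i)
            out.append(REPLACEMENT)
        else:
            out.append(u)
        i += 1
    return out, bad
```

The point of the sketch is that the policy choice sits with the caller
at the border, where the user can still be asked what to do.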

However, some tasks, such as network identifier matching, are
security-sensitive and must re-validate their input, even if the data
has already passed a gatekeeper such as a validating code conversion
routine.
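
A minimal sketch of such a re-validating matcher follows (again
Python, with hypothetical names). It assumes identifiers arrive as
sequences of 16-bit code units and compares them unit for unit; a real
matcher would also need whatever normalization or case folding the
protocol requires, which is left out here.

```python
from typing import Sequence


def is_well_formed_utf16(units: Sequence[int]) -> bool:
    """Return True if every surrogate code unit is part of a valid pair."""
    expecting_low = False
    for u in units:
        if expecting_low:
            if not 0xDC00 <= u <= 0xDFFF:
                return False  # high surrogate not followed by a low one
            expecting_low = False
        elif 0xD800 <= u <= 0xDBFF:
            expecting_low = True
        elif 0xDC00 <= u <= 0xDFFF:
            return False  # low surrogate with no preceding high one
    return not expecting_low  # a trailing high surrogate is unpaired


def identifiers_match(a: Sequence[int], b: Sequence[int]) -> bool:
    """Compare two identifiers, re-validating both first (hypothetical).

    Validation is repeated here on purpose, regardless of any check
    performed upstream.
    """
    if not (is_well_formed_utf16(a) and is_well_formed_utf16(b)):
        raise ValueError("ill-formed UTF-16 in identifier")
    return list(a) == list(b)
```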

> Corrigendum #1 closed the door on interpretation of invalid UTF-8
> sequences. I'm not sure why the approach to handling UTF-16 should be
> any different.
>
>
