Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Thu Nov 04 2010 - 23:47:30 CST

Next message: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

On 11/4/2010 5:46 PM, Doug Ewell wrote:
> Markus Scherer wrote:
>
>> While processing 16-bit Unicode text which is not assumed to be
>> well-formed UTF-16, you can treat (decode) an unpaired surrogate as a
>> mostly-inert surrogate code point. However, you cannot unambiguously
>> encode a surrogate code point in 16-bit text (because you could not
>> distinguish a sequence of lead+trail surrogate code points from one
>> supplementary code point), and therefore it is not allowed to encode
>> surrogate code points in any well-formed UTF-8/16/32. [All of this is
>> discussed in The Unicode Standard, Chapter 3.]
>
> I'm probably missing something here, but I don't agree that it's OK
> for a consumer of UTF-16 to accept an unpaired surrogate without
> throwing an error, or converting it to U+FFFD, or otherwise raising a
> fuss. Unpaired surrogates are ill-formed, and have to be caught and
> dealt with.
>

The question is whether you want every library that handles strings to
perform the equivalent of a citizen's arrest, or whether you architect
things so that the gatekeepers (border control) police the data stream.

During development, early and widespread error detection is helpful in
debugging. After that, it's probably better to concentrate the handling
of these errors in a few places, because that would tend to improve your
options for implementing successful error recovery.

Malformed data shouldn't get in and shouldn't get perpetuated, but in
the general case, there should be a facility for "repairing" faulty
data, wherever that is reasonably possible.
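
To make the "gatekeeper" idea concrete, here is a minimal sketch in
Python of one possible repair facility at the border. It assumes the
text arrives as a sequence of 16-bit code units (plain integers) and
that replacing each unpaired surrogate with U+FFFD is an acceptable
repair; the function name repair_utf16 and the choice to return the
offsets of the faults are my own illustration, not an existing API.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def is_lead(unit):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16(units):
    """Return (repaired_units, error_offsets) for a list of 16-bit units.

    Well-formed lead+trail pairs are copied through unchanged; every
    unpaired surrogate is reported by offset and replaced with U+FFFD.
    """
    repaired = []
    errors = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            # Proper surrogate pair: keep both units.
            repaired.extend((unit, units[i + 1]))
            i += 2
            continue
        if is_lead(unit) or is_trail(unit):
            # Unpaired surrogate: report it and substitute U+FFFD.
            errors.append(i)
            repaired.append(REPLACEMENT)
        else:
            repaired.append(unit)
        i += 1
    return repaired, errors


if __name__ == "__main__":
    data = [0x41, 0xD800, 0x42, 0xD83D, 0xDE00, 0xDC00]
    fixed, faults = repair_utf16(data)
    print([hex(u) for u in fixed], faults)
```

Whether U+FFFD substitution, dropping the unit, or rejecting the whole
stream is the right repair depends on the application; the point is
only that the decision is made once, at the boundary.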

In the context of uppercasing a string, for example, repair is not a
reasonable option, and neither is rejecting the string at that point: it
should have been rejected or repaired much earlier.
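
As an illustration of the uppercasing case, the following Python sketch
treats an unpaired surrogate as an inert code point, in the sense
Markus described, and simply passes it through. It assumes the input is
a list of 16-bit code units and relies on Python's str.upper() for the
case mapping; the name upper_units is hypothetical.

```python
def upper_units(units):
    """Uppercase 16-bit text without validating or repairing it.

    Surrogate pairs are decoded to supplementary code points; an
    unpaired surrogate is treated as an inert code point and copied
    through unchanged.
    """
    out = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            # BMP character or unpaired surrogate: use the unit as-is.
            cp = unit
            i += 1
        # upper() may return more than one character (e.g. for U+00DF).
        for ch in chr(cp).upper():
            c = ord(ch)
            if c >= 0x10000:
                c -= 0x10000
                out.extend((0xD800 + (c >> 10), 0xDC00 + (c & 0x3FF)))
            else:
                out.append(c)
    return out


if __name__ == "__main__":
    print([hex(u) for u in upper_units([0x61, 0xD800, 0x62])])
```

The function neither throws nor substitutes anything: the unpaired
surrogate that came in goes out again, and dealing with it remains the
job of whatever sits at the border.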

A./

This archive was generated by hypermail 2.1.5 : Thu Nov 04 2010 - 23:56:26 CST