Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Martin J. Dürst (duerst@it.aoyama.ac.jp)
Date: Fri Nov 05 2010 - 02:56:57 CST

Next message: Martin J. Dürst: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

Previous message: Asmus Freytag: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
In reply to: Markus Scherer: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"
Next in thread: Doug Ewell: "Re: Utility to report and repair broken surrogate pairs in UTF-16 text"

On 2010/11/05 2:46, Markus Scherer wrote:

> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like a
> surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes to
> itself, has default Unicode property values (except for the general
> category), etc.

Well, yes, you can handle it that way, but that's pretty much GIGO
(garbage in, garbage out), and it dumps the problem on the next
person or piece of software downstream. Also, while some things
might still work, a lot won't. For example, searching for a word will
fail if the search key has a lone surrogate hidden in one place and
the text has the same word with a lone surrogate hidden in another
place, or with no such surrogate at all.
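
To make the search case concrete, here is a minimal Python sketch. It
models 16-bit Unicode text as a list of code units; the helper name
find_units and the sample data are made up for illustration and do not
come from any particular library.

```python
def find_units(haystack, needle):
    """Naive search for one sequence of 16-bit code units inside another.

    Returns the index of the first match, or -1 if there is none.
    """
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if haystack[i:i + n] == needle:
            return i
    return -1


# "word" as 16-bit code units (all BMP characters, one unit each).
clean = [ord(c) for c in "word"]

# The same word with a stray lead surrogate (0xD800) after the "o".
damaged = clean[:2] + [0xD800] + clean[2:]

print(find_units(clean, clean))    # 0: the clean word is found
print(find_units(damaged, clean))  # -1: the damaged copy never matches
```

The lone surrogate is carried along as if it were an ordinary code
point, so nothing crashes, but the two copies of the word no longer
compare equal.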

> In other words, when you process 16-bit Unicode text it takes no effort to
> handle unpaired surrogates, other than making sure that you only assemble a
> supplementary code point when a lead surrogate is really followed by a trail
> surrogate. Hence little need for cleanup functions -- but if you need one,
> it's trivial to write one for UTF-16.

For some processing this is true, but it's rather short-sighted.
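
For reference, a cleanup function of the kind described above could
look roughly like the following Python sketch. It is only one possible
approach, not a utility anyone in this thread has posted: the name
repair_utf16 is hypothetical, and replacing each unpaired surrogate
with U+FFFD is an assumption (dropping it or raising an error would be
equally easy).

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed repair policy)


def is_lead(u):
    """True if u is a lead (high) surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_trail(u):
    """True if u is a trail (low) surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def repair_utf16(units, replacement=REPLACEMENT):
    """Report and repair unpaired surrogates in a list of 16-bit code units.

    Returns (repaired_units, bad_positions), where bad_positions holds the
    indices in the input at which an unpaired surrogate was found.
    """
    out = []
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if is_lead(u) and i + 1 < n and is_trail(units[i + 1]):
            # Well-formed pair: a lead really followed by a trail.
            out.extend((u, units[i + 1]))
            i += 2
        elif is_lead(u) or is_trail(u):
            # Lone lead or lone trail: report it and substitute.
            bad.append(i)
            out.append(replacement)
            i += 1
        else:
            out.append(u)
            i += 1
    return out, bad


# "A", a valid pair (U+1F600), a lone trail, "B", a lone lead at the end.
sample = [0x0041, 0xD83D, 0xDE00, 0xDC00, 0x0042, 0xD800]
repaired, bad = repair_utf16(sample)
print(bad)  # [3, 5]
```

Whether substituting U+FFFD is the right repair depends on the
application; the point is only that detection needs nothing more than
one unit of lookahead.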

Regards, Martin.

-- 
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:duerst@it.aoyama.ac.jp


This archive was generated by hypermail 2.1.5 : Fri Nov 05 2010 - 03:01:39 CST