Re: Utility to report and repair broken surrogate pairs in UTF-16 text

From: Martin J. Dürst
Date: Fri Nov 05 2010 - 02:56:57 CST


    On 2010/11/05 2:46, Markus Scherer wrote:

    > 16-bit Unicode is convenient in that when you find an unpaired surrogate
    > (that is, it's not well-formed UTF-16) you can usually just treat it like a
    > surrogate code point which normally has default properties much like an
    > unassigned code point or noncharacter. It case-maps to itself, normalizes to
    > itself, has default Unicode property values (except for the general
    > category), etc.
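To make the quoted claim concrete, here is a small sketch in Python, whose `str` type can hold a lone surrogate as an ordinary code point. It is only an illustration of the behaviour described above, not code from either poster; the variable names are made up for the example.

```python
import unicodedata

# An unpaired lead surrogate, held as a single code point (illustrative value).
lone = "\ud800"

# The general category is the one non-default property: Cs (Surrogate).
category = unicodedata.category(lone)

# Case mapping leaves the code point unchanged.
case_stable = lone.lower() == lone and lone.upper() == lone

# Normalization leaves it unchanged as well.
nfc_stable = unicodedata.normalize("NFC", lone) == lone
nfd_stable = unicodedata.normalize("NFD", lone) == lone
```

Nothing here repairs the text; the lone surrogate simply passes through each operation untouched.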

    Well, yes, you can handle it that way, but that's pretty much GIGO
    (garbage in, garbage out) and dumping the problem on the next
    person/software downwards in the datastream. Also, while some things
    might still work, much stuff won't, e.g. when you try to find a word
    (with some lone surrogate hidden in some place) starting with the same
    word (but with some lone surrogate hidden in another place, or no such

    > In other words, when you process 16-bit Unicode text it takes no effort to
    > handle unpaired surrogates, other than making sure that you only assemble a
    > supplementary code point when a lead surrogate is really followed by a trail
    > surrogate. Hence little need for cleanup functions -- but if you need one,
    > it's trivial to write one for UTF-16.

    For some processing this is true, but it's rather short-sighted.
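For reference, the kind of cleanup function the quoted text calls trivial might look like the sketch below. It works on a list of 16-bit code units, pairs a lead surrogate only when a trail surrogate really follows, and reports every unpaired surrogate. The function name `repair_utf16` and the choice to substitute U+FFFD are assumptions made for this example; neither poster specifies a repair policy.

```python
REPLACEMENT = 0xFFFD  # assumed repair policy: substitute U+FFFD


def is_lead(unit):
    """True for a lead (high) surrogate code unit."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit."""
    return 0xDC00 <= unit <= 0xDFFF


def repair_utf16(units):
    """Return (repaired_units, indexes_of_unpaired_surrogates).

    `units` is a sequence of 16-bit code units given as ints.
    """
    repaired = []
    broken = []
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        # A lead surrogate followed by a trail surrogate is a valid pair.
        if is_lead(unit) and i + 1 < n and is_trail(units[i + 1]):
            repaired.extend((unit, units[i + 1]))
            i += 2
            continue
        if is_lead(unit) or is_trail(unit):
            # Unpaired surrogate: record its position and replace it.
            broken.append(i)
            repaired.append(REPLACEMENT)
        else:
            repaired.append(unit)
        i += 1
    return repaired, broken


# "A", a valid pair, a lone lead, "B", a lone trail.
sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
fixed, positions = repair_utf16(sample)
```

Whether to replace, drop, or merely report the broken units is a policy decision; the sketch picks replacement so that the length of the text, counted in code units, is preserved.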

    Regards, Martin.

    #-# Martin J. Dürst, Professor, Aoyama Gakuin University
