From: Martin J. Dürst (email@example.com)
Date: Fri Nov 05 2010 - 02:56:57 CST
On 2010/11/05 2:46, Markus Scherer wrote:
> 16-bit Unicode is convenient in that when you find an unpaired surrogate
> (that is, it's not well-formed UTF-16) you can usually just treat it like a
> surrogate code point which normally has default properties much like an
> unassigned code point or noncharacter. It case-maps to itself, normalizes to
> itself, has default Unicode property values (except for the general
> category), etc.
Well, yes, you can handle it that way, but that's pretty much GIGO
(garbage in, garbage out), and it dumps the problem on the next
person or piece of software further down the data stream. Also, while
some things might still work, much won't, e.g. when you try to find a
word (with some lone surrogate hidden in some place) starting with the
same word (but with some lone surrogate hidden in another place, or no
such surrogate at all).
> In other words, when you process 16-bit Unicode text it takes no effort to
> handle unpaired surrogates, other than making sure that you only assemble a
> supplementary code point when a lead surrogate is really followed by a trail
> surrogate. Hence little need for cleanup functions -- but if you need one,
> it's trivial to write one for UTF-16.
For some processing this is true, but it's rather short-sighted.
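To make the quoted claim concrete, here is a minimal sketch in Python of the two operations under discussion: decoding 16-bit code units while letting unpaired surrogates pass through as surrogate code points, and a cleanup pass that replaces them. The function names and the choice of U+FFFD as the replacement are assumptions made for illustration; neither poster specified an implementation.

```python
REPLACEMENT = 0xFFFD  # assumed replacement character; the thread does not prescribe one


def is_lead(unit):
    """True for a lead (high) surrogate code unit, U+D800..U+DBFF."""
    return 0xD800 <= unit <= 0xDBFF


def is_trail(unit):
    """True for a trail (low) surrogate code unit, U+DC00..U+DFFF."""
    return 0xDC00 <= unit <= 0xDFFF


def decode_utf16_units(units):
    """Return code points; an unpaired surrogate is passed through unchanged."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        # Assemble a supplementary code point only for a real lead+trail pair.
        if is_lead(unit) and i + 1 < len(units) and is_trail(units[i + 1]):
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(unit)
            i += 1
    return out


def clean_utf16_units(units):
    """Return a copy in which every unpaired surrogate becomes REPLACEMENT."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if is_lead(unit) and i + 1 < len(units) and is_trail(units[i + 1]):
            out.extend((unit, units[i + 1]))  # well-formed pair, keep as is
            i += 2
        elif is_lead(unit) or is_trail(unit):
            out.append(REPLACEMENT)  # lone surrogate
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


# 'A', a valid pair for U+1F600, a lone lead, 'B', a lone trail
sample = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
decoded = decode_utf16_units(sample)
cleaned = clean_utf16_units(sample)
```

The pass-through decoder is the "no effort" path from the quote; the cleanup function is the step that keeps the problem from being handed to the next consumer of the data.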
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:firstname.lastname@example.org