RE: What does it mean to "not be a valid string in Unicode"?

From: Whistler, Ken <ken.whistler_at_sap.com>
Date: Tue, 8 Jan 2013 01:52:28 +0000

Martin,

The kind of situation Markus is talking about is illustrated particularly well in collation. And Section 7.1.1 of UTS #10 is specifically devoted to this issue:

http://www.unicode.org/reports/tr10/#Handline_Illformed

When weighting Unicode 16-bit strings for collation, you can, of course, always detect an unpaired surrogate and return an error code or throw an exception, but that may not be the best strategy for an implementation.

The problem derives in part from the fact that for sorting, the comparison routine is generally buried deep down as a primitive comparison function in what may be a rather complicated sorting algorithm. Those algorithms often assume that the comparison routine is analogous to strcmp(), and will always return -1/0/1 (or negative/0/positive), and that it is not going to fail because it decides that some byte value in an input string is not valid in some particular character encoding. (Of course, the calling code needs to ensure it isn't handing off null pointers or unallocated objects, but that is par for the course for any string handling.)
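To make that contract concrete, here is a minimal sketch in Python. The names total_compare, strict_compare and has_unpaired_surrogate are hypothetical, strings are modeled as lists of 16-bit code units, and plain code-unit order stands in for real collation weights. It only illustrates how a comparator that can fail disturbs a sort that expects negative/0/positive.

```python
from functools import cmp_to_key


def has_unpaired_surrogate(units):
    """Return True if the list of 16-bit code units is not well-formed UTF-16."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # lead surrogate
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # properly paired, skip both units
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:  # trail surrogate with no lead before it
            return True
        i += 1
    return False


def total_compare(a, b):
    """strcmp()-style comparator: returns -1, 0 or 1 for ANY two inputs.

    Assumption: code-unit order is only a stand-in for real weighting.
    """
    return (a > b) - (a < b)


def strict_compare(a, b):
    """Validating comparator: raises on ill-formed input instead of ordering it."""
    if has_unpaired_surrogate(a) or has_unpaired_surrogate(b):
        raise ValueError("unpaired surrogate in input")
    return total_compare(a, b)


# One ill-formed string (a lone lead surrogate) among well-formed ones.
data = [[0x62], [0xD800], [0x61]]

# The total comparator lets the sort run to completion.
ordered = sorted(data, key=cmp_to_key(total_compare))
```

With strict_compare in place of total_compare, the same sorted() call raises ValueError from inside the sort, so every caller of the sort has to be prepared for an error path that the sorting algorithm itself knows nothing about.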

Now if I want to adapt a particular sorting algorithm so it uses a UCA-compliant, multi-level collation algorithm for the actual string comparison, then by far the easiest way to do so is to build a function essentially comparable to strcmp() in structure, e.g. UCA_strcmp(context, string1, string2), which also always returns -1/0/1 for any two Unicode 16-bit strings. If I introduce a string validation aspect to this comparison routine, and return an error code or raise an exception, then I run the risk of marginally slowing down the most time-critical part of the sorting loop, as well as complicating the adaptation of the sorting code to deal with extra error conditions. It is faster, more reliable and robust, and easier to adapt the code, if I simply specify for the weighting exactly what happens to any isolated surrogate in input strings, and compare accordingly. Hence the two alternative strategies suggested in Section 7.1.1 of UTS #10: either weight each maximal ill-formed subsequence as if it were U+FFFD (with a primary weight), or weight each surrogate code point with a generated implicit weight, as if it were an unassigned code point. Either strategy works. And in fact, the conformance tests in CollationTest.zip for UCA include some ill-formed strings in the test data, so that implementations can test their handling of them, if they choose.
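The two strategies can be sketched roughly as follows, in Python. This is a toy under stated assumptions, not a UCA implementation: there is no collation element table, a well-formed code point simply uses its own value as a stand-in primary weight, and the names code_points, weights_replacement, weights_implicit and uca_strcmp_sketch are hypothetical. The implicit-weight formula is the one for unassigned code points as I understand it from UTS #10, and should be checked against the specification before being relied on.

```python
def code_points(units):
    """Yield (code_point, well_formed) pairs from a list of 16-bit code units.

    An unpaired surrogate is yielded as itself, with well_formed set to False.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid surrogate pair: combine into one supplementary code point.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00), True
            i += 2
        else:
            yield u, not (0xD800 <= u <= 0xDFFF)
            i += 1


def implicit_weight(cp):
    """Implicit primary weight pair for an unassigned code point (assumed formula)."""
    return 0xFBC0 + (cp >> 15), (cp & 0x7FFF) | 0x8000


def weights_replacement(units):
    """Strategy 1: weight each isolated surrogate as if it were U+FFFD."""
    # Toy assumption: a code point's own value is its primary weight.
    return [cp if ok else 0xFFFD for cp, ok in code_points(units)]


def weights_implicit(units):
    """Strategy 2: give each isolated surrogate a generated implicit weight."""
    out = []
    for cp, ok in code_points(units):
        if ok:
            out.append(cp)  # toy stand-in weight, as in strategy 1
        else:
            out.extend(implicit_weight(cp))
    return out


def uca_strcmp_sketch(a, b, weigh=weights_replacement):
    """Always returns -1, 0 or 1; never fails on ill-formed input."""
    ka, kb = weigh(a), weigh(b)
    return (ka > kb) - (ka < kb)
```

In this toy, strategy 1 makes two different lone surrogates compare equal, since both are weighted as U+FFFD, while strategy 2 keeps them distinct through the second half of the implicit weight. Either way the comparison has a fully specified result for every pair of 16-bit strings.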

So in this kind of a case, what we are actually dealing with is: garbage in, principled, correct results out. ;-)

--Ken

> -----Original Message-----
 
> On 2013/01/08 3:27, Markus Scherer wrote:
>
> > Also, we commonly read code points from 16-bit Unicode strings, and
> > unpaired surrogates are returned as themselves and treated as such (e.g.,
> > in collation). That would not be well-formed UTF-16, but it's generally
> > harmless in text processing.
>
> Things like this are called "garbage in, garbage-out" (GIGO). It may be
> harmless, or it may hurt you later.
>
> Regards, Martin.
Received on Mon Jan 07 2013 - 19:53:32 CST
