Re: What does it mean to "not be a valid string in Unicode"?

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Tue, 08 Jan 2013 18:52:52 +0900

On 2013/01/08 14:43, Stephan Stiller wrote:

> Wouldn't the clean way be to ensure valid strings (only) when they're
> built

Of course, the earlier erroneous data is caught, the better. The
problem is that error checking is expensive, both in lines of code and
in execution time (I recall data showing that in real-life programs,
somewhere around 50% to 80% of the code is error checking, but I
forget the details).

So indeed, as Ken has explained with a very good example, it doesn't
make sense to check at every corner.
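To make the "validate once at construction" idea concrete, here is a
rough Python sketch (the function names are mine, purely for
illustration). A Python 3 str is a sequence of code points, not UTF-16
units, so a str containing any surrogate code point (U+D800..U+DFFF)
is not a valid Unicode string, and checking well-formedness is a
single scan:

    def is_well_formed(s):
        # A string is ill-formed as Unicode if it contains any
        # surrogate code point, regardless of how they are arranged.
        return not any(0xD800 <= ord(c) <= 0xDFFF for c in s)

    def make_string(s):
        # Validate once, at construction; downstream algorithms can
        # then assume well-formed input without re-checking it.
        if not is_well_formed(s):
            raise ValueError('string contains a lone surrogate')
        return s

The trade-off is exactly that this scan is O(n), which is affordable
once at construction but wasteful if repeated in every operation.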

> and then make sure that string algorithms (only) preserve
> well-formedness of input?
>
> Perhaps this is how the system grew, but it seems to me that it's
> yet another legacy of C pointer arithmetic, and
> about convenience of implementation rather than a
> safety or performance issue.

Convenience of implementation is an important aspect in programming.

>>> Things like this are called "garbage in, garbage out" (GIGO). It may be
>>> harmless, or it may hurt you later.
>> So in this kind of a case, what we are actually dealing with is:
>> garbage in, principled, correct results out. ;-)

Sorry, but I have to disagree here. If a list of strings contains items
with lone surrogates (garbage), then sorting them doesn't make the
garbage go away, even if the items end up sorted in the "correct" order
according to some criterion.
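A quick Python illustration of what I mean (my own example, not part
of the earlier discussion): sorting a list that contains a lone
surrogate succeeds without complaint, but the garbage is still there
the moment you try to use the result:

    items = ['b', '\ud800x', 'a']   # '\ud800x' contains a lone surrogate
    print(sorted(items))            # sorts "correctly" by code point
    try:
        sorted(items)[2].encode('utf-8')
    except UnicodeEncodeError as err:
        print(err)                  # surrogates not allowed: still garbage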

Regards, Martin.