Re: What does it mean to "not be a valid string in Unicode"?

From: Mark Davis ☕
Date: Mon, 7 Jan 2013 22:50:27 -0800

In practice and by design, treating isolated surrogates the same as
reserved code points in processing, and then cleaning up on conversion to
UTFs works just fine. It is a tradeoff that is up to the implementation.
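
As a rough illustration of that tradeoff (a minimal sketch, not any particular
library's behavior): Python's str type happens to allow isolated surrogate code
points, so it can stand in for an implementation that carries them through
processing like any other code point. The helper names is_surrogate and
to_utf8_clean are hypothetical, and replacing with U+FFFD is just one possible
cleanup policy.

```python
def is_surrogate(ch: str) -> bool:
    """Return True if ch is a code point in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def to_utf8_clean(s: str) -> bytes:
    """Convert to UTF-8, replacing each surrogate code point with U+FFFD.

    Assumption: in a Python str every surrogate code point is unpaired,
    because supplementary characters are stored as single code points.
    """
    cleaned = "".join("\ufffd" if is_surrogate(ch) else ch for ch in s)
    return cleaned.encode("utf-8")


# Processing step: the isolated surrogate is carried along untouched,
# much like an unassigned (reserved) code point would be.
text = "ab\ud800cd"
processed = text.upper()[::-1]

# Conversion step: a strict UTF-8 encoder rejects the surrogate,
# so the cleanup happens here, at the boundary.
try:
    processed.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

utf8_bytes = to_utf8_clean(processed)
```

The point of the sketch is only that the check lives at the conversion
boundary; the string operations in between do not need to know about
surrogates at all.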

It has nothing to do with a "legacy of C pointer arithmetic". It does
represent a pragmatic choice made some time ago, but there is no need to get
worked up about it. Human scripts and their representation on computers are
quite complex enough; in the grand scheme of things the handling of
surrogates in implementations pales in significance.

Mark
*— Il meglio è l’inimico del bene —*

On Mon, Jan 7, 2013 at 9:43 PM, Stephan Stiller wrote:

>>> Things like this are called "garbage in, garbage out" (GIGO). It may be
>>> harmless, or it may hurt you later.
>> So in this kind of a case, what we are actually dealing with is: garbage
>> in, principled, correct results out. ;-)
> Wouldn't the clean way be to ensure valid strings (only) when they're
> built and then make sure that string algorithms (only) preserve
> well-formedness of input?
> Perhaps this is how the system grew, but it seems to me that it's yet
> another legacy of C pointer arithmetic and about convenience of
> implementation rather than a safety or performance issue.
> Stephan
Received on Tue Jan 08 2013 - 00:53:34 CST
