Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Fri, 04 Jan 2013 22:21:14 -0800

> If you are concerned with computer security

If for example I sit on a committee that devises a new encoding form, I
would need to be concerned with the question of which /sequences of
Unicode code points/ are sound. If this is the same as "sequences of
Unicode scalar values", I would need to exclude surrogates, if I read
the standard correctly (this wasn't obvious to me on first inspection,
btw). If for example I sit on a committee that designs an optimized
compression algorithm for Unicode strings (yep, I do know about SCSU), I
might want to first convert them to some canonical internal form (say,
my array of non-negative integers). If U+<surrogate values> can be
assumed not to occur, there are 2048 fewer values a code point can
assume; that's good for compression, and as a first step I'd subtract
2048 from every scalar value above the surrogate range. Etc etc. So I do
think there are a number of very general use cases where this question
arises.
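
(To make that first step concrete, here is a minimal sketch in C++; the
function names are made up for illustration and aren't from any standard
API:)

    #include <cstdint>
    #include <stdexcept>

    // Map a Unicode scalar value (U+0000..U+D7FF, U+E000..U+10FFFF) onto a
    // dense range 0..0x10F7FF by removing the 2048-value surrogate gap.
    std::uint32_t scalar_to_dense(std::uint32_t scalar) {
        if (scalar >= 0xD800 && scalar <= 0xDFFF)
            throw std::invalid_argument("surrogate, not a scalar value");
        if (scalar > 0x10FFFF)
            throw std::invalid_argument("beyond the Unicode code space");
        return scalar < 0xD800 ? scalar : scalar - 0x800;  // 0x800 == 2048
    }

    // Inverse mapping, back to a scalar value.
    std::uint32_t dense_to_scalar(std::uint32_t dense) {
        if (dense > 0x10F7FF)
            throw std::invalid_argument("out of range");
        return dense < 0xD800 ? dense : dense + 0x800;
    }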

> For example, the original C datatype named "string", as it is
> understood and manipulated by the C standard library, has an
> /absolute/ prohibition against U+0000 anywhere inside.
>
>
> That's not as much a prohibition as an artifact of NUL-termination of
> strings. In more modern libraries, the string contents and its
> explicit length are stored together, and you can store a 00 byte just
> fine, for example in a C++ string.

Yep.
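
(A small illustration of that point, assuming nothing beyond the
standard C++ library:)

    #include <cassert>
    #include <string>

    int main() {
        // std::string stores its length explicitly, so an embedded NUL
        // byte is just another element of the contents.
        std::string s("ab\0cd", 5);  // constructor with explicit length
        assert(s.size() == 5);
        assert(s[2] == '\0');
        // By contrast, strlen(s.c_str()) would report 2, stopping at the NUL.
        return 0;
    }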

If my question is really underspecified or ill-formed, a listing of
possible interpretations somewhere (with case-specific answers) might be
useful.

Stephan
Received on Sat Jan 05 2013 - 00:25:46 CST
