Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller
Date: Fri, 04 Jan 2013 18:08:22 -0800

Thanks for all the information.

In the most general sense, are there constraints beyond all code points
lying within the range U+0000 ... U+10FFFF? If one is concerned with
computer security, violations of absolute constraints should raise a
flag; somebody could be messing with my system. Perhaps, for
internal purposes, I have stored my Unicode string in an array of
non-negative integers, and now I'm passing around this array. I don't
know anything else about that string besides it being a Unicode string.
There are no /absolute/ constraints against having any of those
1114112_dec (110000_hex) code points appearing anywhere, correct? Oh
wait, actually there are the surrogates (D800 ... DFFF); perhaps I need
to exclude them. So what else might I have overlooked? For example, the
traditional C string (a NUL-terminated char array), as it is understood
and manipulated by the C standard library, has an /absolute/ prohibition
against U+0000 anywhere inside. UTF-32 has an /absolute/ prohibition
against anything above 10FFFF. UTF-16 has an /absolute/ prohibition
against broken surrogate pairs. (Or so I understand. Mark Davis
mentioned "Unicode X-bit strings", but D76 (in sec. 3.9 of the standard)
suggests that there is no place for surrogate values outside of an
encoding form; that is, a surrogate is not a "Unicode scalar value".
Perhaps "Unicode X-bit string" should be left out of this discussion
then, or I'll need to read up on this more.)
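
A minimal sketch of the check discussed above, assuming the string is
held as a list of non-negative integers. It tests only the two
constraints named so far: the range U+0000 ... U+10FFFF and the
exclusion of surrogates, so every element must be a Unicode scalar
value. The function and constant names are hypothetical, chosen for
illustration.

```python
# Bounds taken from the discussion above.
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_scalar_value(cp):
    """Return True if cp is in range and is not a surrogate code point."""
    return (0 <= cp <= MAX_CODE_POINT
            and not (SURROGATE_FIRST <= cp <= SURROGATE_LAST))


def first_invalid_index(code_points):
    """Return the index of the first non-scalar value, or None if all pass."""
    for i, cp in enumerate(code_points):
        if not is_scalar_value(cp):
            return i
    return None
```

Note that U+0000 and noncharacters such as U+FFFF pass this check; only
out-of-range values and surrogates are flagged.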

Mark Davis' quote ("In effect, noncharacters can be thought of as
application-internal private-use code points.") would suggest
that there are really no absolute constraints. I'm just checking that my
understanding of the matter is correct.
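
For reference, the 66 noncharacters can be recognised with a short
predicate. This is only a sketch; the name is_noncharacter is
hypothetical, and whether an application flags, filters, or ignores such
code points is its own policy decision, not something this code
prescribes.

```python
def is_noncharacter(cp):
    """Return True if cp is one of the 66 Unicode noncharacters.

    These are U+FDD0 ... U+FDEF plus the last two code points of each
    of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
    """
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The mask test is true when the low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```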

Received on Fri Jan 04 2013 - 20:10:52 CST
