Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Fri, 04 Jan 2013 18:08:22 -0800

Thanks for all the information.

Is there, in the most general sense, any constraint beyond all
characters being within the range U+0000 .. U+10FFFF? If one is
concerned with computer security, violations of absolute constraints
should raise a flag; somebody could be messing with my system. Suppose,
for internal purposes, I have stored my Unicode string in an array of
non-negative integers, and I'm now passing this array around. I don't
know anything else about that string besides it being a Unicode string.
There are no /absolute/ constraints against any of those 1114112
(hex 110000) code points appearing anywhere, correct? Oh wait, actually
there are the surrogates (U+D800 .. U+DFFF); perhaps I need to exclude
them. So what else might I have overlooked? For example, the classical
C string (a NUL-terminated char array), as it is understood and
manipulated by the C standard library, has an /absolute/ prohibition
against U+0000 anywhere inside. UTF-32 has an /absolute/ prohibition
against anything above 10FFFF. UTF-16 has an /absolute/ prohibition
against broken surrogate pairs. (Or so is my understanding. Mark Davis
mentioned "Unicode X-bit strings", but D76 (in sec. 3.9 of the standard)
suggests that there is no place for surrogate values outside of an
encoding form; that is, a surrogate is not a "Unicode scalar value".
Perhaps "Unicode X-bit strings" should be left out of this discussion
then, or I'll need to read up on this more.)
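
To make this concrete, here is a rough C sketch of the two checks I have
in mind (the function names and the fixed-width array types are my own,
purely for illustration): one tests whether an array of code points
consists only of Unicode scalar values per D76, the other tests whether
a sequence of 16-bit code units is well-formed UTF-16, i.e. contains no
broken surrogate pairs.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* True if every element is a Unicode scalar value (D76): within
     * U+0000 .. U+10FFFF and not a surrogate (U+D800 .. U+DFFF). An
     * explicit length is passed, so embedded U+0000 is allowed here,
     * unlike in a NUL-terminated C string. */
    bool is_scalar_value_sequence(const uint32_t *cp, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (cp[i] > 0x10FFFF)
                return false;               /* outside the codespace */
            if (cp[i] >= 0xD800 && cp[i] <= 0xDFFF)
                return false;               /* surrogate code point */
        }
        return true;
    }

    /* True if a 16-bit code unit sequence is well-formed UTF-16: every
     * high surrogate (D800..DBFF) is immediately followed by a low
     * surrogate (DC00..DFFF), and no low surrogate stands alone. */
    bool is_well_formed_utf16(const uint16_t *u, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (u[i] >= 0xD800 && u[i] <= 0xDBFF) {
                if (i + 1 >= n || u[i + 1] < 0xDC00 || u[i + 1] > 0xDFFF)
                    return false;           /* unpaired high surrogate */
                i++;                        /* skip the paired low surrogate */
            } else if (u[i] >= 0xDC00 && u[i] <= 0xDFFF) {
                return false;               /* stray low surrogate */
            }
        }
        return true;
    }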

Mark Davis' quote ("In effect, noncharacters can be thought of as
application-internal private-use code points.") would suggest that there
really are no absolute constraints. I'm just checking that my
understanding of the matter is correct.
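
If that reading is right, a noncharacter test would be an application
policy rather than an absolute constraint. A rough sketch (again, the
function name is my own): the 66 noncharacters are U+FDD0 .. U+FDEF plus
the last two code points, U+nFFFE and U+nFFFF, of each of the 17 planes.

    #include <stdbool.h>
    #include <stdint.h>

    /* True if cp is one of the 66 noncharacters: U+FDD0 .. U+FDEF, or
     * any code point whose low 16 bits are FFFE or FFFF. These are
     * valid scalar values; rejecting them is a policy choice, not an
     * absolute requirement of the standard. */
    bool is_noncharacter(uint32_t cp)
    {
        if (cp >= 0xFDD0 && cp <= 0xFDEF)
            return true;
        return cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE;
    }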

Stephan
Received on Fri Jan 04 2013 - 20:10:52 CST
