Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Fri, 04 Jan 2013 15:03:27 -0800

> What does it mean to not be a valid string in Unicode?

Is there a concise answer in one place? For example, if one uses the
noncharacters just mentioned by Ken Whistler ("intended for
process-internal uses, but [...] not permitted for interchange"), what
precisely does that mean? /Naively/, all strings over the alphabet
{U+0000, ..., U+10FFFF} seem "valid", but Section 16.7 clarifies that
noncharacters are "forbidden for use in open interchange of Unicode text
data".

I'm assuming there is a set of isValidString(...)-type ICU calls that
deal with this? Yes, I'm sure this has been asked before and that the
ICU documentation has an answer, but this page
     http://www.unicode.org/faq/utf_bom.html
contains lots of scattered factlets, and it's imo unclear how they add
up to a single definition. The FAQ also says an implementation can use
characters that are "invalid in interchange", but I wouldn't expect
implementation-internal aspects of anything to be subject to a standard
in the first place (so why write this?).

It also makes me wonder about the runtime of an algorithm checking
whether a string of a particular length is a valid Unicode string. The
answer is of course "linear" complexity-wise, but since such a check (or
a variation of it, depending on how one treats holes and noncharacters)
depends on where those special code points sit in the code space, how
fast does this function perform in practice? This also relates to
Markus Scherer's reply to the "holes" thread just now.
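
For concreteness, here is a minimal sketch of the kind of check I have
in mind (the helper names are mine, not ICU's; I'm assuming "valid"
means: every code point is a Unicode scalar value, i.e. not a surrogate,
and not one of the 66 noncharacters of Section 16.7):

     #include <cstdint>
     #include <cstdio>
     #include <vector>

     // Hypothetical helpers -- not ICU API. A code point is a Unicode
     // scalar value if it lies in [U+0000, U+10FFFF] and is not a
     // surrogate code point (U+D800..U+DFFF).
     bool isScalarValue(uint32_t cp) {
         return cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF);
     }

     // The 66 noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF
     // at the end of each of the 17 planes.
     bool isNoncharacter(uint32_t cp) {
         return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
     }

     // One reading of "valid for open interchange": a single linear
     // pass, two constant-time range tests per code point.
     bool isValidForInterchange(const std::vector<uint32_t>& codePoints) {
         for (uint32_t cp : codePoints) {
             if (!isScalarValue(cp) || isNoncharacter(cp)) return false;
         }
         return true;
     }

     int main() {
         std::vector<uint32_t> ok  = {0x48, 0x69};   // "Hi"
         std::vector<uint32_t> bad = {0x48, 0xFDD0}; // has a noncharacter
         std::printf("%d %d\n", isValidForInterchange(ok),
                     isValidForInterchange(bad));    // prints "1 0"
     }

In a sketch like this the position of the special code points only
affects the constant factor, never the asymptotics, which is part of
what I'm asking about.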

Stephan
Received on Fri Jan 04 2013 - 17:04:43 CST
