Re: What does it mean to "not be a valid string in Unicode"?

From: Mark Davis ☕ <mark_at_macchiato.com>
Date: Fri, 4 Jan 2013 17:12:23 -0800

Whether a string is valid depends entirely on what the string is supposed
to be.

1. As Ken says, if a string is supposed to be in a given encoding form
(UTF), but it consists of an ill-formed sequence of code units for that
encoding form, it is invalid. So an isolated surrogate (e.g., 0xD800) in
UTF-16, or any surrogate (e.g., 0x0000D800) in UTF-32, would make the
string invalid. For example, a Java String may be an invalid UTF-16 string.
See http://www.unicode.org/glossary/#unicode_encoding_form
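
To make #1 concrete, here is a minimal Java sketch (the class name and
variable names are mine, not anything from the standard): a Java String
may legally hold an isolated surrogate, but a CharsetEncoder configured
to report errors rejects it as ill-formed UTF-16 when converting to UTF-8.

    import java.nio.CharBuffer;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.MalformedInputException;
    import java.nio.charset.StandardCharsets;

    public class IllFormedUtf16 {
        public static void main(String[] args) throws Exception {
            // Legal as a Java String, but not a well-formed UTF-16
            // code unit sequence: 0xD800 is an isolated high surrogate.
            String s = "a\uD800b";

            CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT);
            try {
                encoder.encode(CharBuffer.wrap(s));
            } catch (MalformedInputException e) {
                System.out.println("Ill-formed UTF-16 input: " + e);
            }
        }
    }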

2. However, a "Unicode X-bit string" does not have the same restrictions:
it may contain sequences that would be ill-formed in the corresponding UTF-X
encoding form. So a Java String is always a valid Unicode 16-bit string.
See http://www.unicode.org/glossary/#unicode_string
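
Continuing the sketch above, the same lone-surrogate String is a perfectly
good Unicode 16-bit string, and every ordinary Java String operation works
on it:

    String s = "a\uD800b";                 // valid Unicode 16-bit string
    System.out.println(s.length());        // 3 code units
    System.out.println((int) s.charAt(1)); // 55296, i.e. 0xD800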

3. Noncharacters are also valid in interchange, depending on the sense of
"interchange". The TUS says: "In effect, noncharacters can be thought of as
application-internal private-use code points." If I could never interchange
them, even internally to my application, or between different modules
that compose my application, they would be pointless. They are, however,
strongly discouraged in *public* interchange. The glossary entry and some
of the standard text are a bit old here and need to be clarified.
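
For illustration, here is a small helper (the name isNoncharacter is mine;
java.lang.Character has no such method) that tests for the 66 noncharacters:
U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.

    // Assumes cp is already a valid code point (0x0..0x10FFFF).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF)  // the contiguous block
                || (cp & 0xFFFE) == 0xFFFE;    // U+xxFFFE and U+xxFFFF
    }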

4. The quotation "we select a substring that begins with a combining
character, this new string will not be a valid string in Unicode." is
wrong. It *is* a valid Unicode string. It isn't particularly useful in
isolation, but it is valid. For some *specific purpose*, any particular
string might be invalid. For example, the string mark#d might be invalid
as a password in some systems, where # is disallowed or where passwords
must be at least 8 characters long.
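
A quick Java sketch of #4 (the example strings are my own): taking a
substring that starts at a combining mark yields a string that is still
valid, just of limited use on its own.

    String decomposed = "e\u0301tude";     // "étude" with U+0301 COMBINING ACUTE ACCENT
    String tail = decomposed.substring(1); // begins with the combining mark
    // tail is a valid Unicode string; its leading combining mark simply
    // has no base character to attach to.
    System.out.println(tail.codePointAt(0) == 0x0301); // true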

Mark <https://plus.google.com/114199149796022210033>
— The best is the enemy of the good —

On Fri, Jan 4, 2013 at 3:10 PM, Stephan Stiller
<stephan.stiller_at_gmail.com> wrote:

>
> A Unicode string in UTF-8 encoding form could be ill-formed if the bytes
>> don't follow the specification for UTF-8, for example.
>>
> Given that answer, add "in UTF-32" to my email just now, for simplicity's
> sake. Or let's simply assume we're dealing with some sort of sequence of
> abstract integers from 0x0 to 0x10FFFF, to abstract away from "encoding
> form" issues.
>
> Stephan
>
>
>
Received on Fri Jan 04 2013 - 19:15:14 CST
