From: Asmus Freytag (firstname.lastname@example.org)
Date: Mon Jun 29 2009 - 21:07:19 CDT
On 6/29/2009 6:42 PM, Adam Twardoch wrote:
> Doug Ewell wrote:
>> I think the original poster's point was that he didn't know what his
>> input would look like, let alone have control over it.
> Right. But the point is, U+0000 is a valid Unicode codepoint, the NULL.
> It may be part of a Unicode string. It may be of limited use, but it is
> a codepoint. Using a codepoint for termination is not the best idea.
Correct, but because of the rampant use of NUL as a terminator, ordinary
*text* files do not contain NUL as part of the data.
Nevertheless U+0000 is both a code point and a character.
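To make the distinction concrete, here is a minimal Python sketch. It assumes a length-counted string type (Python's str), and the helper read_nul_terminated is a hypothetical stand-in for a C-style API that stops at the first NUL; neither comes from the original discussion.

```python
import unicodedata

text = "abc\u0000def"  # U+0000 embedded in the middle of the data

# U+0000 is an assigned control character (general category Cc), and a
# length-counted string keeps it like any other element.
print(unicodedata.category("\u0000"))
print(len(text))


def read_nul_terminated(buffer: str) -> str:
    """Hypothetical helper: return everything before the first NUL,
    the way a C-style NUL-terminated API would see the buffer."""
    return buffer.split("\u0000", 1)[0]


# The terminator convention silently drops the data after U+0000.
print(read_nul_terminated(text))
```

The string itself is perfectly valid; only the terminator convention loses the data after the embedded U+0000.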
> U+FFFF is not a valid Unicode codepoint,
Incorrect. It's definitely a code point.
> it's not part of Unicode.
Incorrect. It's definitely part of the Unicode Standard.
> It may not be part of a Unicode string.
Incorrect. What it should not be is part of a Unicode string that claims
to contain only characters, because U+FFFF is not a character, merely a
code point. (Unicode strings are not required to be well-formed unless
that is claimed separately.)
It should not be part of plain text data.
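A check for this could be sketched as follows. The function names is_noncharacter and has_noncharacter are made up for illustration; the ranges are the 66 noncharacters the standard defines (U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes).

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if the code point cp is one of the 66 noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, or U+xxFFFE / U+xxFFFF in any plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def has_noncharacter(text: str) -> bool:
    """Return True if any code point in text is a noncharacter."""
    return any(is_noncharacter(ord(ch)) for ch in text)


# U+FFFF is a code point a string can hold, yet it is not a character.
print(has_noncharacter("abc\uffff"))
# U+0000 is a character, so it does not trip the check.
print(has_noncharacter("abc\u0000"))
```

A receiver of plain text data could use such a check to flag input that carries U+FFFF; what to do with flagged input is a policy decision this sketch does not make.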
> So by definition, it can be used
> for termination.
Because the terminator isn't considered part of the data, using a
convention that has a non-character is fine - however, use of a
non-character in public interchange would be somewhat questionable,
unless it was firmly scoped by a well-defined protocol.
Creating arbitrary data with U+FFFF in it, in the hope that other systems
will treat it as a terminator, goes against the idea that U+FFFF, as a
non-character, is not part of interchanged data. U+FFFF is definitely
never plain text.
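As a rough sketch of a terminator convention that stays internal, the following uses U+FFFF to delimit records inside one process and never lets it reach the data itself. The names TERMINATOR, pack and unpack are hypothetical, and rejecting records that already contain U+FFFF is an assumed policy, not something the standard prescribes.

```python
TERMINATOR = "\uFFFF"  # noncharacter, used only inside this process


def pack(records):
    """Concatenate records into one buffer, each followed by the terminator."""
    for record in records:
        if TERMINATOR in record:
            # Assumed policy: the terminator must never occur in the data.
            raise ValueError("record contains U+FFFF")
    return "".join(record + TERMINATOR for record in records)


def unpack(buffer):
    """Split a packed buffer back into records, dropping the terminators."""
    if buffer and not buffer.endswith(TERMINATOR):
        raise ValueError("buffer is not terminated by U+FFFF")
    # The split leaves one empty piece after the final terminator.
    return buffer.split(TERMINATOR)[:-1]


records = ["first", "", "with\u0000NUL"]
buffer = pack(records)
print(unpack(buffer) == records)
```

Because the terminator is not a character, a record can still carry U+0000 or be empty. Only the unpacked records, never the packed buffer, would be handed to other systems.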