Re: Zero termination

From: Asmus Freytag
Date: Mon Jun 29 2009 - 21:07:19 CDT

    On 6/29/2009 6:42 PM, Adam Twardoch wrote:
    > Doug Ewell wrote:
    >> I think the original poster's point was that he didn't know what his
    >> input would look like, let alone have control over it.
    > Right. But the point is, U+0000 is a valid Unicode codepoint, the NULL.
    > It may be part of a Unicode string. It may be of limited use, but it is
    > a codepoint. Using a codepoint for termination is not the best idea.
    Correct, but because NUL is so widely used as a terminator, ordinary
    *text* files do not contain NUL as part of their data.

    Nevertheless, U+0000 is both a code point and a character.
    > U+FFFF is not a valid Unicode codepoint,
    Incorrect. It's definitely a code point.
    > it's not part of Unicode.
    Incorrect. It's definitely part of the Unicode Standard.
    > It may not be part of a Unicode string.
    Incorrect. What it should not be is part of a Unicode string that claims
    to contain only characters, because U+FFFF is not a character, merely a
    code point. (Unicode strings are not required to be well-formed unless
    that is claimed separately.)

    It should not be part of plain text data.
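
The distinctions drawn so far (code point versus character versus non-character) can be sketched in a few lines of Python. This is only an illustration: the names `is_noncharacter` and `describe` are invented for this example, and the "unassigned" test depends on the Unicode version shipped with the Python build.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for the 66 non-characters: U+FDD0..U+FDEF plus the last
    two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def describe(cp: int) -> str:
    """Hypothetical helper: classify an integer along the lines of the thread."""
    if not 0 <= cp <= 0x10FFFF:
        return "not a code point"
    if is_noncharacter(cp):
        return "code point, noncharacter"
    if 0xD800 <= cp <= 0xDFFF:
        return "code point, surrogate"
    if unicodedata.category(chr(cp)) == "Cn":
        return "code point, unassigned"
    return "code point, character"


for cp in (0x0000, 0x0041, 0xFFFF, 0x110000):
    print(f"U+{cp:04X}: {describe(cp)}")
```

Under these assumptions U+0000 is reported as a character and U+FFFF as a non-character, while both count as code points.
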
    > So by definition, it can be used
    > for termination.
    Because the terminator isn't considered part of the data, a
    convention that uses a non-character as the terminator is fine. However,
    using a non-character in public interchange would be somewhat
    questionable unless it was firmly scoped by a well-defined protocol.

    Creating arbitrary data that contains U+FFFF, in the hope that other
    systems will treat it as a terminator, goes against the idea that
    U+FFFF, as a non-character, is not part of interchanged data. U+FFFF is
    definitely never plain text.
