U+0000 in C strings (was: Re: Opinions on this Java URL?)

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 14 2004 - 23:48:22 CST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > Nulls are legal Unicode characters, also for use in plain text, and
    > have been since the beginning in ASCII and all ISO 8-bit charset
    > standards. Why do you want a legal Unicode string containing NULL
    > (U+0000) *characters* to become illegal when converted to C strings?

    Because it wasn't valid in C before? Because C programmers rely on
    U+0000 as an end-of-string indicator? This is about C strings, not
    Unicode conformance.

    > A null *CHARACTER* is valid in a C string, because C does not mandate
    > the string encoding (which varies according to locale conventions at
    > run-time).
    > It just assigns a special role to the null *BYTE* as an end-of-string
    > terminator.

    The standard says in section 5.2.1, "A byte with all bits set to 0,
    called the null character, shall exist in the basic execution character
    set; it is used to terminate a character string." Obviously this
    presumes that bytes and characters are the same, even while the standard
    goes on to describe the behavior of multibyte characters. But
    it clearly does not provide any apparatus by which the "null character"
    can be divorced from the "null byte," such that one is valid and the
    other is not.
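    The convention is easy to demonstrate. A minimal C sketch (the buffer
    contents are illustrative):

    ```c
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Six bytes of data; byte index 2 is a null byte. */
        char buf[] = { 'a', 'b', '\0', 'c', 'd', '\0' };

        /* strlen() stops at the first null byte: as a C string, this is "ab". */
        printf("strlen: %zu\n", strlen(buf));   /* prints "strlen: 2" */

        /* The bytes after the null are still in memory, but every standard
           string function ignores them. */
        printf("string: %s\n", buf);            /* prints "string: ab" */
        return 0;
    }
    ```

    Whether those trailing bytes are "part of the string" is exactly the
    question at issue; to strlen(), strcpy(), and friends, they do not exist.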

    > There are many reasons why one would want to store null *characters*
    > in C strings, using a proper escaping mechanism (a transport syntax
    > like the transformation of 00 generated by UTF-8, into C080) or an
    > encoding scheme (UTF-8 does not fit here, one needs another scheme
    > like the Sun modified version).

    As soon as you can think of one, let me know. I can think of plenty of
    *binary* protocols that require zero bytes, but no *text* protocols.
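    For reference, the "Sun modified version" Philippe mentions is Java's
    modified UTF-8 (documented for java.io.DataInput), which encodes U+0000
    as the two-byte overlong sequence 0xC0 0x80 instead of a single 0x00
    byte, so the encoded string never contains a zero byte. A C sketch of
    just that transformation (the function name and interface are my own,
    not any real library's):

    ```c
    #include <stddef.h>

    /* Illustrative encoder: copies 'src' (n bytes) into 'dst', replacing
       each 0x00 byte with the pair 0xC0 0x80, so the result contains no
       zero byte and can be handled as an ordinary C string. 'dst' must
       have room for 2*n + 1 bytes. Returns the encoded length. */
    size_t encode_modified(const unsigned char *src, size_t n,
                           unsigned char *dst) {
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
            if (src[i] == 0x00) {
                dst[j++] = 0xC0;   /* overlong two-byte form of U+0000 */
                dst[j++] = 0x80;
            } else {
                dst[j++] = src[i];
            }
        }
        dst[j] = '\0';             /* terminator, now unambiguous */
        return j;
    }
    ```

    After encoding, strlen() sees the full string; the price is that the
    output is not valid standard UTF-8, since overlong forms are forbidden.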

    Peter Kirk <peterkirk at qaya dot org> wrote:

    > A string of Unicode characters (including control characters as well
    > as text) may consist of any valid Unicode character, and U+0000 is
    > (for better or for worse) a valid Unicode character. Therefore some
    > such escape mechanism is required to represent an arbitrary string of
    > Unicode characters (in a UTF-8-lookalike representation) in a way
    > compatible with C string handling.

    This has nothing to do with whether U+0000 is a valid Unicode character,
    or whether a string containing U+0000 is a valid Unicode string. Of
    course it is. But the convention in C is to treat it as an
    end-of-string marker.

    > Otherwise what would happen? Would it be acceptable for Java programs
    > to crash, or even throw error messages, if presented with Unicode
    > strings including U+0000?

    Peter, what do you think? Is that what I said? I said it should signal
    the end of the string, as it does in C.

    Perhaps a more suitable design for Java, one more in keeping with the
    design of Unicode, would have been to terminate strings with the
    noncharacter code point U+FFFF. That would have made any special
    handling of U+0000 unnecessary.
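    As a thought experiment only (no real string API works this way), such
    a convention might look like:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: length of a UTF-16 code-unit string terminated by the
       noncharacter 0xFFFF instead of 0x0000. U+0000 would then need no
       special treatment and could appear freely inside the string. */
    size_t u16len_ffff(const uint16_t *s) {
        size_t n = 0;
        while (s[n] != 0xFFFFu)
            n++;
        return n;
    }
    ```

    Since U+FFFF is a noncharacter, it can never occur in interchanged
    text, making it a safer sentinel than U+0000.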

    This is becoming less and less important to me personally, as I spend
    most of my programming time using C++ with MFC (which has a CString
    type, whose implementation I generally don't care about) or C# (which
    has a built-in String type, whose implementation I generally don't care
    about). What worries me is the confusion and security hole implicit in
    having two different representations of U+0000, one whose bytewise
    representation contains the byte 0x00 and thus terminates a string, and
    another which does not and thus does not.
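    That security concern is why conforming UTF-8 decoders must reject
    overlong forms. A minimal check for just this one case (a sketch, not a
    full validator; the function name is illustrative):

    ```c
    #include <stdbool.h>
    #include <stddef.h>

    /* Sketch: scan a byte string for the overlong pair 0xC0 0x80. A strict
       UTF-8 validator rejects this sequence (and all other overlong forms),
       precisely so that U+0000 has exactly one encoding: the single byte
       0x00. The byte 0xC0 can never begin a valid UTF-8 sequence. */
    bool contains_overlong_nul(const unsigned char *p, size_t n) {
        for (size_t i = 0; i + 1 < n; i++)
            if (p[i] == 0xC0 && p[i + 1] == 0x80)
                return true;
        return false;
    }
    ```

    A filter that checks for 0x00 but lets 0xC0 0x80 through can be tricked
    into passing a string that later decodes to contain U+0000, which is
    exactly the two-representations problem described above.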

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Sun Nov 14 2004 - 23:50:05 CST