Re: U+0000 in C strings

From: Peter Kirk
Date: Mon Nov 15 2004 - 11:14:15 CST

    On 15/11/2004 16:38, Doug Ewell wrote:

    >Peter Kirk <peterkirk at qaya dot org> wrote:
    >>>I'd still like to know what practical, real-world TEXT-related
    >>>benefits would derive from allowing U+0000 in strings of TEXT in a C
    >>The practical situation which I have in mind (although not important
    >>to me personally as I do very little programming - I am making this
    >>point more for the general good) is when (hypothetically) I am trying
    >>to write a program in C, or Java, or whatever, to process an arbitrary
    >>string of Unicode characters, perhaps received from the Internet,
    >>before handing them on to a higher level processor. My program works
    >>fine until someone, for whatever (possibly malicious) reason, sends a
    >>string containing U+0000. At that point my program crashes, or does
    >>something I did not intend which may be a security risk. It might well
    >>be a security risk if the task of my program is to scan the string for
    >>security issues, and if none are found it passes on the Unicode string
    >>including U+0000 and what follows it.
    >The key to your scenario is "an arbitrary string of Unicode characters."
    >Text processing is a special case of arbitrary "binary" data processing
    >(a misnomer, of course, since all computer data is "binary," but we have
    >no better term for "non-text").

    OK, maybe by your strict definition what I am talking about is not TEXT
    processing. But neither is it "binary" processing. It is the processing
    of a valid sequence of Unicode characters, as defined for example in
    Unicode conformance clause C10:

    > C10 When a process purports not to modify the interpretation of a
    > valid coded character representation, it shall make no change to that
    > coded character representation other than the possible replacement of
    > character sequences by their canonical-equivalent sequences or the
    > deletion of noncharacter code points.

    Suppose I am implementing a process, any process, which "purports not to
    modify the interpretation of a valid coded character representation" and
    so must conform to C10. Since U+0000 is not a noncharacter code point,
    my process must not delete U+0000, nor must it delete or ignore
    characters which follow U+0000. If my process acts non-conformantly by
    doing either of these things, it damages valid data, and creates a
    security risk. My process therefore needs to store its data in a data
    type which accepts U+0000 in the middle of a sequence. A NUL-terminated
    C string holding standard UTF-8 is not such a type, because the 0x00
    byte which encodes U+0000 is also the string terminator, and so it
    cannot be used in a process conforming to C10. The Java type which
    people are objecting to is such a type, and so can be used in a process
    conforming to C10.
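
    To illustrate the point, here is a minimal sketch in C, not taken from
    any real program. The names counted_str, contains_nul_terminated,
    contains_counted, sample and demo are hypothetical, and the "<script>"
    pattern merely stands in for whatever a security scanner might look
    for. It assumes the input arrives as UTF-8 bytes with a known length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical counted-string type: the length is stored explicitly,
   so a 0x00 byte (U+0000 in UTF-8) is ordinary data, not a terminator. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
} counted_str;

/* Sample input: "ok", then U+0000, then "<script>" (11 bytes in total). */
static const unsigned char sample[] = {
    'o', 'k', 0x00, '<', 's', 'c', 'r', 'i', 'p', 't', '>'
};

/* Scan that trusts NUL termination: strstr stops at the first 0x00,
   so anything after U+0000 is never examined. */
int contains_nul_terminated(const char *s, const char *needle)
{
    return strstr(s, needle) != NULL;
}

/* Length-aware scan: examines every byte, including those after 0x00. */
int contains_counted(counted_str s, const char *needle)
{
    size_t n = strlen(needle);
    if (n == 0)
        return 1;
    for (size_t i = 0; i + n <= s.len; i++) {
        if (memcmp(s.bytes + i, needle, n) == 0)
            return 1;
    }
    return 0;
}

/* Runs both scans on the sample and reports what each one saw. */
void demo(void)
{
    counted_str cs = { sample, sizeof sample };
    int seen_by_c_string = contains_nul_terminated((const char *)sample, "<script>");
    int seen_by_counted = contains_counted(cs, "<script>");

    /* The C-string view ends at the embedded 0x00. */
    assert(strlen((const char *)sample) < sizeof sample);

    printf("C string scan found pattern: %d\n", seen_by_c_string);
    printf("counted scan found pattern:  %d\n", seen_by_counted);
}
```

    Calling demo() from main shows the two scans disagreeing on the same
    bytes: the NUL-terminated scan never looks past U+0000, while the
    counted scan does. If the full buffer is then handed on unchanged, the
    first scanner has approved data it never examined.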

    Peter Kirk
