Re: U+0000 in C strings

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Nov 15 2004 - 10:38:46 CST


    Peter Kirk <peterkirk at qaya dot org> wrote:

    >> I'd still like to know what practical, real-world TEXT-related
    >> benefits would derive from allowing U+0000 in strings of TEXT in a C
    >> program.
    >
    > The practical situation which I have in mind (although not important
    > to me personally as I do very little programming - I am making this
    > point more for the general good) is when (hypothetically) I am trying
    > to write a program in C, or Java, or whatever, to process an arbitrary
    > string of Unicode characters, perhaps received from the Internet,
    > before handing them on to a higher level processor. My program works
    > fine until someone, for whatever (possibly malicious) reason, sends a
    > string containing U+0000. At that point my program crashes, or does
    > something I did not intend which may be a security risk. It might well
    > be a security risk if the task of my program is to scan the string for
    > security issues, and if none are found it passes on the Unicode string
    > including U+0000 and what follows it.

    The key to your scenario is "an arbitrary string of Unicode characters."
    Text processing is a special case of arbitrary "binary" data processing
    (a misnomer, of course, since all computer data is "binary," but we have
    no better term for "non-text").

    Data that is identified as "text" is generally expected to follow
    certain conventions, and may be subjected to certain operations that
    would "corrupt" non-text data. For example, CR-LF pairs may be
    converted to LF, or vice versa, to match the line-end conventions of the
    server (or expected client). Trailing spaces and tabs might be deleted,
    and tabs may be expanded to spaces (or vice versa). The program I am
    using right now, Outlook Express, takes the continuous long line of text
    that I am typing and chops it into lines of no more than 72 characters.
    These are common text operations, and if I applied them to "binary"
    data, such as an executable, I would be corrupting the data.

    If I am working with text, I can expect that CR-LF means "end of line,"
    and bare LF also means "end of line," and the two can be freely
    interconverted to suit my system. I can expect that the sequence LF LF
    CR LF is some sort of corruption, since it mixes the two conventions,
    probably unintentionally.
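
    To make "freely interconverted" concrete, here is a minimal C sketch
    (the helper name crlf_to_lf is my own invention) that normalizes
    CR-LF pairs to bare LF in place, exactly the kind of rewrite that is
    routine for text and destructive for anything else:

        #include <stdio.h>
        #include <string.h>

        /* Collapse each CR-LF pair to a bare LF, in place. */
        static size_t crlf_to_lf(char *s)
        {
            char *src = s, *dst = s;
            while (*src) {
                if (src[0] == '\r' && src[1] == '\n')
                    src++;              /* drop the CR, keep the LF */
                *dst++ = *src++;
            }
            *dst = '\0';
            return (size_t)(dst - s);
        }

        int main(void)
        {
            char text[] = "one\r\ntwo\r\nthree\r\n";
            printf("%zu bytes after normalizing\n", crlf_to_lf(text));
            /* prints "14 bytes after normalizing" */
            return 0;
        }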

    Similarly, I can expect that the presence of a zero byte in the middle
    of supposedly "text" data represents some sort of error, perhaps a
    glitch brought about by line noise or something. If I have what I
    believe to be "text," I can expect to be able to use C-style text I/O,
    which reads a "line" from disk by reading characters until an LF is hit,
    without fear that the LF is supposed to be "part of the string" and that
    this will somehow corrupt the data. I can process the text using tools
    like strlen() and strchr() and strcpy(), all of which assume that the
    zero byte represents the end of the string.
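
    A small C sketch of the failure mode in the quoted scenario (the
    buffer contents below are invented for illustration): as soon as a
    zero byte turns up in the middle of supposedly "text" data, the
    string functions treat it as the terminator and everything after it
    silently disappears:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            /* "abc", then U+0000, then "evil": 8 bytes of payload,
               plus the final terminator supplied by the initializer. */
            char data[] = { 'a', 'b', 'c', '\0', 'e', 'v', 'i', 'l', '\0' };

            char copy[sizeof data];
            strcpy(copy, data);                      /* copies only "abc" */

            printf("strlen() sees %zu of %zu payload bytes\n",
                   strlen(data), sizeof data - 1);   /* 3 of 8 */
            printf("the copy holds \"%s\"\n", copy); /* "abc" */
            return 0;
        }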

    If I am working with arbitrary binary data, I cannot make any of these
    assumptions, and thus I cannot use any of these text-processing tools
    safely. That is the difference which governs the appropriateness or
    inappropriateness of finding a zero byte in data.
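
    By way of contrast, a rough sketch of what handling the same bytes as
    arbitrary binary data looks like: the length travels with the buffer,
    and memcpy() and memchr() take the place of the NUL-terminated string
    functions:

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
            const unsigned char data[] =
                { 'a', 'b', 'c', 0x00, 'e', 'v', 'i', 'l' };
            size_t len = sizeof data;    /* an explicit length, not a NUL */

            unsigned char copy[sizeof data];
            memcpy(copy, data, len);     /* all 8 bytes, zero included */

            const unsigned char *zero = memchr(data, 0x00, len);
            printf("zero byte at offset %td of %zu\n", zero - data, len);
            return 0;
        }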

    > What should my program have done? It could have flagged U+0000 as an
    > illegal character, but it is not; there might be a good reason for it
    > being in the string, and it is not the business of my program to
    > interpret such things. If I am going to use string handling at all, I
    > need to use some kind of escape mechanism to stop this legal U+0000
    > being misinterpreted. For better or for worse, Java provides a
    > mechanism for this situation.

    How does Java indicate the end of a string? It can't use the value
    U+0000, as C does, because the "modified UTF-8" sequence C0 80 still
    gets translated as U+0000. And if the answer is that Java uses a length
    count, and therefore doesn't care about zero bytes, then why is there a
    need to encode U+0000 specially?
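
    For what it's worth, a minimal C sketch of the decoding step referred
    to above (the function is my own, not any real library's): a decoder
    that accepts the overlong two-byte form, as "modified UTF-8" does,
    turns C0 80 right back into U+0000, even though the encoded bytes
    themselves contain no zero byte. A strict UTF-8 decoder would reject
    the sequence as overlong.

        #include <stdio.h>

        /* Decode a two-byte sequence 110xxxxx 10xxxxxx without the
           overlong check, the way a "modified UTF-8" reader would. */
        static int decode_two_byte(unsigned char b1, unsigned char b2)
        {
            if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
                return -1;               /* not a two-byte sequence */
            return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        }

        int main(void)
        {
            printf("C0 80 decodes to U+%04X\n",
                   (unsigned)decode_two_byte(0xC0, 0x80));
            /* prints "C0 80 decodes to U+0000" */
            return 0;
        }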

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


