Re: U+0000 in C strings

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 15 2004 - 11:14:15 CST

Next message: Philippe Verdy: "Re: U+0000 in C strings"

Previous message: Mark Davis: "Re: U+0000 in C strings"
In reply to: Doug Ewell: "Re: U+0000 in C strings"
Next in thread: Philippe Verdy: "Re: U+0000 in C strings"

On 15/11/2004 16:38, Doug Ewell wrote:

>Peter Kirk <peterkirk at qaya dot org> wrote:
>
>>>I'd still like to know what practical, real-world TEXT-related
>>>benefits would derive from allowing U+0000 in strings of TEXT in a C
>>>program.
>>>
>>The practical situation which I have in mind (although not important
>>to me personally as I do very little programming - I am making this
>>point more for the general good) is when (hypothetically) I am trying
>>to write a program in C, or Java, or whatever, to process an arbitrary
>>string of Unicode characters, perhaps received from the Internet,
>>before handing them on to a higher level processor. My program works
>>fine until someone, for whatever (possibly malicious) reason, sends a
>>string containing U+0000. At that point my program crashes, or does
>>something I did not intend which may be a security risk. It might well
>>be a security risk if the task of my program is to scan the string for
>>security issues, and if none are found it passes on the Unicode string
>>including U+0000 and what follows it.
>
>The key to your scenario is "an arbitrary string of Unicode characters."
>Text processing is a special case of arbitrary "binary" data processing
>(a misnomer, of course, since all computer data is "binary," but we have
>no better term for "non-text").

OK, maybe by your strict definition what I am talking about is not TEXT
processing. But neither is it "binary" processing. It is the processing
of a valid sequence of Unicode characters, as covered for example by
Unicode conformance clause C10:

> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.

Suppose I am implementing a process, any process, which "purports not to
modify the interpretation of a valid coded character representation" and
so must conform to C10. Since U+0000 is not a noncharacter code point,
my process must not delete U+0000, nor may it delete or ignore the
characters which follow U+0000. If my process acts non-conformantly by
doing either of these things, it damages valid data and creates a
security risk. My process therefore needs to store its data in a data
type which accepts U+0000 in the middle of a sequence. A null-terminated
C string encoded in UTF-8 is not such a type, because the 0x00 byte
which encodes U+0000 also marks the end of the string, and so it cannot
be used in a process conforming to C10. The Java type which people are
objecting to is such a type, and so can be used in a process conforming
to C10.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Mon Nov 15 2004 - 12:00:12 CST