From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 15 2004 - 11:14:15 CST
On 15/11/2004 16:38, Doug Ewell wrote:
>Peter Kirk <peterkirk at qaya dot org> wrote:
>
>>>I'd still like to know what practical, real-world TEXT-related
>>>benefits would derive from allowing U+0000 in strings of TEXT in a C
>>>program.
>>>
>>The practical situation which I have in mind (although not important
>>to me personally as I do very little programming - I am making this
>>point more for the general good) is when (hypothetically) I am trying
>>to write a program in C, or Java, or whatever, to process an arbitrary
>>string of Unicode characters, perhaps received from the Internet,
>>before handing them on to a higher level processor. My program works
>>fine until someone, for whatever (possibly malicious) reason, sends a
>>string containing U+0000. At that point my program crashes, or does
>>something I did not intend which may be a security risk. It might well
>>be a security risk if the task of my program is to scan the string for
>>security issues, and if none are found it passes on the Unicode string
>>including U+0000 and what follows it.
>
>The key to your scenario is "an arbitrary string of Unicode characters."
>Text processing is a special case of arbitrary "binary" data processing
>(a misnomer, of course, since all computer data is "binary," but we have
>no better term for "non-text").
>
OK, maybe by your strict definition what I am talking about is not TEXT
processing. But it is not "binary" either: it is the processing of a
valid sequence of Unicode characters, as covered for example by Unicode
conformance clause C10:
> C10 When a process purports not to modify the interpretation of a
> valid coded character representation, it shall make no change to that
> coded character representation other than the possible replacement of
> character sequences by their canonical-equivalent sequences or the
> deletion of noncharacter code points.
Suppose I am implementing a process, any process, which "purports not to
modify the interpretation of a valid coded character representation" and
so must conform to C10. Since U+0000 is not a noncharacter code point,
my process must not delete U+0000, nor must it delete or ignore
characters which follow U+0000. If my process acts non-conformantly by
doing either of these things, it damages valid data, and creates a
security risk. My process therefore needs to store its data in a data
type which accepts U+0000 in the middle of a sequence. A NUL-terminated,
UTF-8 encoded C string is not such a type, because the 0x00 byte which
encodes U+0000 also marks the end of the string, and so it cannot be
used in a process conforming to C10. The Java type which people are
objecting to is such a type, and so can be used in a process conforming
to C10.
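
To make the difference concrete, here is a minimal sketch in C. It is
only an illustration under stated assumptions: the struct counted_utf8
and the helpers c_string_length and count_byte are hypothetical names
invented for this example, not part of any library or of anyone's
actual implementation. It contrasts what strlen reports for a buffer
containing U+0000 with what a length-carrying type can see.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type (an assumption for this sketch):
   the length travels with the bytes, so a 0x00 byte, which is the
   UTF-8 encoding of U+0000, is ordinary data rather than a terminator. */
struct counted_utf8 {
    const unsigned char *bytes;
    size_t len;
};

/* Length as a NUL-terminated C string sees it: strlen stops at the
   first 0x00 byte, so anything after U+0000 is invisible. */
size_t c_string_length(const char *s)
{
    assert(s != NULL);
    return strlen(s);
}

/* Count occurrences of one byte across the whole counted buffer,
   including the part that follows any U+0000. */
size_t count_byte(struct counted_utf8 s, unsigned char needle)
{
    size_t n = 0;
    for (size_t i = 0; i < s.len; i++) {
        if (s.bytes[i] == needle)
            n++;
    }
    return n;
}
```

Given the five bytes 'a', 'b', 0x00, '<', 'c', a scanner built on
c_string_length examines only the first two, whereas count_byte over
the counted buffer still finds the '<' which follows U+0000. The
second behaviour is what a process needs if it is to hand on the whole
sequence unmodified.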
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/