From: Doug Ewell (email@example.com)
Date: Sun Nov 14 2004 - 23:48:22 CST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> Nulls are legal Unicode characters, also for use in plain text and
> since ever in ASCII, and all ISO 8-bit charset standards. Why do you
> want that a legal Unicode string containing NULL (U+0000) *characters*
> become illegal when converted to C strings?
Because it wasn't valid in C before? Because C programmers rely on
U+0000 as an end-of-string indicator? This is about C strings, not
Unicode strings.
> A null *CHARACTER* is valid in a C string, because C does not mandate
> the string encoding (which varies according to locale conventions at
> run time). It just assigns a special role to the null *BYTE* as an
> end-of-string marker.
The standard says in section 5.2.1, "A byte with all bits set to 0,
called the null character, shall exist in the basic execution character
set; it is used to terminate a character string." Obviously this
presumes that bytes and characters are the same, even while section
5.2.1.2 goes on to describe the behavior of multibyte characters. But
it clearly does not provide any apparatus by which the "null character"
can be divorced from the "null byte," such that one is valid and the
other is not.
> There are many reasons why one would want to store null *characters*
> in C strings, using a proper escaping mechanism (a transport syntax
> like transforming the 00 generated by UTF-8 into the pair C0 80) or an
> encoding scheme (UTF-8 does not fit here, one needs another scheme
> like the Sun modified version).
As soon as you can think of one, let me know. I can think of plenty of
*binary* protocols that require zero bytes, but no *text* protocols.
Peter Kirk <peterkirk at qaya dot org> wrote:
> A string of Unicode characters (including control characters as well
> as text) may consist of any valid Unicode character, and U+0000 is (for
> better or for worse) a valid Unicode character. Therefore some such
> escape mechanism is required to represent an arbitrary string of
> Unicode characters (in a UTF-8-lookalike representation) in a way
> compatible with C string handling.
This has nothing to do with whether U+0000 is a valid Unicode character,
or whether a string containing U+0000 is a valid Unicode string. Of
course it is. But the convention in C is to treat it as an end-of-string
marker.
> Otherwise what would happen? Would it be acceptable for Java programs
> to crash, or even throw error messages, if presented with Unicode
> strings including U+0000?
Peter, what do you think? Is that what I said? I said it should signal
the end of the string, as it does in C.
Perhaps a more suitable design for Java, one more in keeping with the
design of Unicode, would have been to terminate strings with the
noncharacter code point U+FFFF. That would have made any special
handling of U+0000 unnecessary.
This is becoming less and less important to me personally, as I spend
most of my programming time using C++ with MFC (which has a CString
type, whose implementation I generally don't care about) or C# (which
has a built-in String type, whose implementation I generally don't care
about). What worries me is the confusion and security hole implicit in
having two different representations of U+0000, one whose bytewise
representation contains the byte 0x00 and thus terminates a string, and
another which does not and thus does not.
This archive was generated by hypermail 2.1.5 : Sun Nov 14 2004 - 23:50:05 CST