From: Doug Ewell (firstname.lastname@example.org)
Date: Fri May 23 2003 - 00:05:16 EDT
Philippe Verdy <verdy_p at wanadoo dot fr> wrote:
> The main reason why the 0x00 byte causes problems is because it is
> most often used as a string terminator, unlike what ASCII or Unicode
> defines for the NULL character. In this case, one cannot encode it
> because the device or protocol does not support sending a separate
> length specifier and needs the 0x00 to terminate the string, and thus
> a NULL character in a Unicode string could not be encoded even if it's
> [...]
Everything Ken said about the advisability, and the past and present
permissibility, of using non-shortest UTF-8 is true.
I'd like to ask a different question, one that steps away from Unicode
for a minute and addresses the broader concept of text storage and
processing:

    What real-world situations call for a NULL character to be stored
    as part of a text string, in conflict with its use in the C
    language (etc.) as a string terminator?
Basically you are making the claim that 0x00 might be used not only as a
string terminator (not part of the string per se) but also for some
other purpose WITHIN the string, so that the two uses of 0x00 need to be
distinguished. But what other uses of 0x00 are there within a string?
I can't think of any.
There's a reason why neither Unicode nor any other coded character set
(including the ISO 2022 mechanism) assigns a specific function to 0x00.
It is too valuable in its role as a NULL character.
Of course, an arbitrary binary stream might well contain 0x00 bytes, but
then it would not be appropriate, for a variety of reasons, to attempt
to perform text processing functions on such a stream.
This archive was generated by hypermail 2.1.5 : Fri May 23 2003 - 00:50:54 EDT