Re: CodePage Information

From: Doug Ewell
Date: Fri May 23 2003 - 00:05:16 EDT

    Philippe Verdy <verdy_p at wanadoo dot fr> wrote:

    > The main reason why the 0x00 byte causes problems is because it is
    > most often used as a string terminator, unlike what ASCII or Unicode
    > defines for the NULL character. In this case, one cannot encode it
    > because the device or protocol does not support sending a separate
    > length specifier and needs the 0x00 to terminate the string, and thus
    > a NULL character in a Unicode string could not be encoded even if it's
    > needed.

    Everything Ken said about the advisability, and the past and present
    permissibility, of using non-shortest UTF-8 is true.
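
    As a concrete aside (not from Ken's message), the classic non-shortest
    case is the overlong two-byte sequence 0xC0 0x80: it decodes
    arithmetically to U+0000 while avoiding a literal 0x00 byte, which is
    exactly the trick Java's "modified UTF-8" serialization uses, even
    though the sequence is ill-formed under the standard UTF-8 definition.
    A minimal C sketch of the arithmetic:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Shortest-form UTF-8 encodes U+0000 as the single byte 0x00.
         * The overlong pair 0xC0 0x80 decodes to the same code point,
         * but is ill-formed per the UTF-8 definition. */
        unsigned char overlong[] = { 0xC0, 0x80 };
        unsigned cp = ((unsigned)(overlong[0] & 0x1Fu) << 6)
                    | (unsigned)(overlong[1] & 0x3Fu);
        printf("decoded code point: U+%04X\n", cp);   /* U+0000 */

        /* Because neither byte is 0x00, C's string functions do not
         * mistake the pair for a terminator: */
        unsigned char buf[] = { 'a', 0xC0, 0x80, 'b', 0x00 };
        printf("strlen = %zu\n", strlen((char *)buf)); /* 4, not 1 */
        return 0;
    }
    ```

    A strict UTF-8 decoder must reject 0xC0 0x80; only systems that
    deliberately accept the non-shortest form get the "NUL without a
    zero byte" behavior shown above.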

    I'd like to ask a different question, one that steps away from Unicode
    for a minute and addresses the broader concept of text storage and
    processing:

        What real-world situations call for a NULL character to be stored
        as part of a text string, in conflict with its use in the C
        language (etc.) as a string terminator?

    Basically you are making the claim that 0x00 might be used not only as a
    string terminator (not part of the string per se) but also for some
    other purpose WITHIN the string, so that the two uses of 0x00 need to be
    distinguished. But what other uses of 0x00 are there within a string?
    I can't think of any.
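
    To illustrate the collision (a minimal sketch, not from the original
    message): any 0x00 byte embedded in the data is indistinguishable from
    the terminator, so the standard C string functions silently truncate,
    and only an out-of-band length recovers the full content:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* A buffer that "contains" a NUL in the middle: */
        char data[] = { 'h', 'i', '\0', 'y', 'o', '\0' };

        /* strlen() stops at the first 0x00, so the second half
         * is invisible to every NUL-terminated string function: */
        printf("strlen sees %zu bytes\n", strlen(data));     /* 2 */

        /* Only an explicit length specifier sees everything: */
        size_t len = sizeof data;
        printf("actual buffer holds %zu bytes\n", len);      /* 6 */
        return 0;
    }
    ```

    This is exactly why a protocol with no separate length field cannot
    carry an in-band NUL: the terminator role always wins.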

    There's a reason why neither Unicode nor any other coded character set
    (including the ISO 2022 mechanism) assigns a specific function to 0x00.
    It is too valuable in its role as a NULL character.

    Of course, an arbitrary binary stream might well contain 0x00 bytes, but
    then it would not be appropriate, for a variety of reasons, to attempt
    to perform text processing functions on such a stream.

    -Doug Ewell
     Fullerton, California

    This archive was generated by hypermail 2.1.5 : Fri May 23 2003 - 00:50:54 EDT