RE: discovering code points with embedded nulls

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Feb 05 2003 - 15:36:33 EST


    Erik followed up:

    > From what I'm hearing from you all is that a null
    > in UTF-8 is for termination and termination only.
    > Is this correct?

    Not quite. A null byte (0x00) in UTF-8 is only a
    representation of the NULL character (U+0000). It can
    be present in UTF-8 for whatever purposes one might use
    a NULL in textual data.

    One very common usage of NULL is as a convention for
    string termination. If you are using NULLs that way,
    then of course any API which depends on that convention
    will have a problem with NULL characters embedded *in*
    the string for other reasons, since it will prematurely
    detect end-of-string in its processing.
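
To make the premature end-of-string problem concrete, here is a minimal sketch in Python. It assumes the standard `ctypes` module as a stand-in for a NULL-terminated C string API; the sample text and variable names are made up for illustration.

```python
import ctypes

# U+0000 encodes to a single 0x00 byte in UTF-8.
data = "abc\u0000def".encode("utf-8")

# create_string_buffer copies the bytes and appends a terminating 0x00,
# giving a buffer laid out the way a C string would be.
buf = ctypes.create_string_buffer(data)

full = buf.raw      # every byte in the buffer, including the terminator
c_view = buf.value  # bytes up to the first 0x00, as a C string API reads them

print(len(data), len(full), c_view)
```

The `.value` view stops at the embedded 0x00, so "def" is invisible to anything that relies on the termination convention, even though the bytes are still in the buffer.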

    If your string termination convention does *not* use
    NULL (but instead some other mechanism such as explicit
    length attributes), then there is no inherent reason why
    you could not use NULLs embedded in the string for some
    other purpose -- for example, to delimit fielded data
    within the string. In such cases, if your Unicode data
    is represented in the UTF-8 encoding form, then those
    NULLs will end up as embedded 0x00 bytes, because that
    is how NULL characters are represented in UTF-8.
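
As a rough sketch of the length-based alternative, the Python below uses NULL as a field delimiter and carries an explicit byte length instead of a terminator. The field values are hypothetical. Python's `bytes` and `str` types already track their own length, which is why embedded NULLs cause no trouble here.

```python
# Hypothetical fielded record; the values are illustrative only.
fields = ["id-42", "Zoë", "Zürich"]

# Join the fields with U+0000 as the delimiter, then encode as UTF-8.
text = "\u0000".join(fields)
encoded = text.encode("utf-8")

# The explicit length travels with the data in place of a terminator.
length = len(encoded)

# Each delimiter becomes exactly one 0x00 byte. Multi-byte UTF-8
# sequences use only bytes >= 0x80, so 0x00 never occurs inside them.
null_count = encoded.count(0)

# Decoding by length and splitting on U+0000 recovers the fields.
decoded = encoded[:length].decode("utf-8").split("\u0000")

print(length, null_count, decoded)
```

Because 0x00 only ever stands for U+0000 in UTF-8, splitting on it cannot cut a multi-byte character in half.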

    --Ken


