Re: discovering code points with embedded nulls

From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Wed Feb 05 2003 - 13:43:25 EST

    Erik.Ostermueller@alltel.com wrote:

    > I'm dealing with an API that claims it doesn't support unicode characters with embedded nulls.
    ...

    > Test all constituent bytes for 0x00.

    This depends on the encoding form you are using (and the API is expecting):

    - UTF-8 encodes a Unicode string into a sequence of bytes;
       this sequence contains no 0x00 bytes, unless the string itself
       contains U+0000, which is encoded as the single byte 0x00.
       Btw., ASCII characters are encoded the same way as in ASCII.

    - UTF-16 encodes a Unicode string into a sequence of 16-bit units,
       hence it makes no sense to look at this encoding bytewise.
       If you nevertheless treat a 16-bit unit as a sequence of two bytes
       (repeat: this is a no-no), then you will most probably find
       0x00 bytes therein; in particular, every ASCII character is
       encoded as a sequence of the respective ASCII byte and a 0x00 byte
       (both byte orders are possible, cf.
       <http://www.unicode.org/faq/utf_bom.html>).

    - UTF-32 encodes a Unicode string into a sequence of 32-bit units,
       hence it makes no sense to look at this encoding bytewise.
       If you nevertheless treat a 32-bit unit as a sequence of four bytes
       (repeat: this is a no-no), then you will certainly find
       0x00 bytes therein; in particular, every ASCII character is
       encoded as a sequence of the respective ASCII byte and three
       0x00 bytes.
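
    To make the three cases concrete, here is a minimal Python sketch
    that serializes one sample string in each encoding and counts the
    0x00 bytes. The helper name zero_bytes and the sample string are
    made up for illustration; the explicit "-le"/"-be" codec names are
    used so that no byte order mark is added to the output.

```python
def zero_bytes(text, encoding):
    """Count the 0x00 bytes in text when serialized with the given encoding."""
    return text.encode(encoding).count(0)

# Hypothetical sample: 'A' (U+0041), e-acute (U+00E9), euro sign (U+20AC).
# The string itself contains no U+0000.
sample = "A\u00e9\u20ac"

for enc in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    # Print the codec name, the serialized bytes in hex, and the 0x00 count.
    print(enc, sample.encode(enc).hex(), zero_bytes(sample, enc))
```

    For this sample, the UTF-8 count is zero, while the UTF-16 and
    UTF-32 serializations contain 0x00 bytes even though no character
    in the string is U+0000. A bytewise test for 0x00 is therefore only
    meaningful if the API really expects UTF-8.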

    Best wishes,
       Otto Stolz


