RE: discovering code points with embedded nulls

From: Erik.Ostermueller@alltel.com
Date: Wed Feb 05 2003 - 14:12:39 EST


    I'm replying to myself here.
    Thank you all for so many quick and helpful responses.

    As most of you pointed out, I misread the documentation, which covers multi-byte strings only (not wide strings).
    So I was brain dead when I asked about encodings other than UTF-8.

    The doc states (in a number of places):
    "This function is incompatible with cs strings with embedded nulls. This function may be incompatible with cs MBCS strings."

    To me, the doc suggests that someone out there might want to pass UTF-8 data with an embedded null
    that is used for something other than terminating the string.

    What I'm hearing from you all is that a null in UTF-8 is for termination and termination only.
    Is this correct?

    Thanks again,

    --Erik

    > -----Original Message-----
    > From: Ostermueller, Erik
    > Sent: Wednesday, February 05, 2003 10:43 AM
    > To: unicode@unicode.org
    > Subject: discovering code points with embedded nulls
    >
    >
    > Hello, all.
    >
    > I'm dealing with an API that claims it doesn't support
    > Unicode characters with embedded nulls.
    > I'm trying to figure out how much of a liability this is.
    >
    > What is my best plan of attack for discovering
    > precisely which code points have embedded nulls
    > given a particular encoding? I didn't find it in the
    > mailing list archive.
    > I've googled for quite a while with no luck.
    >
    > I'll want to do this for a few different versions of
    > Unicode and a few different encodings.
    > What if I write a program using some of the data files
    > available at unicode.org?
    > Am I crazy (I'm new at this stuff) or am I getting warm?
    > Perhaps this data file:
    > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
    >
    > Algorithm:
    > INPUT: Name of Unicode code point file
    > INPUT: Name of encoding (perhaps UTF-8)
    >
    > Read code point from file.
    > Expand code point to encoded format for the given encoding.
    > Test all constituent bytes for 0x00.
    > Goto next code point from file.
    >
    > Thanks in advance for any help,
    >
    > --Erik O.
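
The algorithm in the quoted message above could be sketched roughly as follows. This is only an illustration, not a tested tool: it assumes Python 3, the function names `parse_code_points` and `code_points_with_null_bytes` are made up for this sketch, and the encoding names are Python codec names rather than anything the API documentation defines. It also assumes the semicolon-separated layout of UnicodeData.txt, where the first field is the code point in hex; large ranges appear there only as First/Last entries, which this sketch does not expand.

```python
def parse_code_points(lines):
    """Yield the code point from the first field of each UnicodeData.txt-style line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        first_field = line.split(";", 1)[0]
        yield int(first_field, 16)


def code_points_with_null_bytes(code_points, encoding):
    """Return the code points whose encoded form contains a 0x00 byte."""
    hits = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded on their own
        encoded = chr(cp).encode(encoding)
        # Test all constituent bytes for 0x00.
        if b"\x00" in encoded:
            hits.append(cp)
    return hits


if __name__ == "__main__":
    sample = [
        "0000;<control>;Cc",
        "0041;LATIN CAPITAL LETTER A;Lu",
        "0100;LATIN CAPITAL LETTER A WITH MACRON;Lu",
        "0101;LATIN SMALL LETTER A WITH MACRON;Ll",
    ]
    cps = list(parse_code_points(sample))
    # "utf-16-be" is used so that no byte order mark is prepended.
    for enc in ("utf-8", "utf-16-be"):
        found = code_points_with_null_bytes(cps, enc)
        print(enc, [f"U+{cp:04X}" for cp in found])
```

To run it against the real data file, the lines of a local copy of UnicodeData.txt can be passed to `parse_code_points` in place of the small sample list.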


