RE: discovering code points with embedded nulls

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Feb 05 2003 - 13:40:09 EST


    Erik Ostermueller wrote:
    > I'm dealing with an API that claims it doesn't support
    > unicode characters with embedded nulls.
    > I'm trying to figure out how much of a liability this is.

    If by "embedded nulls" they mean bytes of value zero, that library can
    *only* work with UTF-8. The other two UTF's cannot be supported in this way.
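
    A quick brute-force check makes the UTF-8 case concrete. A minimal
    Python 3 sketch (not part of the library in question, just standard
    codecs):

        # Brute force: which code points' UTF-8 form contains a 0x00 byte?
        # Surrogates U+D800..U+DFFF are skipped; they cannot be encoded.
        affected = [cp for cp in range(0x110000)
                    if not 0xD800 <= cp <= 0xDFFF
                    and b'\x00' in chr(cp).encode('utf-8')]
        print([hex(cp) for cp in affected])   # ['0x0'] -- only U+0000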

    But are you sure you understood correctly? Didn't they perhaps write
    "Unicode *strings* with embedded nulls"? In that case, they could have
    meant null *characters* inside strings: i.e., they don't support
    strings containing the Unicode character U+0000, because that code is
    used as a string terminator. If so, it would be a common and accepted
    limitation.
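
    If that is what they meant, the practical effect is easy to see. A
    small Python 3 sketch (the NUL-terminated consumer below stands in for
    any C-style API, which is an assumption about their library):

        s = "abc\u0000def"             # a perfectly valid 7-character string
        data = s.encode('utf-8')       # b'abc\x00def'
        # A NUL-terminated consumer (e.g. C's strlen) sees only this much:
        print(data.split(b'\x00')[0])  # b'abc' -- everything after is lost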

    > What is my best plan of attack for discovering precisely
    > which code points have embedded nulls
    > given a particular encoding? Didn't find it in the maillist archive.
    > I've googled for quite a while with no luck.

    The question doesn't make sense as stated, because the answer follows
    directly from each encoding's definition, with no data file needed.
    However:

    UTF-8: Only one character is affected (U+0000 itself);

    UTF-16: In range U+0000..U+FFFF (the Basic Multilingual Plane), there
    are of course exactly 511 code points affected (all those of the form
    U+00xx or U+xx00: 256 + 256, minus one for counting U+0000 twice), 484
    of which are actually assigned. However, eight of these code points
    (U+D800, U+D900, U+DA00, U+DB00, U+DC00, U+DD00, U+DE00, U+DF00) are
    high or low surrogates, which means that many characters in range
    U+010000..U+10FFFF are affected as well; the sketch after this list
    verifies both figures.

    UTF-32: All characters are affected, because the high byte of a UTF-32
    code unit is always 0x00 (no code point exceeds U+10FFFF).
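
    Both UTF-16 figures are easy to verify by brute force. A minimal
    Python 3 sketch ('utf-16-be' is used so that no byte-order mark is
    prepended to each result):

        # BMP code points whose single UTF-16 code unit contains 0x00.
        bmp = [cp for cp in range(0x10000)
               if not 0xD800 <= cp <= 0xDFFF   # lone surrogates: not encodable
               and b'\x00' in chr(cp).encode('utf-16-be')]
        # Supplementary characters whose surrogate pair contains 0x00.
        supp = [cp for cp in range(0x10000, 0x110000)
                if b'\x00' in chr(cp).encode('utf-16-be')]
        print(len(bmp))    # 503 = the 511 code points minus the 8 surrogates
        print(len(supp))   # 8176 supplementary characters are affected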

    > I'll want to do this for a few different versions of unicode
    > and a few different encodings.

    Most single-byte and double-byte encodings behave like UTF-8 in this
    respect (i.e., a zero byte occurs only in the encoding of U+0000
    itself).
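
    The same brute-force test works for any legacy encoding your platform
    knows. A Python 3 sketch (Shift_JIS is just an arbitrary example of a
    mixed single/double-byte encoding; any codec name Python recognizes
    will do):

        def zero_byte_code_points(encoding):
            """Code points whose encoded form contains a 0x00 byte."""
            affected = []
            for cp in range(0x110000):
                try:
                    if b'\x00' in chr(cp).encode(encoding):
                        affected.append(cp)
                except UnicodeEncodeError:
                    pass   # character not representable in this encoding
            return affected

        print([hex(cp) for cp in zero_byte_code_points('shift_jis')])
        # ['0x0'] -- like UTF-8, only U+0000 itself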

    > What if I write a program using some of the data files
    > available at unicode.org?
    > Am I crazy (I'm new at this stuff) or am I getting warm?
    > Perhaps this data file:
    > http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
    >
    > Algorithm:
    > INPUT: Name of unicode code point file
    > INPUT: Name of encoding (perhaps UTF-8)
    >
    > Read code point from file.
    > Expand code point to encoded format for the given encoding.
    > Test all constituent bytes for 0x00.
    > Goto next code point from file.

    That would be totally useless, I am afraid: which code points are
    affected follows from the encoding's definition alone, not from which
    characters happen to be assigned, so there is nothing to discover in
    UnicodeData.txt.

    The only UTF for which this count makes sense is UTF-8, and the result
    is "one".

    _ Marco


