RE: discovering code points with embedded nulls

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Feb 05 2003 - 13:40:09 EST

Next message: Otto Stolz: "Re: discovering code points with embedded nulls"

Previous message: Rick Cameron: "RE: discovering code points with embedded nulls"
Maybe in reply to: Erik.Ostermueller@alltel.com: "discovering code points with embedded nulls"
Next in thread: Otto Stolz: "Re: discovering code points with embedded nulls"

Erik Ostermueller wrote:
> I'm dealing with an API that claims it doesn't support
> unicode characters with embedded nulls.
> I'm trying to figure out how much of a liability this is.

If by "embedded nulls" they mean bytes of value zero, that library can
*only* work with UTF-8. The other two UTFs (UTF-16 and UTF-32) cannot be
supported in this way, because they produce zero bytes for ordinary characters.

But are you sure you understood clearly? Didn't they perhaps write "Unicode
*strings* with embedded nulls"? In that case they could have meant null
*characters* inside strings, i.e., they don't support strings containing the
Unicode character U+0000, because that code is used as a string terminator.
In this case, it would be a common and accepted limitation.
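
A minimal Python sketch of both readings, using the standard ctypes module
to mimic how a C-style API sees a zero-terminated buffer. The sample strings
and variable names are illustrative assumptions, not anything taken from the
API in question.

```python
import ctypes

# A string that contains the character U+0000 in the middle.
text = "abc\u0000def"

# UTF-8: the only zero byte is the one that encodes U+0000 itself.
utf8 = text.encode("utf-8")
buf = ctypes.create_string_buffer(utf8)
utf8_seen = buf.value   # C-string view: stops at the first zero byte
utf8_full = buf.raw     # whole buffer, including the trailing terminator

# UTF-16: even plain ASCII text is full of zero bytes, so a byte-oriented
# C-string view is cut off after the first code unit.
utf16 = "abc".encode("utf-16-le")
utf16_seen = ctypes.create_string_buffer(utf16).value

print(utf8_seen, utf8_full, utf16_seen)
```

With UTF-8 the text is only truncated because it really contains U+0000;
with UTF-16 the truncation happens on perfectly ordinary text.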

> What is my best plan of attack for discovering precisely
> which code points have embedded nulls
> given a particular encoding? Didn't find it in the maillist archive.
> I've googled for quite a while with no luck.

The question doesn't make sense. However:

UTF-8: Only one character is affected (U+0000 itself);

UTF-16: In the range U+0000..U+FFFF (Basic Multilingual Plane), exactly 511
code points are affected (all those of the form U+00xx or U+xx00), 484 of
which are actually assigned. However, eight of these code points (U+D800,
U+D900, ... U+DF00) are high or low surrogates, which means that many
characters in the range U+10000..U+10FFFF are affected as well.

UTF-32: All characters are affected, because the most significant byte of a
UTF-32 code unit is always 0x00.
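
The three cases can be checked by brute force. The following is a sketch in
Python, assuming the built-in codecs "utf-8", "utf-16-be" and "utf-32-be"
(big-endian variants are used only to avoid a byte order mark; byte order
does not change whether a zero byte is present). The helper names are made
up for this example.

```python
def has_zero_byte(cp: int, encoding: str) -> bool:
    """Return True if encoding the code point yields at least one 0x00 byte."""
    return 0 in chr(cp).encode(encoding)

def is_surrogate(cp: int) -> bool:
    """Surrogate code points cannot be encoded on their own."""
    return 0xD800 <= cp <= 0xDFFF

# UTF-8: scan every encodable code point.
utf8_hits = [cp for cp in range(0x110000)
             if not is_surrogate(cp) and has_zero_byte(cp, "utf-8")]

# UTF-16, BMP: encodable characters whose single code unit has a zero byte.
bmp_hits = [cp for cp in range(0x10000)
            if not is_surrogate(cp) and has_zero_byte(cp, "utf-16-be")]

# Surrogate code units of the form xx00 (their high byte is never zero).
surrogate_units = [cp for cp in range(0xD800, 0xE000) if cp & 0xFF == 0]

# UTF-16, supplementary planes: characters whose surrogate pair has a zero byte.
supplementary_hits = sum(1 for cp in range(0x10000, 0x110000)
                         if has_zero_byte(cp, "utf-16-be"))

# UTF-32: every encodable code point should contain a zero byte.
utf32_all = all(has_zero_byte(cp, "utf-32-be")
                for cp in range(0x110000) if not is_surrogate(cp))

print(len(utf8_hits), len(bmp_hits) + len(surrogate_units),
      supplementary_hits, utf32_all)
```

The BMP figure splits into 503 encodable characters plus 8 surrogate code
units, which together make up the 511. Those 8 surrogates in turn account
for 8176 affected code points in the supplementary planes.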

> I'll want to do this for a few different versions of unicode
> and a few different encodings.

Most single- and double-byte encodings behave like UTF-8 (i.e., a zero byte
appears only in the encoding of U+0000 itself).
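
A quick way to test this for a given legacy encoding, sketched in Python.
The list of encodings is an arbitrary sample chosen for illustration (the
names are Python codec names), and the scan is limited to the BMP; code
points that an encoding cannot represent are simply skipped.

```python
# Arbitrary sample of single- and double-byte encodings (Python codec names).
ENCODINGS = ["latin-1", "cp1252", "shift_jis", "gbk", "euc_kr"]

def zero_byte_code_points(encoding: str) -> list:
    """List the BMP code points whose encoded form contains a 0x00 byte."""
    hits = []
    for cp in range(0x10000):
        try:
            data = chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding
        if 0 in data:
            hits.append(cp)
    return hits

results = {enc: zero_byte_code_points(enc) for enc in ENCODINGS}
for enc, hits in results.items():
    print(enc, [f"U+{cp:04X}" for cp in hits])
```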

> What if I write a program using some of the data files
> available at unicode.org?
> Am I crazy (I'm new at this stuff) or am I getting warm?
> Perhaps this data file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
>
> Algorithm:
> INPUT: Name of unicode code point file
> INPUT: Name of encoding (perhaps UTF-8)
>
> Read code point from file.
> Expand code point to encoded format for the given encoding.
> Test all constituent bytes for 0x00.
> Goto next code point from file.

That would be totally useless, I am afraid.
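
Still, for anyone who wants to see it run, the quoted algorithm can be
sketched in a few lines of Python. This is only an illustration: the three
sample lines are a tiny excerpt in the UnicodeData.txt format (semicolon
separated, code point in hex as the first field), the function name is made
up, and range entries marked "First"/"Last" in the real file are not
expanded here.

```python
# Tiny excerpt in UnicodeData.txt format; a real run would read the file.
SAMPLE = """\
0000;<control>;Cc;0;BN;;;;;N;NULL;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0100;LATIN CAPITAL LETTER A WITH MACRON;Lu;0;L;0041 0304;;;;N;LATIN CAPITAL LETTER A MACRON;;;0101;
"""

def affected(lines, encoding):
    """Return the listed code points whose encoding contains a 0x00 byte."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cp = int(line.split(";")[0], 16)   # first field: code point in hex
        if 0xD800 <= cp <= 0xDFFF:
            continue                       # surrogates cannot be encoded alone
        if 0 in chr(cp).encode(encoding):
            hits.append(cp)
    return hits

for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, [f"U+{cp:04X}" for cp in affected(SAMPLE.splitlines(), enc)])
```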

The only UTF for which this count makes sense is UTF-8, and the result is
"one".

_ Marco

This archive was generated by hypermail 2.1.5 : Wed Feb 05 2003 - 14:20:42 EST