RE: Question...

From: Addison Phillips
Date: Thu May 06 1999 - 20:38:39 EDT

Hi Joon,

You are thinking in "narrow" (8-bit) terms!

It's only a NULL because you're used to thinking of text as an array of
bytes. If you think of Unicode text as an array of words then you'll see
why there is only ONE null character in Unicode...

In any case, by using a uniform architecture (all characters are 16 bits,
at least those in the BMP are), all the tricks you used with char * work
with wchar_t *. Your pointer arithmetic. Everything.
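
A minimal sketch of the "array of words" view, assuming the text is held as
16-bit units (uint16_t is an assumption here, chosen because the width of
wide-character types varies by platform). The name ucs2_len is hypothetical.
It stops only at the one true null, 0x0000, and walks straight past units
such as 0x3200 or 0x0032 that merely contain a zero byte:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count 16-bit units up to the terminating 0x0000.
   A unit like 0x3200 or 0x0032 has a zero byte in it, but it is a whole
   character and not a terminator. */
size_t ucs2_len(const uint16_t *s)
{
    const uint16_t *p = s;

    while (*p != 0x0000) {
        p++;    /* pointer arithmetic advances one full 16-bit unit */
    }
    return (size_t)(p - s);
}
```

This is the same loop you would write for a char * string; only the unit
type changed.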

It is messy to skip over every 256th code point just because half of the
character is an ASCII NULL. If you're going to do UCS-2, then write your
code to UCS-2 and you won't have this problem.

Jonathan Coxhead was correct to point out that IP will transmit a "NULL" as
part of the data stream just fine. If the problem is that you are trying to
disassemble IP packets on the far side by operating a byte-at-a-time on a
char * array of data received, well, there are a number of workarounds that
get you a wchar_t * or parse the packet properly so that this is not an
issue. Working on Unicode data as if it were an array of single-byte crud
is a recipe for disaster (and this is why many lexers break when doing a
Unicode conversion project). Unlike some other character sets, Unicode is
not stateful and is uniform throughout. Proper Unicode-aware code uses this
to advantage.
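
One such workaround might look like the sketch below. It assumes the
received bytes are big-endian UCS-2 and that the byte count is known from
the packet itself; the name ucs2_from_be_bytes and its parameters are
hypothetical. The point is that the length comes from the byte count, never
from hunting for a zero byte:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: turn nbytes of big-endian UCS-2 from a receive
   buffer into 16-bit units. Returns the number of units written.
   Assumption: big-endian byte order; real data may use the other order
   or carry a byte order mark. */
size_t ucs2_from_be_bytes(const unsigned char *buf, size_t nbytes,
                          uint16_t *out, size_t out_cap)
{
    size_t n = nbytes / 2;      /* a trailing odd byte is ignored */
    size_t i;

    if (n > out_cap) {
        n = out_cap;            /* never write past the output array */
    }
    for (i = 0; i < n; i++) {
        /* high byte first, then low byte; a zero in either half is data */
        out[i] = (uint16_t)((buf[2 * i] << 8) | buf[2 * i + 1]);
    }
    return n;
}
```

After this step the rest of the code can work on whole 16-bit characters.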

If you'd like, Bill or I can come over and show you what to do. If you're in
Mountain View then we're like across the street.



-----Original Message-----
From: Joon Kang
Sent: Thursday, May 6, 1999 16:47
To: Addison Phillips
Cc: Yongchul Kang; Niket Patwardhan
Subject: Re: Question...

Hello Addison,

Thank you for answering my question.
I am a Win-guru but our products support various platforms.

Anyway, my original curiosity was this: since UCS-2 has 65K code points, it
could accommodate almost all Asian characters without using a one-byte NULL.
To restate my question: why does UCS-2 allow a one-byte NULL in its
codeset design/implementation?
Is it simply to preserve code point integrity, or is there another reason?

Thanks in advance for your kind reply.
International Engineering at Verity, Inc.

----- Original Message -----
From: Addison Phillips
Sent: Thursday, May 06, 1999 4:17 PM
Subject: RE: Question...

>Hi Joon,
>By "NULL represent a character", do you mean the NULL character 0x0000? Or
>do you mean NULL as half of a character (e.g. one NULL byte, as in 0x0032
>or 0x3200)?
>If the former, the NULL character gives compatibility for programs written
>in C that use NULL as a string terminator.
>If the latter, NULL represents half of the character and cannot (well,
>should not) be eliminated without changing the encoding.
>If you wish to send data that contains no nulls (other than string
>terminators), consider converting your data to UTF-8 or UTF-7. These
>encodings were created for file-system and mail-system safety in
>particular and have the advantage of containing no NULLs. UTF-8 is commonly
>used for compatibility purposes such as this.
>The downside of using these encodings is that, while they "compress"
>UNICODE data, your Asian characters will average MORE THAN 2 bytes per
>character.
>The encoding functions for UTF-7 and UTF-8 are out on the web and both are
>small functions.
>The definition and encoding for UCS-2 is on Unicode's web site (well,
>conversion tables are... the definition is long and you should buy the
>book). You can find some sample code that might help on our FTP site
>( You're not going to write your own
>character converter, though, are you? I thought you guys were a Windows
> __________________________________________
> Addison Phillips
> Director, Globalization Services
> SimulTrans, L.L.C.
> 2606 Bayshore Parkway
> Mountain View, California 94043 USA
> +1 650-526-4652 (direct telephone)
> +1 650-969-9959 (facsimile)
> (Internet email)
> (website)
> "22 languages. One release date."
> __________________________________________
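
On the UTF-8 suggestion in the quoted message: a minimal sketch of encoding
one BMP code point as UTF-8, using the standard 1-, 2- and 3-byte forms. The
name utf8_from_bmp is hypothetical, and surrogate values (0xD800 to 0xDFFF)
are not given special treatment here. It shows both halves of the trade-off:
no output byte is zero unless the input is 0x0000, and code points from
0x0800 up (which includes the CJK ranges) take three bytes instead of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one BMP code point as UTF-8.
   Returns the number of bytes written to out (1, 2 or 3).
   Assumption: c is not a surrogate; this sketch does not check. */
size_t utf8_from_bmp(uint16_t c, unsigned char out[3])
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;                  /* ASCII: 1 byte */
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));  /* lead byte */
        out[1] = (unsigned char)(0x80 | (c & 0x3F)); /* continuation */
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

Every lead and continuation byte has its high bit set, which is why a zero
byte can only come from the code point 0x0000 itself.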
