Some of the replies on this subject were true, but incomplete.
The real reason that we didn't skip the codes that contain NULL bytes is
the C standard itself, which requires that wchar_t values corresponding
to char be zero-extended. That is, a char value of 0x61 ('a') must
correspond to a wchar_t value of 0x0061. This forces NULL bytes into the
encodings of the lowest 256 Unicode values.
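
A minimal sketch of that consequence, written in Python rather than C for brevity. It assumes a big-endian 16-bit encoding, and the helper name ucs2_be_bytes is made up for illustration; it is not taken from any library.

```python
def ucs2_be_bytes(ch: str) -> bytes:
    """Encode one BMP character as two big-endian bytes (hypothetical helper)."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("not representable in a single 16-bit unit")
    return code.to_bytes(2, "big")

# Zero-extending the 8-bit value 0x61 gives the 16-bit value 0x0061.
assert ord("a") == 0x61
print(ucs2_be_bytes("a").hex())  # 0061

# Every one of the lowest 256 code points has a zero high byte.
low_block = [ucs2_be_bytes(chr(c)) for c in range(0x100)]
print(all(b[0] == 0 for b in low_block))  # True
```

So once 0x61 has to map to 0x0061, zero bytes in the serialized form are unavoidable for that whole block.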
----- Original Message -----
From: Addison Phillips <AddisonP@simultrans.com>
To: Unicode List <email@example.com>
Cc: <firstname.lastname@example.org>; Niket Patwardhan <email@example.com>
Sent: Thursday, May 06, 1999 5:32 PM
Subject: RE: Question...
> Hi Joon,
> You are thinking in "narrow" (8-bit) terms!
> It's only a NULL because you're used to thinking of text as an array of
> bytes. If you think of Unicode text as an array of words then you'll see
> why there is only ONE null character in Unicode...
> In any case, by using a uniform architecture (all characters are 16 bits,
> at least those in the BMP are), all the tricks you used with char * work
> with wchar_t *. Your pointer arithmetic. Everything.
> It is messy to skip over every 256th code point just because half of the
> character is an ASCII NULL. If you're going to do UCS-2, then write your
> code to UCS-2 and you won't have this problem..
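
To illustrate the point about bytes versus words, here is a rough sketch in Python. It assumes big-endian UCS-2 data with a 16-bit terminator; byte_strlen and unit_strlen are hypothetical names that only mimic what a C strlen() and a 16-bit equivalent would do.

```python
import struct

def byte_strlen(buf: bytes) -> int:
    """Count bytes before the first zero byte, as a C strlen() would."""
    return buf.index(0) if 0 in buf else len(buf)

def unit_strlen(buf: bytes) -> int:
    """Count 16-bit big-endian units before the first 0x0000 unit."""
    n = len(buf) // 2
    units = struct.unpack(f">{n}H", buf[: n * 2])
    return units.index(0) if 0 in units else len(units)

# "Hi" followed by U+65E5, then a 16-bit terminator.
data = "Hi\u65e5".encode("utf-16-be") + b"\x00\x00"

print(byte_strlen(data))  # 0, the high byte of 'H' is already zero
print(unit_strlen(data))  # 3
```

Read as 16-bit units, the only "null" in the buffer is the terminator itself.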
> Jonathan Coxhead was correct to point out that IP will transmit a "NULL"
> as part of the data stream just fine. If the problem is that you are trying
> to disassemble IP packets on the far side by operating a byte-at-a-time on
> a char * array of data received, well, there are a number of workarounds to
> get you a wchar_t * or parse the packet properly so that this is not an
> issue. Working on Unicode data as if it were an array of single-byte crud
> is a recipe for disaster (and this is why many lexers break when doing a
> Unicode conversion project). Unlike some other character sets, Unicode is
> not stateful and is uniform throughout. Proper Unicode-aware code uses
> this to advantage.
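
One possible way to "parse the packet properly", sketched in Python. It assumes the payload is UCS-2 in big-endian (network) byte order and that chunks may split a 16-bit unit in the middle; decode_ucs2_chunks is an illustrative name, not an existing API.

```python
def decode_ucs2_chunks(chunks):
    """Reassemble big-endian 16-bit units from arbitrarily split byte chunks."""
    pending = b""
    text = []
    for chunk in chunks:
        pending += chunk
        usable = len(pending) // 2 * 2      # only whole 16-bit units
        for i in range(0, usable, 2):
            text.append(chr(int.from_bytes(pending[i:i + 2], "big")))
        pending = pending[usable:]          # keep a dangling half unit
    if pending:
        raise ValueError("stream ended in the middle of a code unit")
    return "".join(text)

payload = "A\u0100\u3200".encode("utf-16-be")   # 00 41 01 00 32 00
chunks = [payload[:3], payload[3:]]             # split inside U+0100
print(decode_ucs2_chunks(chunks) == "A\u0100\u3200")  # True
```

The zero bytes inside the payload are never inspected on their own, so they cannot be mistaken for terminators.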
> If you'd like, Bill or I can come over and show you what to do. If you're
> in Mountain View then we're like across the street.
> -----Original Message-----
> From: Joon Kang [mailto:firstname.lastname@example.org]
> Sent: jeudi 6 mai 1999 16:47
> To: Addison Phillips
> Cc: email@example.com; Yongchul Kang; Niket Patwardhan;
> Subject: Re: Question...
> Hello Addison,
> Thank you for answering my question.
> I am a Win-guru but our products support various platforms.
> Anyway, my original curiosity is this: since UCS2 has 65K code points, it
> could accommodate almost all Asian characters without using any one-byte
> NULL.
> To restate my question: why does UCS2 allow the one-byte NULL in its
> codeset design/implementation?
> Is it simply to provide codepoint integrity, or ?
> Thanks in advance for your kind reply.
> International Engineering at Verity, Inc.
> ----- Original Message -----
> From: Addison Phillips <AddisonP@simultrans.com>
> To: <firstname.lastname@example.org>
> Cc: <email@example.com>
> Sent: Thursday, May 06, 1999 4:17 PM
> Subject: RE: Question...
> >Hi Joon,
> >By "NULL represent a character", do you mean the character NULL, 0x0000? Or
> >do you mean NULL as half of a character (e.g. one byte of NULL such as
> >or 0x3200)?
> >If the former, the NULL character gives compatibility for programs written
> >in C that use NULL as a string terminator.
> >If the latter, NULL represents half of the character and cannot (well,
> >should not) be eliminated without changing the encoding.
> >If you wish to send data that contains no nulls (other than string
> >terminators), consider converting your data to UTF-8 or UTF-7. These
> >particular encodings were created for file-system and mail-system safety
> >in particular and have the advantage of containing no NULLs. UTF-8 is
> >often used for compatibility purposes such as this.
> >The downside of using these encodings is that, while they "compress"
> >UNICODE data, your Asian characters will average MORE THAN 2 bytes per
> >character.
> >The encoding functions for UTF-7 and UTF-8 are out on the web and both
> >are small functions.
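
The trade-off can be checked with a short sketch using Python's built-in codecs. The sample string is made up, and "utf-16-be" stands in for UCS-2 here on the assumption that the text stays within the BMP.

```python
text = "abc\u65e5\u672c\u8a9e"      # three ASCII letters, three CJK ideographs

utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-be")     # same bytes as UCS-2 for BMP-only text

print(0 in utf8)                    # False: no zero bytes in the UTF-8 form
print(0 in ucs2)                    # True: 'a' becomes 00 61
print(len("\u65e5".encode("utf-8")))  # 3 bytes for one ideograph
print(len(utf8), len(ucs2))         # 12 12
```

Only U+0000 itself produces a zero byte in UTF-8, so a zero byte there can still serve as a terminator.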
> >The definition and encoding for UCS-2 is on Unicode's web site (well,
> >conversion tables are... the definition is long and you should buy the
> >book). You can find some sample code that might help on our FTP site
> >(ftp://ftp.simultrans.com/anonymous). You're not going to write your own
> >character converter, though, are you? I thought you guys were a Windows
> > __________________________________________
> > Addison Phillips
> > Director, Globalization Services
> > SimulTrans, L.L.C.
> > 2606 Bayshore Parkway
> > Mountain View, California 94043 USA
> > +1 650-526-4652 (direct telephone)
> > +1 650-969-9959 (facsimile)
> > AddisonP@simultrans.com (Internet email)
> > http://www.simultrans.com (website)
> > "22 languages. One release date."
> > __________________________________________