Re: Question...

From: Mark Davis (mark@macchiato.com)
Date: Sat May 08 1999 - 21:23:46 EDT


Some of the replies on this subject were true, but incomplete.

The real reason that we didn't skip the codes that contained NULL bytes is
the C standard itself, which requires that wchar_t values corresponding to
char be zero-extended. That is, a char value of 0x61 ('a') must correspond
to a wchar_t value of 0x0061. This forces a NULL byte into each of the
lowest 256 Unicode values.

Mark

----- Original Message -----
From: Addison Phillips <AddisonP@simultrans.com>
To: Unicode List <unicode@unicode.org>
Cc: <unicode@unicode.org>; Niket Patwardhan <npatward@verity.com>
Sent: Thursday, May 06, 1999 5:32 PM
Subject: RE: Question...

> Hi Joon,
>
> You are thinking in "narrow" (8-bit) terms!
>
> It's only a NULL because you're used to thinking of text as an array of
> bytes. If you think of Unicode text as an array of words then you'll see
> why there is only ONE null character in Unicode...
>
> In any case, by using a uniform architecture (all characters are 16 bits,
> at least those in the BMP are), all the tricks you used with char * work
> with wchar_t *. Your pointer arithmetic. Everything.
>
> It is messy to skip over every 256th code point just because half of the
> character is an ASCII NULL. If you're going to do UCS-2, then write your
> code to UCS-2 and you won't have this problem.
>
> Jonathan Coxhead was correct to point out that IP will transmit a "NULL"
> as part of the data stream just fine. If the problem is that you are
> trying to disassemble IP packets on the far side by operating a
> byte-at-a-time on a char * array of data received, well, there are a
> number of workarounds that get you a wchar_t * or parse the packet
> properly so that this is not an issue. Working on Unicode data as if it
> were an array of single-byte crud is a recipe for disaster (and this is
> why many lexers break when doing a Unicode conversion project). Unlike
> some other character sets, Unicode is not stateful and is uniform
> throughout. Proper Unicode-aware code uses this to advantage.
>
> If you'd like, Bill or I can come over and show you what to do. If you're
> in Mountain View then we're like across the street.
>
> Thanks,
>
> Addison
>
> -----Original Message-----
> From: Joon Kang [mailto:ykang@verity.com]
> Sent: jeudi 6 mai 1999 16:47
> To: Addison Phillips
> Cc: unicode@unicode.org; Yongchul Kang; Niket Patwardhan;
> airwolf@verity.com
> Subject: Re: Question...
>
>
> Hello Addison,
>
> Thank you for answering my question.
> I am a Win-guru but our products support various platforms.
>
> Anyway, my original curiosity comes from this: since UCS2 has 65K code
> points, it could accommodate almost all Asian characters without any
> code containing a one-byte NULL.
> If I restate my question, then: why does UCS2 allow the one-byte NULL in
> its codeset design/implementation?
> Is it simply to provide codepoint integrity, or is there another reason?
>
> Thanks in advance for your kind reply.
> -Joon
> International Engineering at Verity, Inc.
>
>
> ----- Original Message -----
> From: Addison Phillips <AddisonP@simultrans.com>
> To: <ykang@verity.com>
> Cc: <unicode@unicode.org>
> Sent: Thursday, May 06, 1999 4:17 PM
> Subject: RE: Question...
>
>
> >Hi Joon,
> >
> >By "NULL represent a character", do you mean the character NULL, 0x0000?
> >Or do you mean NULL as half of a character (e.g. one byte of NULL, such
> >as 0x0032 or 0x3200)?
> >
> >If the former, the NULL character gives compatibility for programs
> >written in C that use NULL as a string terminator.
> >If the latter, NULL represents half of the character and cannot (well,
> >should not) be eliminated without changing the encoding.
> >
> >If you wish to send data that contains no nulls (other than string
> >terminators), consider converting your data to UTF-8 or UTF-7. These
> >encodings were created for file-system and mail-system safety in
> >particular and have the advantage of containing no NULLs. UTF-8 is
> >commonly used for compatibility purposes such as this.
> >
> >The downside of using these encodings is that, while they "compress"
> >English UNICODE data, your Asian characters will average MORE THAN 2
> >bytes per character.
> >
> >The encoding functions for UTF-7 and UTF-8 are out on the web, and both
> >are small.
> >The definition and encoding for UCS-2 is on Unicode's web site (well,
> >conversion tables are... the definition is long and you should buy the
> >book). You can find some sample code that might help on our FTP site
> >(ftp://ftp.simultrans.com/anonymous). You're not going to write your own
> >character converter, though, are you? I thought you guys were a Windows
> >shop?
> >
> >Addison
> > __________________________________________
> >
> > Addison Phillips
> > Director, Globalization Services
> > SimulTrans, L.L.C.
> > 2606 Bayshore Parkway
> > Mountain View, California 94043 USA
> >
> > +1 650-526-4652 (direct telephone)
> > +1 650-969-9959 (facsimile)
> > AddisonP@simultrans.com (Internet email)
> > http://www.simultrans.com (website)
> >
> > "22 languages. One release date."
> > __________________________________________
>
>



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:46 EDT