Re: Double Byte enabled

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Apr 05 2000 - 21:39:00 EDT


Suzanne,

> "Unicode enabled" is probably the clearest term, but I would appreciate
> comments on historic use of "multibyte enabled".

"multibyte enabled" was simply the extension of the term "doublebyte enabled"
to cover Chinese character sets, which spilled over the bounds of a two-byte
encoding. EUC-CNS has 1-, 2-, 3-, and 4-byte forms for characters.
(See Ken Lunde's great book for details on all of this.)

But the "multibyte" character sets are otherwise structured just like
the "doublebyte" character sets -- they have single-byte ASCII, they
avoid control byte values and coexist with ISO 2022 constraints, they have
lead byte and trail byte ranges (which may or may not overlap), and
fundamentally, they are all bound to a "char" (8-bit) datatype.

Unicode is a different animal altogether. In its UTF-16 encoding form,
it is bound to a "wyde" (unsigned short, 16-bit) datatype. It does not have
single-byte ASCII, and so on. And, in contrast to all other
character encodings, it is *the* universal character encoding.
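
To make the "no single-byte ASCII" point concrete, the sketch below dumps
the bytes of the same ASCII text in a doublebyte encoding and in UTF-16.
It assumes Python's built-in "euc_kr" and "utf-16-be" codecs; the helper
name hex_bytes is hypothetical.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Return the encoded bytes of text as space-separated hex pairs."""
    return text.encode(encoding).hex(" ")


if __name__ == "__main__":
    # In EUC-KR the ASCII letters stay one byte each.
    print("euc_kr   :", hex_bytes("Az", "euc_kr"))
    # In UTF-16 each letter is one 16-bit unit, so a zero byte
    # appears next to every ASCII value.
    print("utf-16-be:", hex_bytes("Az", "utf-16-be"))
```

The zero bytes in the UTF-16 form are one reason that code written
around NUL-terminated "char" strings cannot simply be handed UTF-16 data.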

Doublebyte and multibyte enabling was a matter of breaking the
ASCII equation of character=byte, to allow the possibility that
character=byte,byte,byte... Significantly, it also often involved
rewriting the code for "character set independence", so that no
process assumed that any character byte value meant anything
in particular (at least outside the range of ASCII).
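
A small sketch of what breaks when character=byte is assumed, and of what
"character set independence" can look like in practice: the routine takes
the encoding as a parameter instead of hard-coding what any byte value
means. The function names are hypothetical, and going through Python's
codec machinery is just a convenient stand-in for a per-charset
"next character" routine.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """True if data decodes cleanly in the named encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def truncate_chars(data: bytes, encoding: str, max_chars: int) -> bytes:
    """Keep at most max_chars characters, never splitting one.

    The encoding is a parameter: nothing in this function assumes
    what a particular byte value means.
    """
    text = data.decode(encoding)
    return text[:max_chars].encode(encoding)


if __name__ == "__main__":
    data = "日本語".encode("euc_jp")
    # Three characters, but more than three bytes.
    print(len(data), "bytes")
    # Cutting at a byte offset can land in the middle of a character.
    print(is_well_formed(data[:3], "euc_jp"))
    # Cutting by characters stays on a boundary.
    print(is_well_formed(truncate_chars(data, "euc_jp", 2), "euc_jp"))
```

The same truncate_chars call works unchanged for "big5" or "euc_kr"
data, which is the whole point of not baking byte values into the code.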

Unicode enabling is a matter of breaking the ASCII equation of
character = DATATYPE(char), and reworking the code to understand
that character=wyde[,wyde]. By contrast with doublebyte enabling,
Unicode enabling can go the other way, to eschew character set
independence altogether, since all processes can in fact know *exactly*
what each particular short value in Unicode data means.
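
For the character=wyde[,wyde] case, here is a minimal sketch that views
UTF-16 data as a sequence of 16-bit units and then combines surrogate
pairs back into code points. The function names utf16_units and
units_to_code_points are hypothetical; the surrogate ranges and the
arithmetic are the standard UTF-16 ones. Unpaired surrogates are passed
through as-is, which is a simplification.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def units_to_code_points(units: list) -> list:
    """Combine 16-bit units into code points (one or two units each)."""
    code_points = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Surrogate pair: two units make one character.
            low = units[i + 1]
            code_points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # Every other unit is a character by itself.
            code_points.append(u)
            i += 1
    return code_points


if __name__ == "__main__":
    units = utf16_units("A\U0001D11E")
    print([hex(u) for u in units])
    print([hex(cp) for cp in units_to_code_points(units)])
```

Unlike the lead byte test in a doublebyte encoding, the range checks
here never depend on which character set the data is in.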

--Ken


