Abnormal Bytes and Unicode: (was Re: Unicode FAQ addendum)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 24 2000 - 15:05:23 EDT


Paul Keinanen wrote:

> At 16.10 22.7.2000 -0800, jgo wrote:
> >> Addison wrote:
> >> 1. 1 byte != 1 character: deal with it.
> >
> >Hmm, depends on how you define "byte".
> >I've seen them in 8-bit, 12-bit, 16-bit and 18-bit varieties.
> >
> >The trouble, though, is that 1 character (in this context)
> >can be represented by from 16 bits to 6*16 bits.
>
> At least in the C99 proposal for the C-language, the "char" data type should
> be big enough to fit one character of the execution character set. A byte
> was defined as the smallest directly addressable unit, but the byte data
> type is not directly available in the C-language.

Except, of course, that it has to be available in all current significant
software, which is why you typically find definitions like the following
in the fundamental header files of most big systems:

typedef unsigned char BYTE;

And if you actually tried to define "char" to be 32 bits wide, i.e.,
big enough to fit one character when using UTF-32 as your execution
character set, all hell would proceed to break loose in every system
I've ever seen.
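
To make the point concrete: in C, sizeof(char) is 1 by definition, and
CHAR_BIT in <limits.h> gives the number of bits in that unit. So a
32-bit "char" means a 32-bit byte, and every piece of code that assumes
8-bit bytes goes with it. A minimal sketch follows; the helper names
are made up for illustration and come from no real header.

```c
#include <limits.h>   /* CHAR_BIT: bits per byte on this implementation */
#include <stddef.h>   /* size_t */

/* Hypothetical helper: number of bits in an object of the given size. */
static size_t bits_in(size_t size_in_bytes)
{
    return size_in_bytes * (size_t)CHAR_BIT;
}

/* Hypothetical helper: nonzero when plain "char" cannot hold every
   Unicode code point (U+10FFFF needs 21 bits). */
static int char_too_narrow_for_code_points(void)
{
    return bits_in(sizeof(char)) < 21;
}
```

On the usual 8-bit-byte systems the second helper returns nonzero, which
is exactly why "char" cannot be the one-character-per-unit type for
UTF-32 there.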

The C language definition tried to abstract away datatype size differences
in the various machine architectures available in the early days. But
the model is obsolete--dead as the days when all those heterogeneous computer
architectures existed in isolated little worlds by themselves. Now we
have massively interconnected systems sharing data and bytecode. They *have*
to agree on their datatype sizes.

So the first step to interoperability in big, interconnected system
software using C is to set up fundamental header files containing
well-defined datatypes of fixed sizes, to make up for the lack of same
in the definition of C itself. The lack of fixed-size datatypes in C
is now a *defect* in the language, and not an *asset* of the language.
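
As a rough sketch of what such a fundamental header might contain: the
version below leans on the exact-width types in C99's <stdint.h> and
checks the assumptions at compile time with C11's static_assert. Apart
from BYTE, the typedef names are hypothetical, and a pre-C99 system
would have to spell these out per platform instead.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uint8_t, uint16_t, uint32_t (C99) */

/* Fixed-size datatypes. BYTE matches the name used above; UTF16 and
   UTF32 are illustrative names, not taken from any particular system. */
typedef uint8_t  BYTE;   /* one 8-bit unit, e.g. a UTF-8 code unit */
typedef uint16_t UTF16;  /* one UTF-16 code unit */
typedef uint32_t UTF32;  /* one UTF-32 code unit */

/* Fail the build, not the program, if the platform disagrees. */
static_assert(CHAR_BIT == 8,      "this sketch assumes 8-bit bytes");
static_assert(sizeof(BYTE)  == 1, "BYTE must be 1 byte");
static_assert(sizeof(UTF16) == 2, "UTF16 must be 2 bytes");
static_assert(sizeof(UTF32) == 4, "UTF32 must be 4 bytes");
```

Everything else in the system then uses these names rather than bare
"char", "short", or "long".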

>
> Assuming an old 36 bit mainframe with smallest addressable unit of one 36
> bit word and UCS-2 or UTF-16 as the execution character set and the hardware
> containing some half word (18 bit) instructions, thus one could define byte
> as 36 bits and "char" data type as 18 bits. Can anyone verify if this is
> legal according to the official version of the C-language ?

If this were a serious example, and
If such a system weren't already hard-wired to some EBCDIC or other, and
If someone undertook the foolish project to backport Unicode onto a system which
    had no supporting infrastructure, ...

The right choice in this case would be to use UTF-32 and store each character
in one machine word.

>
> While this is an artificial example, more up to date DSP and RISC processors
> may show quite strange data sizes.
>
> It would be better to talk about octets when 8-bit quantities are meant.

There is a subcommunity in standards that insists on this, but for most
people in programming, there is no point to it.

For 99% of the programming done in C (and most other current languages),
a "byte" *is* an 8-bit datatype. (And "octet" is an exotic term used
in character standards to confuse people who already know what a byte is. ;-) )

Note that in a *modern* language like Java, the size and signedness of
the primitive datatypes *is* well-defined:

boolean = true or false (storage size not specified by the language)
char = 16 bits (unsigned)
byte = 8 bits (signed)
short = 16 bits (signed)
int = 32 bits (signed)
long = 64 bits (signed)

>
> Paul Keinanen
>
> P.S.
>
> If 64 bit RISC processors (e.g. IA-64 architecture from Intel) become more
> common, do we need UTF-64 to store three 21 bit Unicode characters (0000 ..
> 10FFFF) into a single 64 bit word (with one unused bit) ?

No, of course not.

I write a UTF-16 Unicode library using C. It is routinely ported to 64 bit processors,
and has been running just fine on them for years now, thank you. It also
processes UTF-8 (using BYTEs, I might add) with no trouble.

For a system that wanted to implement UTF-32, it would port easily to
64 bit processors as well. In storage UTF-32 characters take up 32 bits,
just like any other 32-bit datatype. In transient processing they just go
in and out of the 64 bit registers like any other 32-bit datatype.
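
For illustration only, here is roughly what that looks like with
fixed-size types: a sketch that decodes one code point from UTF-16 into
a 32-bit value. The typedef and function names are hypothetical (this is
not code from any actual library), and passing an unpaired surrogate
through unchanged is an assumption of the sketch, not a recommendation.
Nothing in it depends on the register width of the machine.

```c
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint16_t, uint32_t */

typedef uint16_t UTF16;  /* hypothetical name: one UTF-16 code unit */
typedef uint32_t UTF32;  /* hypothetical name: one UTF-32 code unit */

/* Decode one code point starting at s[0]; n is the number of code
   units available and must be at least 1. Stores the result in *out
   and returns the number of code units consumed (1 or 2). */
static size_t utf16_decode(const UTF16 *s, size_t n, UTF32 *out)
{
    UTF32 hi = s[0];

    /* High surrogate followed by a low surrogate: combine the pair. */
    if (hi >= 0xD800 && hi <= 0xDBFF && n >= 2) {
        UTF32 lo = s[1];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            *out = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            return 2;
        }
    }

    /* BMP character, or an unpaired surrogate passed through as-is. */
    *out = hi;
    return 1;
}
```

The UTF32 result occupies 32 bits in storage whether the code is
compiled for a 32 bit or a 64 bit processor.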

--Ken


