Re: Unicode FAQ addendum

From: Paul Keinänen (keinanen@sci.fi)
Date: Sun Jul 23 2000 - 14:40:04 EDT


At 16.10 22.7.2000 -0800, jgo wrote:
>> Addison wrote:
>> 1. 1 byte != 1 character: deal with it.
>
>Hmm, depends on how you define "byte".
>I've seen them in 8-bit, 12-bit, 16-bit and 18-bit varieties.
>
>The trouble, though, is that 1 character (in this context)
>can be represented by from 16 bits to 6*16 bits.

At least in the C99 proposal for the C language, the "char" data type should
be big enough to hold one character of the execution character set. A byte is
defined as the smallest directly addressable unit of storage, but no separate
"byte" data type is available in the C language.

Assume an old 36-bit mainframe whose smallest addressable unit is one 36-bit
word, with UCS-2 or UTF-16 as the execution character set, and hardware that
provides some half-word (18-bit) instructions. One could then define the byte
as 36 bits and the "char" data type as 18 bits. Can anyone verify whether this
is legal according to the official version of the C language?
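
(On a hosted implementation one can at least ask the compiler what it
decided; a minimal C99 sketch, with nothing mainframe-specific assumed:)

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* CHAR_BIT is the number of bits in a char, i.e. in a C "byte";
       the standard only guarantees CHAR_BIT >= 8. */
    printf("CHAR_BIT     = %d\n", CHAR_BIT);

    /* sizeof counts in C bytes, so sizeof(char) is 1 by definition,
       whatever the hardware word size happens to be. */
    printf("sizeof(char) = %d\n", (int)sizeof(char));
    printf("sizeof(int)  = %d\n", (int)sizeof(int));
    return 0;
}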

While this is an artificial example, more up-to-date DSP and RISC processors
may use quite unusual data sizes.

It would be better to talk about octets when 8-bit quantities are meant.

Paul Keinanen

P.S.

If 64-bit RISC processors (e.g. the IA-64 architecture from Intel) become more
common, do we need a UTF-64 to store three 21-bit Unicode characters (0000 ..
10FFFF) in a single 64-bit word (with one bit unused)?
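
The packing itself would be trivial in C99 (the names utf64_pack and
utf64_unpack are, of course, made up):

#include <stdint.h>

/* Hypothetical "UTF-64": three 21-bit code points (U+0000..U+10FFFF)
   packed into one 64-bit word, most significant bit left unused. */
uint64_t utf64_pack(uint32_t a, uint32_t b, uint32_t c)
{
    return ((uint64_t)a << 42) | ((uint64_t)b << 21) | (uint64_t)c;
}

/* i selects the code point: 0 = first, 1 = second, 2 = third. */
uint32_t utf64_unpack(uint64_t w, int i)
{
    return (uint32_t)(w >> (42 - 21 * i)) & 0x1FFFFF;
}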
 


