Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in Unicode)

From: Kenneth Whistler (
Date: Tue Feb 20 2001 - 22:32:12 EST

Paul Keinänen said:

> >[86-M8] Motion: Amend Unicode 3.1 to change the Chapter 3, C1 conformance
> >clause to read "A process shall interpret Unicode code units (values) in
> >accordance with the Unicode transformation format used." (passed)
> While this wording makes it possible to handle any 32 bit character
> API implementation as UTF-32, this wording does not make it any easier
> to implement it on processors with an exotic word length. Depending
> how "process" is defined, but a character API implementation on a 24
> bit computer using one word/character could be non-conformant, even if
> the 24 bits (or even 21 bit :-) would be more than sufficient to
> support the 0 .. 10FFFF range.

To the contrary--nothing in the wording of UTF-32 prevents an implementation
in 24-bit words on a processor that uses such words.

The basic definitions of UTF-32 are talking about *serialization*, in
which case you are talking about sequences of 4 (8-bit) bytes, and
the three encoding schemes: UTF-32BE, UTF-32LE, and UTF-32. This is
serialization for interchange of data.
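The byte-serialization point can be sketched in a few lines of C. This is an illustrative sketch, not text from the standard: the function name `utf32be_put` is invented here, and it shows only the big-endian byte order of the UTF-32BE encoding scheme (UTF-32LE simply reverses the four bytes).

```c
#include <stdint.h>

/* Serialize one Unicode scalar value as a sequence of four 8-bit bytes,
   most significant byte first (the UTF-32BE encoding scheme). */
static void utf32be_put(uint32_t scalar, uint8_t out[4]) {
    out[0] = (uint8_t)(scalar >> 24);  /* always 0x00 for values <= 0x10FFFF */
    out[1] = (uint8_t)(scalar >> 16);
    out[2] = (uint8_t)(scalar >> 8);
    out[3] = (uint8_t)(scalar);
}
```

Under this scheme U+10FFFF, the highest scalar value, serializes as the bytes 00 10 FF FF; for UTF-32LE the same value would be emitted in the opposite order.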

As an encoding *form* (i.e. not serialized, but instead with characters
represented in computer datatypes), the assumption is that each
Unicode scalar value will be represented in a 32-bit word, since that
is the most common architecture that people would be using. But
nothing would prevent putting them in 64-bit registers, for example,
or 24-bit registers (since they fit).

The only thing you need to watch out for is that if you *publish*
a UTF-32 API outside of a self-contained environment, you had better
make sure that it is using unsigned 32-bit integers, as that is
the expectation that would be required for interoperating with other
systems. But the same caution would apply to any public API involving
integral datatypes -- you cannot willy-nilly pass integral data
between a 32-bit API and a 24-bit API.
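The caution above amounts to pinning the datatype at the public boundary. A minimal sketch, with invented names (`utf32_unit`, `utf32_length`): the API commits to an exactly-unsigned-32-bit code unit via C99's `uint32_t`, while internal storage behind the boundary could use whatever word width the machine prefers.

```c
#include <stdint.h>
#include <stddef.h>

/* Public boundary: an API that claims "UTF-32" commits to unsigned
   32-bit code units, regardless of the machine's native word length. */
typedef uint32_t utf32_unit;

/* Count the code units in a zero-terminated UTF-32 buffer. */
size_t utf32_length(const utf32_unit *s) {
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

A 24-bit or 64-bit implementation is free to hold scalar values in its own registers internally; it only has to marshal them into `uint32_t` at an interface like this one.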

> It would have been clearer that C1 would only define that code points
> in the 0 .. 10FFFF range should be supported,

That is everywhere implied in the Unicode Standard. There *are* no code points
beyond 10FFFF.
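That bound is easy to express as a validity check. A sketch, assuming the usual definition of a Unicode scalar value (0..10FFFF, excluding the surrogate code points D800..DFFF, which no transformation format may represent on its own); the function name is invented here:

```c
#include <stdint.h>
#include <stdbool.h>

/* True if v is a Unicode scalar value: within 0..10FFFF and not a
   surrogate code point (D800..DFFF). */
static bool is_unicode_scalar(uint32_t v) {
    return v <= 0x10FFFFu && !(v >= 0xD800u && v <= 0xDFFFu);
}
```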

> allowing character API
> implementations (such as dynamically loadable libraries as separate
> products) for processors with exotic word lengths

Allowed. Although I suppose we should add a note in the future pointing
out that 64-bit and 24-bit implementations are to be expected, although
not in a public API that claims it is "UTF-32".


> and in a separate
> clause say something about the transformation formats.

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT