Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Peter_Constable@sil.org
Date: Thu Feb 22 2001 - 12:37:19 EST


> What exactly _would_ be wrong with calling UNICODE a
> thirty-two bit encoding

In part, it's the ambiguity or lack of clarity involved when we say "an
encoding". What's an encoding? I think most people (I certainly used to)
think of a character encoding as a collection of characters each of which
is given some numeric, digital representation, and that a given encoding
assumes some particular datatype for those representations. Well, folks,
that's just too simplistic these days. Unicode is a character encoding
system that employs a model involving multiple levels of description and
representation, and there isn't a single datatype that applies to all those
levels. Indeed, some of the levels don't have *any* associated datatype.

Levels:

1. Character repertoire: just a set of abstract characters without any
associated numbers, so there is no associated datatype

2. Coded character set: each of the abstract characters is assigned a
non-negative integer value, known as a Unicode scalar value. There is no
associated datatype, but the integers range from 0 to 0x10FFFF (these are
conventionally written using U+ notation). It just happens that this range
can be directly represented in a 32-bit datatype, but it's equally true
that it can be represented in a 512-bit datatype. There is still no
particular datatype associated with this level. (The range can be
represented in as little as 21 bits. 21 bits can actually represent values
up to 0x1FFFFF, so the range takes more than 20 bits but less than 21 bits;
thus it is sometimes said that the range requires 20.1 bits. Still,
strictly speaking, this level has no associated datatype.)

3. Encoding form: At this level, each coded character in the CCS is
represented in a specific datatype, and so finally we can talk about
bit-widths. The problem is that there isn't just one bit-width that can be
used: Unicode provides encoding forms that use 8-, 16- or 32-bit datatypes
- take your pick. (The short sketch after this list shows the same
characters in all three.)

The remaining levels don't add anything new.
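
To make level 3 concrete, here is a minimal sketch (in present-day Python,
purely illustrative; the scalar values are arbitrary examples I picked, not
anything from the standard) of the same coded characters expressed in each
of the three encoding forms:

    # Each encoding form represents a coded character as a sequence of
    # code units of a fixed width: 8, 16, or 32 bits.
    for cp in (0x0041, 0x6C34, 0x10400):
        ch = chr(cp)
        utf8 = ch.encode("utf-8")        # 8-bit code units
        utf16 = ch.encode("utf-16-be")   # 16-bit code units
        utf32 = ch.encode("utf-32-be")   # 32-bit code units
        print("U+%04X: UTF-8 %d unit(s), UTF-16 %d unit(s), UTF-32 %d unit(s)"
              % (cp, len(utf8), len(utf16) // 2, len(utf32) // 4))

One and the same coded character comes out as one to four 8-bit units, one
or two 16-bit units, or exactly one 32-bit unit, depending on which
encoding form you pick. None of these is more "the" bit-width of Unicode
than the others.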

So, is Unicode a 16-bit encoding? If we think of "an encoding" as having a
particular datatype, then the only way to answer the question is to talk in
terms of level 3. Thus, it's reasonable to say "yes", since it has a 16-bit
encoding form; but it's equally reasonable to say that it is an 8-bit or a
32-bit encoding. On the other hand, if people think of the bit-width of "an
encoding" not as a specific datatype but as an indication of the range of
possible characters, then that range is determined at level 2, and if we
need to talk about it in terms of bit-widths the answer is 20.1.
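
If the 20.1 figure seems odd, the arithmetic is easy to check (again just
an illustrative snippet in Python, nothing normative about it):

    import math
    code_points = 0x10FFFF + 1     # 1,114,112 values: 0 through 0x10FFFF
    print(math.log2(code_points))  # about 20.09: more than 20, less than 21 bits
    print(2 ** 20, 2 ** 21)        # 1,048,576 and 2,097,152 bracket the range

Twenty bits aren't quite enough and twenty-one are slightly too many, hence
the informal "20.1 bits".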

The real problem is that the old way of talking about these things no
longer works. New wine needs to go in new wine skins. If people really want
to understand Unicode and know "what its bit-width is", then they need to
understand the details of the multi-level model. There's no way around it.

If you're talking to someone who just wants to get an idea of how many
characters it can support and doesn't care about the details, then try one
of these answers:

- "It's a 16-bit encoding." The advantage of this is that it's consistent
with what they might have heard in the past. The disadvantages are that it
severely underestimates the range of potential characters (OK if they're
not good with binary arithmetic), and it leads to lot's of
misunderstandings.

- "It's a 32-bit encoding." This better reflects the range of potential
characters -- more than we'll probably ever need. The disadvantages are
that it probably differs from what they may have heard in the past, and it
can lead to similar kinds of misunderstandings as the previous answer (just
that everything's assumed to be 32 bits instead of 16 bits).

- "It supports over a million characters." This gives the best
characterization of the potential number of characters, and avoids the
misunderstandings that can arise from the previous two answers. The
disadvantage is that they may still want to hear a particular bit-width
named.

- "It's a 20.1-bit encoding." This is probably the answer that gives them a
bit-width while also sticking to some semblance of fact and also avoiding
the potential misunderstandings of the first two answers (20.1 bits clearly
doesn't correspond to a datatype). The "disadvantage" is that it will lead
to further questions that will require explaining the multi-level model.
But maybe that's not a disadvantage. Maybe the education needs to be
propogated more.
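
For anyone who wants the numbers behind those comparisons, this tiny
snippet (illustrative Python once more) shows why "16-bit" badly
understates the range and "32-bit" wildly overstates it:

    print(2 ** 16)         # 65,536: what a 16-bit answer suggests
    print(0x10FFFF + 1)    # 1,114,112: the actual number of code points
    print(2 ** 32)         # 4,294,967,296: what a 32-bit answer suggests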

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


