Re: Perception that Unicode is 16-bit (was: Re: Surrogate space i

From: Tex Texin (
Date: Thu Feb 22 2001 - 17:35:56 EST

Peter, good points.
What's clear from this discussion is that when somebody asks about
the encoding of Unicode, the right response is "Why do you want to
know?", not this elaboration of terminology.

If they want to know maximum character count, tell them 1M+.
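
For anyone who wants to check the 1M+ figure, here is a minimal Python
sketch of the arithmetic. It assumes the code space runs from 0 to
0x10FFFF, as described in the quoted message below, and that the 2,048
surrogate values (U+D800 through U+DFFF) are not counted as encodable
characters. The variable names are only for illustration.

```python
import math

MAX_CODE_POINT = 0x10FFFF

# Every integer from 0 through 0x10FFFF inclusive.
code_points = MAX_CODE_POINT + 1

# Assumption: the surrogate range U+D800..U+DFFF is excluded from the count.
surrogates = 0xDFFF - 0xD800 + 1
scalar_values = code_points - surrogates

# Smallest whole number of bits that can hold the largest value.
bits_needed = MAX_CODE_POINT.bit_length()

print(code_points)                         # 1114112
print(scalar_values)                       # 1112064
print(bits_needed)                         # 21
print(round(math.log2(code_points), 2))    # 20.09, the source of "20.1 bits"
```

Either count is comfortably "over a million", which is all most people
asking the question need to hear.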

If they want to know whether it's compatible with their API, you
can tell them they have a choice of 8-, 16- or 32-bit models.
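
As a rough illustration of that choice, the Python sketch below encodes
one character in each of the three forms and counts the code units. The
sample character (U+10400, chosen because it lies above U+FFFF) and the
big-endian byte order are my assumptions for the example, not anything
the discussion prescribes.

```python
# One character outside the 16-bit range, so the three forms visibly differ.
ch = "\U00010400"

utf8 = ch.encode("utf-8")       # 8-bit code units
utf16 = ch.encode("utf-16-be")  # 16-bit code units, big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # 32-bit code units, big-endian, no BOM

# Number of code units = byte length divided by the unit size in bytes.
units = {
    "UTF-8": len(utf8) // 1,
    "UTF-16": len(utf16) // 2,
    "UTF-32": len(utf32) // 4,
}

print(utf8.hex())   # f0909080
print(utf16.hex())  # d801dc00 (a surrogate pair)
print(utf32.hex())  # 00010400
print(units)        # {'UTF-8': 4, 'UTF-16': 2, 'UTF-32': 1}
```

The same character is four units in one model, two in another and one in
the third, which is why "how many bits is Unicode?" has no single answer
at the API level.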

If they want to know how difficult it is to work with relative
to single-byte, double-byte or other standards they are
familiar with, then there needs to be some setting of context for
their application and some mention of
the encoding alternatives, surrogates, and normalization.
Depending on their application or prior experience, they
may also need some side trips on collation,
bidi, and fonts.

Then they certainly need some reminder of the benefits of Unicode.

People often don't ask the question that they really
need answered.

An answer of 20.1 bits, while amusing to us, is confounding to
the rest of the world. There is a time and place for precision
(for example, the standard itself should make accurate use of the
terminology that is being wrestled with below), but there is also an
appropriate time and place for big picture without so much

So, the right answer to
"What exactly _would_ be wrong with calling UNICODE a
thirty-two bit encoding?"

is, "If I tell you yes, what will it mean or imply to you, and if
I tell you no, what will that mean? And can you tell me
more about why you are asking?"

tex wrote:
> > What exactly _would_ be wrong with calling UNICODE a
> > thirty-two bit encoding?
> In part, it's the ambiguity or lack of clarity involved when we say "an
> encoding". What's an encoding? I think most people (I certainly used to)
> think of a character encoding as a collection of characters each of which
> is given some numeric, digital representation, and that a given encoding
> assumes some particular datatype for those representations. Well, folks,
> that's just too simplistic these days. Unicode is a character encoding
> system that employs a model involving multiple levels of description and
> representation, and there isn't a single datatype that applies to all those
> levels. Indeed, some of the levels don't have *any* associated datatype.
> Levels:
> 1. Character repertoire: just an ordered set of abstract characters without
> any associated numbers, so no datatype associated
> 2. Coded character set: each of the abstract characters is assigned a
> non-negative integer value, known as a Unicode scalar value. There is no
> associated datatype, but the integers range from 0 to 0x10FFFF (these are
> conventionally represented using U+ notation). It just happens that this
> range can be directly represented in a 32-bit datatype, but it's
> equally true that this range can be represented in a 512-bit datatype.
> There is still no particular datatype associated with this level. (The
> range can be represented in as little as 21 bits. 21 bits can actually
> represent up to 0x1FFFFF, and so the range takes more than 20 bits but less
> than 21 bits. Thus, it is sometimes said that the range requires 20.1 bits.
> Still, strictly speaking this level has no associated datatype.)
> 3. Encoding form: At this level, each coded character in the CCS is
> represented in a specific datatype, and so finally we can talk about
> bit-widths. The problem is that there isn't just one bit-width that can be
> used: Unicode provides encoding forms that use 8-, 16- or 32-bit datatypes
> - take your pick.
> The remaining levels don't add anything new.
> So, is Unicode a 16-bit encoding? If we think of "an encoding" as having a
> particular datatype, then the only way to answer the question is to talk in
> terms of level 3. Thus, it's reasonable to say "yes" since it has a 16-bit
> encoding form; but it's also reasonable to say that it is also an 8-bit or
> a 32-bit encoding. On the other hand, if people think of the bit-width of
> "an encoding" not as a specific datatype but as an indication of the range
> of possible characters, then the range is determined at level 2, and if we
> need to talk about that range in terms of bit-widths the answer is 20.1.
> The real problem is that the old way of talking about these things no
> longer works. New wine needs to go in new wine skins. If people really want
> to understand Unicode and know "what its bit-width is", then they need to
> understand the details of the multi-level model. There's no way around it.
> If you're talking to someone who just wants to get an idea of how many
> characters it can support and doesn't care about the details, then try one
> of these answers:
> - "It's a 16-bit encoding." The advantage of this is that it's consistent
> with what they might have heard in the past. The disadvantages are that it
> severely underestimates the range of potential characters (OK if they're
> not good with binary arithmetic), and it leads to lots of
> misunderstandings.
> - "It's a 32-bit encoding." This better reflects the range of potential
> characters -- more than we'll probably ever need. The disadvantages are
> that it probably differs from what they may have heard in the past, and it
> can lead to similar kinds of misunderstandings as the previous answer (just
> that everything's assumed to be 32 bits instead of 16 bits).
> - "It supports over a million characters." This gives the best
> characterization of the potential number of characters, and avoids the
> misunderstandings that can arise from the previous two answers. The
> disadvantage is that they may still want to hear a particular bit-width
> named.
> - "It's a 20.1-bit encoding." This is probably the answer that gives them a
> bit-width while also sticking to some semblance of fact and also avoiding
> the potential misunderstandings of the first two answers (20.1 bits clearly
> doesn't correspond to a datatype). The "disadvantage" is that it will lead
> to further questions that will require explaining the multi-level model.
> But maybe that's not a disadvantage. Maybe the education needs to be
> propagated more.
> - Peter
> ---------------------------------------------------------------------------
> Peter Constable
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <>

According to Murphy, nothing goes according to Hoyle.
Tex Texin                      Director, International Business      +1-781-280-4271 Fax:+1-781-280-4655
Progress Software Corp.        14 Oak Park, Bedford, MA 01730 #1 Embedded Database

Globalization Program ---------------------------------------------------------------------------

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT