What Unicode Is (was RE: Inappropriate Proposals FAQ)

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Jul 12 2002 - 18:54:31 EDT


Suzanne responded:

> > Maybe Unicode is more of a shared set of rules that apply to
> > low level data structures surrounding text and its algorithms
> > than a character set.
>
> Sounds like the start of a philosophical debate.
>
> If Unicode is described as a set of rules, we'll be in a world of hurt.

> (On a serious note, these exceptions are exactly what make writing some
> sort of "is and isn't" FAQ pretty darned hard.

Hmm. Since the discussion which started out trying to specify a
few examples of what kinds of entities would be inappropriate to
proffer for encoding as Unicode characters seems to be in danger
of mutating into the recurrent "What is Unicode?" question,
perhaps it's time to start a new thread for the latter.

And now for some ontological ground rules.

When trying to decide what a "thing" is, it helps not to use
an attribute nominatively, since that encourages people to
privately visualize the noun the attribute is applied to,
but to do so in different ways -- and then to argue past each
other because they are, in the end, talking about different
things.

"Unicode" is used attributively of a number of things, and
if we are going to start arguing/discussing what "it" is, it
would be better to lay out the alternative "it"s a little
more specifically first.

1. The Unicode *Consortium* is a standardization organization.
It started out with a charter to produce a single standard,
but along the way has expanded that charter, in response to
the desire of its membership. In addition to "The Unicode
Standard", it now has adopted a terminology that refers to
some of its other publications as "Unicode Technical Standards"
[UTS], of which two formally exist now: UTS #6 SCSU, and
UTS #10 Unicode Collation Algorithm [UCA].

It is important to keep this straight, because some people,
when they say "Unicode", are talking about the *organization*,
rather than the Unicode Standard per se. And when people talk
about "the standard", they are generally referring to "The
Unicode Standard", but the Unicode Consortium is actually
responsible for several standards.

2. The Unicode *Standard* itself is a very complex standard, consisting
of many pieces now. To keep track of just what something like
"The Unicode Standard, Version 3.2" means, we now have to
keep web pages enumerating all the parts exactly -- like
components in an assemble-your-own-furniture kit. See:
http://www.unicode.org/unicode/standard/versions/

In any one particular version, the Unicode Standard now consists
of a book publication, some number of web publications
(referred to as Unicode Standard Annexes [UAX]), and a
large number of contributory data files -- some normative and
some informative, some data and some documentation. These
definitions, including the exact list of contributory
data files and their versions, are themselves under tight
control by the Unicode Technical Committee, as they constitute
the very *definition* of the Unicode Standard. It is not
by accident that the version definitions start off now with
the following wording:

"The Unicode Standard, Version 3.2.0 is defined by the following
list..."

and so on for earlier versions.
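
To make the versioning point concrete: any implementation is built
against one particular version of the data files. The following is
only a sketch in Python (chosen for illustration; nothing in the
standard prescribes it). It assumes the standard library's unicodedata
module, which reports the version of the Unicode Character Database it
was built from and also carries a frozen 3.2.0 copy. The helper name
version_tuple is hypothetical.

```python
import unicodedata

def version_tuple(db) -> tuple:
    """Split a 'major.minor.update' version string into integers."""
    return tuple(int(part) for part in db.unidata_version.split("."))

# Version of the data files this Python build ships with
# (the value depends on the Python release in use).
print(unicodedata.unidata_version)

# Python also keeps a frozen copy of the 3.2.0 database, for protocols
# that are pinned to that exact version.
print(version_tuple(unicodedata.ucd_3_2_0))  # (3, 2, 0)
```

Comparing the two tuples is one simple way for a program to check which
version's definitions it is actually running against.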

3. The Unicode *Book* is a periodic publication, constituting the
central document for any given version of the Unicode *Standard*,
but is by no means the entire standard. The book, in turn,
is very complex, consisting of many chapters and parts, some
of which constitute tightly controlled, normative specification,
and some of which are informative, editorial content.

The "book" now also exists in an online version (pdf files):
http://www.unicode.org/unicode/uni2book/u2.html
which is *almost* identical to the published hardcover book,
but not quite. (The Introduction is slightly restructured,
the online glossary is restructured and has been added to,
the charts are constructed slightly differently and have
introductory pages of their own, etc.)

4. The Unicode *CCS* [coded character set] is the mapping of the
set of abstract characters contained in the Unicode repertoire
(at any given version) to a bunch of code points in the
Unicode codespace (0x0000..0x10FFFF). Technically speaking, it
is the Unicode *CCS* which is synchronized closely with
ISO/IEC 10646, rather than the Unicode *Standard*. 10646 and
the Unicode CCS have exactly the same coded characters (at
various key synchronization points in their joint publication
histories), but the *text* of the ISO/IEC 10646 standard doesn't
look anything like the *text* of the Unicode Standard, and the
Unicode Standard [sensum #2 above] contains all kinds of
material, both textual and data, that goes far beyond the scope
of 10646.

There are other standards produced by some national
bodies that are effectively just translations of 10646
(GB 13000 in China, JIS X 0221 in Japan), but the Unicode Standard
is nothing like those.
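
As a rough illustration of the CCS in sense #4, here is a minimal
Python sketch of the character-to-code-point mapping and of the
codespace bounds quoted above. The names CODESPACE_MIN, CODESPACE_MAX,
in_codespace and code_point are hypothetical, introduced only for this
example; ord() and chr() are Python built-ins.

```python
# Codespace bounds as quoted in the text: 0x0000..0x10FFFF.
CODESPACE_MIN = 0x0000
CODESPACE_MAX = 0x10FFFF

def in_codespace(value: int) -> bool:
    """True if the integer lies inside the Unicode codespace."""
    return CODESPACE_MIN <= value <= CODESPACE_MAX

def code_point(ch: str) -> str:
    """Return the conventional U+XXXX notation for one character."""
    return f"U+{ord(ch):04X}"

print(code_point("A"))          # U+0041
print(code_point("\u20ac"))     # U+20AC
print(in_codespace(0x10FFFF))   # True
print(in_codespace(0x110000))   # False

# Python enforces the same upper bound: chr() rejects larger values.
try:
    chr(CODESPACE_MAX + 1)
except ValueError:
    print("outside the codespace")
```

Note that this shows only the mapping itself. Whether a given code
point has a character assigned to it depends on the version of the
repertoire, which the sketch does not check.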

Finally, the attribute "Unicode ..." can be applied to all
kinds of other "things" characteristic of the Unicode Standard,
including algorithms for the manipulation of characters.
Obvious examples which come to mind are "Unicode Bidirectional
Algorithm", "Unicode Normalization", "Unicode Encoding Forms",
and "Unicode Character Properties".
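
To show how distinct some of these "things" are in practice, here is a
small Python sketch that touches three of them separately:
normalization, encoding forms, and character properties. It assumes
Python's standard unicodedata module and built-in codecs, which track a
later version of the standard than 3.2; the sample string is my own
choice.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"

# Unicode Normalization: NFC composes the pair into U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))   # 2 1

# Unicode Encoding Forms: the same code point, three serializations.
print(composed.encode("utf-8"))         # b'\xc3\xa9'
print(composed.encode("utf-16-be"))     # b'\x00\xe9'
print(composed.encode("utf-32-be"))     # b'\x00\x00\x00\xe9'

# Unicode Character Properties: data about the character itself.
print(unicodedata.name(composed))       # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.category(composed))   # Ll

# The Bidirectional Algorithm takes a property like this as its input:
print(unicodedata.bidirectional("\u05d0"))  # R (HEBREW LETTER ALEF)
```

None of these calls changes the coded character set; they are
algorithms and data layered on top of it.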

O.k., so now before asserting or denying that "Unicode ... is
a shared set of rules", it would be helpful to pin down
first what you are referring to. That might make the ensuing
debate more fruitful.

--Ken


