Re: What things are called (was Non-ascii string processing)

From: Doug Ewell
Date: Tue Oct 07 2003 - 09:42:20 CST

Jill Ramonsky <Jill dot Ramonsky at Aculab dot com> wrote... Well, one
thing she wrote was:

> :-)

OK, that's out of the way. What follows is not necessarily 100%

> I have invented a new system, Unilib, for organising books in a
> library.
> ... Except that you're not allowed to call them "books" any more,
> because I've already redefined the word "book" to mean "the physical
> expression of a catalogue entry". Since what the user normally
> experiences as a book may actually require several catalogue entries,
> we can no longer use the word "book" for this object. Consequently,
> we need a new word or phrase to describe what the user normally
> experiences as a book. We tried calling them "volumes" back in Unilib
> 3.0, but it turned out that that word was also used for something
> else. So now we call them "default chapter clusters".

Actually, this is a great analogy to what is going on with Unicode
terminology, but probably not for the reason Jill had in mind.

There are plenty of examples of "books" as the user sees them that
contain one or more "books" as the author sees them. The Old and New
Testaments, and similar scriptural and philosophical material in many
belief systems, consist of many "books" that are bound together within a
hard cover. The Book of Genesis would be an awfully thin "book" if it
appeared on the shelf individually. Likewise, many great (and
not-so-great) literary works have been divided into "Book I" and "Book
II" by their authors.

This overloading of the word "book" can indeed lead to confusion and
misunderstanding, as when a high-school student with an assignment to
read and compare two books chooses "Book I" and "Book II" of the same
jointly bound work. When the Springfield Public Library takes an
inventory, they will probably continue to count each copy of the Bible
as one book, not as dozens.

My point is that Jill's Unilib didn't invent this confusion and

Likewise, any character encoding standard that incorporates the concept
of "combining characters" is bound to experience the same sort of
confusion and ambiguity over the term "character." This is not unique
to Unicode; ISO 6937 has this problem as well with its (leading)
non-spacing marks. In ISO 6937, <0x61> is <a>, while <0xC2 0x61> is
<á>. Are both the one-byte and two-byte sequences "characters"? Does
that mean 0x61 is both a character in its own right and *part* of
another character? Do we need a separate word for whatever 0x61
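
To make the byte-level example concrete, here is a minimal Python sketch
of the same ambiguity. The helper decode_iso6937_sketch and its one-entry
table are hypothetical, written only for illustration (Python ships no
ISO 6937 codec that I know of), and it covers nothing beyond ASCII plus
the 0xC2 acute accent from the example above. It moves the leading
non-spacing mark behind its base letter, as Unicode expects, and then
counts "characters" three different ways.

```python
import unicodedata

# Hypothetical one-entry table: ISO 6937 leading non-spacing acute (0xC2)
# mapped to the Unicode combining acute accent U+0301.
ISO6937_NONSPACING = {0xC2: "\u0301"}


def decode_iso6937_sketch(data: bytes) -> str:
    """Illustrative decoder for ASCII plus the 0xC2 acute accent only."""
    out = []
    pending = ""  # marks seen before their base letter
    for byte in data:
        if byte in ISO6937_NONSPACING:
            # ISO 6937 puts the mark first; hold it until the base arrives.
            pending += ISO6937_NONSPACING[byte]
        elif byte < 0x80:
            # Unicode puts the base first, then its combining marks.
            out.append(chr(byte) + pending)
            pending = ""
        else:
            raise ValueError(f"byte 0x{byte:02X} not covered by this sketch")
    return "".join(out)


plain = decode_iso6937_sketch(b"\x61")         # one byte
accented = decode_iso6937_sketch(b"\xC2\x61")  # two bytes
composed = unicodedata.normalize("NFC", accented)

# Bytes, decomposed code points, and composed code points disagree
# about how many "characters" the accented letter is.
print(len(b"\xC2\x61"), len(accented), len(composed))
```

The two-byte sequence becomes two code points (U+0061 U+0301) when
decomposed and one code point (U+00E1) after NFC normalization, so even
this tiny case gives three defensible answers to "how many characters?"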

Unicode greatly expanded the potential for this sort of complication, by
encoding all the lexical symbols (or whatever) of almost all modern
scripts and many archaic ones, and introducing many more types of
combining marks and interactions between them than any previous
character encoding. Unicode has also tried to reduce the confusion, by
introducing new terms. Sometimes the terms add confusion here as they
take it away there, but our only real alternative is to go back to the
days when we couldn't really talk about these things because they had no


-Doug Ewell
 Fullerton, California

This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST