Re: Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

From: Kenneth Whistler (
Date: Fri Jul 12 2002 - 21:34:28 EDT

Barry Caplan wrote:

> >> At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
> >> >Unicode is a character set. Period.
> >>
> >> Each character has numerous
> >> properties in Unicode, whereas they generally don't in legacy
> >> character sets.
> >
> >Each character, or some characters?
> For all intents and purposes, each character.
> So, each character has at least one attribute.

Yes. The implications of the Unicode Character Database include
the determination that the UTC has normatively assigned properties
(multiple) to all Unicode encoded characters.

Actually, it is a little more subtle than that. There are some
properties which accrue to code points. The General Category and
the Bidirectional Category are good examples, since they constitute
enumerated partitions of the entire codespace, and API's need to
return meaningful values for any code point, including unassigned ones.
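
To make the code point case concrete, here is a minimal Python sketch
using the standard library's unicodedata module. That module is my choice
for illustration only, and the helper name general_category is made up
for this example. It shows an API handing back a General Category for
any code point, whether or not a character is encoded there. The values
come from whatever Unicode version the Python build bundles.

```python
import unicodedata

def general_category(code_point: int) -> str:
    """Return the General Category alias for any code point (hypothetical helper)."""
    return unicodedata.category(chr(code_point))

# The sample code points are illustrative choices, not taken from the thread.
for cp, note in [
    (0x0041, "assigned letter"),
    (0xD800, "surrogate code point"),
    (0xE000, "private use"),
    (0xFFFF, "noncharacter, never assigned"),
]:
    print(f"U+{cp:04X} {general_category(cp)} ({note})")
```

The four samples should report Lu, Cs, Co and Cn respectively; the last
one is the interesting case, since the lookup still answers for a code
point that carries no character at all.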

Other properties accrue more directly to characters, per se.
They attach to the abstract character, and get associated with
a code point more indirectly by virtue of the encoding of that
character. The numeric value of a character would be a good example
of this. No one expects an unassigned code point or an assigned
dingbat character or a left bracket to have a numeric value property
(except perhaps a future generation of Unicabbalists).
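
By contrast, a numeric value lookup has to allow for "no value". A small
sketch, again assuming Python's unicodedata module and a made-up helper
name (numeric_value); the sample characters are my own picks:

```python
import unicodedata
from typing import Optional

def numeric_value(ch: str) -> Optional[float]:
    """Return the numeric value of ch, or None if it has none (hypothetical helper)."""
    # unicodedata.numeric accepts a default to return instead of raising ValueError.
    return unicodedata.numeric(ch, None)

# Digit seven, vulgar fraction one half, Roman numeral twelve,
# left bracket, a scissors dingbat, and the noncharacter U+FFFF.
for ch in ["7", "\u00bd", "\u216b", "[", "\u2702", "\uffff"]:
    print(f"U+{ord(ch):04X} -> {numeric_value(ch)}")
```

Only the first three come back with a number; the bracket, the dingbat
and the unassigned code point all yield None.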

> There are no corresponding features in other character sets usually.

Correct. Before the development of the Unicode Standard, character
encoding committees tended to leave property assignments
either up to implementations (considering them obvious) or up
to standardization committees whose charter was "character
processing" -- e.g. SC22/WG15 POSIX in the ISO context.

The development of a Universal character encoding necessitated
changing that, bringing character property development and
standardization under the same roof as character encoding.

Note that not everyone agrees about that, however. We are
still having some rather vigorous disagreements in SC22 about
who "owns" the problem of standardization of character properties.

> A common definition of "character set" is a list of characters
> you are interested in assigned to codepoints. That fits most
> legacy character sets pretty well, but Unicode is sooo much
> more than that.

Roughly the distinction I was drawing between "the Unicode CCS"
and "the Unicode Standard".

> But what if we took a look at it from a different point of view,
> that the standard is an agreed-upon set of rules and building
> blocks for text oriented algorithms? Would people start to
> publish algorithms that extend on the base data provided so
> we don't have to reinvent wheels all the time?

Well, the "Unicode Standard" isn't that, although it contains
both formal and informal algorithms for accomplishing various
tasks with text, and even more general "guidelines" for how to
do things.

The members of the Unicode Technical Committee are always
casting about for areas of Unicode implementation behavior
where commonly defined, public algorithms would be mutually
beneficial for everyone's implementations and would assist
general interoperability with Unicode data.

To date, it seems to me that the members, as well as other
participants in the larger effort of implementing the Unicode
Standard, have been rather generous in contributing time
and brainpower to this development of public algorithms. The
fact that ICU is an Open Source development effort is enormously
helpful in this regard.

> If I were to stand in front of a college comp sci class,
> where the future is all ahead of the students, what proportion
> of time would I want to invest in how much they knew about legacy
> encodings versus how much I could inspire them to build from and
> extend what Unicode provides them?

This problem, of Unicode in the computer science curriculum,
intrigues me -- and I don't think it has received enough attention
on this list.

One of my concerns is that even now it seems to me that CS
curricula not only don't teach enough about Unicode -- they basically
don't teach much about characters, or text handling, or anything
in the field of internationalization. It just isn't an area that
people get Ph.D.'s in or do research in, and it tends to get
overlooked in people's education until they go out, get a job
in industry and discover that in the *real* world of software
development, they have to learn about that stuff to make software
work in real products. (Just like they have to do a lot of
seat-of-the-pants learning about a lot of other topics: building,
maintaining, and bug-fixing for large, legacy systems; software
life cycle; large team cooperative development process;
backwards compatibility -- almost nothing is really built from
scratch.)

> The major work ahead is no longer in the context of building
> a character standard. Time is fast approaching to decide to keep
> it small and apply a bit of polish, or focus on the use and usage
> of what is already there in Unicode by those who have never
> considered it before.
> I think the former is necessary (the Standard is not finished)
> but to reach way out beyond the conceivable horizon again, the
> bulk of the effort should move towards the latter.

I think you may be underestimating the sheer bulk of the remaining
encoding task at hand -- to make Unicode meet its promise of
universality. But I share your sense that for most purposes,
for most people, the Unicode Standard is already a rich and open
opportunity, and that we should be focussing on how to roll
out better, more complete implementations that will go well
beyond people's expectations.


> Barry Caplan
