Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

From: Barry Caplan
Date: Fri Jul 12 2002 - 18:41:02 EDT

At 05:13 PM 7/12/2002 -0400, Suzanne M. Topping wrote:
>> -----Original Message-----
>> From: Barry Caplan
>> At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
>> >Unicode is a character set. Period.
>> Each character has numerous
>> properties in Unicode, whereas they generally don't in legacy
>> character sets.
>Each character, or some characters?

For all intents and purposes, each character. Section 4.5 of my Unicode 3.0 book says, "The Unicode Character Database on the CDROM defines a General Category for all Unicode characters."

So, each character has at least one attribute. One could easily say that each character also has an attribute for "isUpperCase" of either true or false, and so on.

There are no corresponding features in other character sets usually.
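To make the "properties" point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for illustration, and the Python library tracks a much later version of the Unicode Character Database than the 3.0 book, so treat the output as indicative only.

```python
import unicodedata

def describe(ch):
    """Collect a few per-character properties (hypothetical helper)."""
    return {
        "char": ch,
        # Falls back to a placeholder for characters without a name.
        "name": unicodedata.name(ch, "<unnamed>"),
        # The General Category, e.g. Lu = uppercase letter, Nd = decimal digit.
        "general_category": unicodedata.category(ch),
        # The "isUpperCase"-style boolean attribute mentioned above.
        "is_upper_case": ch.isupper(),
    }

for ch in ["A", "a", "5", "\u00e9", "\u4e2d"]:
    print(describe(ch))
```

A legacy code page gives you nothing comparable: you would have to bring your own tables for every one of these questions.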

>> Maybe Unicode is more of a shared set of rules that apply to
>> low level data structures surrounding text and its algorithms
>> than a character set.
>Sounds like the start of a philosophical debate.

Not really. I have been giving presentations for years, and I have seen many others give similar presentations. A common definition of "character set" is a list of characters you are interested in, each assigned to a codepoint. That fits most legacy character sets pretty well, but Unicode is sooo much more than that.
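As a rough illustration of that "list of characters assigned to codepoints" definition, the sketch below decodes the same byte under two legacy encodings using Python's built-in codecs. The variable names are arbitrary; the only thing each legacy table contributes is the mapping from a byte value to a character.

```python
# One byte, two legacy character sets, two different characters.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")  # ISO 8859-1 table
as_koi8r = raw.decode("koi8_r")    # KOI8-R (Cyrillic) table

# Each table only answers "which character sits at this position?"
print(repr(as_latin1), hex(ord(as_latin1)))
print(repr(as_koi8r), hex(ord(as_koi8r)))
```

Everything beyond that lookup (case, category, directionality) is where Unicode goes further than a plain character set.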

>If Unicode is described as a set of rules, we'll be in a world of hurt.

Yeah, one of the heaviest books I own is Unicode 3.0. I keep it on a low shelf so the book of rules describing Unicode doesn't fall on me for just that reason. This is earthquake country after all :)

>I choose to look at this stuff as the exceptions that make the rule.

I don't really know if it would be possible to break Unicode down into more fundamental units if you started over. Its complexity is inherent in the nature of the task. My own interest is more in getting things done with data and algorithms that use the type of material represented by the Unicode standard than in the arcana of the standard itself. So it doesn't bother me so much that there are exceptions. As long as the exceptions are ones that everyone agrees on, that is fine by me, because it means my data and at least some of my algorithms are likely to be preservable across systems.

>(On a serious note, these exceptions are exactly what make writing some
>sort of "is and isn't" FAQ pretty darned hard.

Be careful what you ask for :)

>I can't very well say
>that Unicode manipulates characters given certain historical/legacy
>conditions and under duress.

Why not? It is true.

But what if we took a look at it from a different point of view, that the standard is an agreed-upon set of rules and building blocks for text-oriented algorithms? Would people start to publish algorithms that extend the base data provided, so we don't have to reinvent wheels all the time?

I'm just brainstorming here, this is all just coming to me now.....

If I were to stand in front of a college comp sci class, where the future is all ahead of the students, what proportion of time would I want to invest in how much they knew about legacy encodings versus how much I could inspire them to build from and extend what Unicode provides them?

Seriously, most of the folks on this list that I know personally, and I include myself in this category, are approaching or past the halfway point in our careers. What would we want the folks who are just starting their careers now to know about Unicode and do with it by the time they reach the end of theirs, long after we have stopped working?

For many applications, people are not going to specialize in i18n/l10n issues. They need to know what the appropriate text-based building blocks are, and how they can expand on them while still building whatever they are working on.

Unicode at least hints at this with the bidi algorithm. Moving forward, should other algorithms be codified into Unicode, or as separate standards or de facto standards? I am thinking of a "Japanese word splitting algorithm". There are proprietary products that do this today with reasonable but not perfect results. Are they good enough that the rules can be encoded into a standard? If so, then someone would build an open implementation, and then there would always be this building block available for people to use.
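For anyone who hasn't looked at the data side of bidi: the algorithm is driven by a per-character bidirectional class stored in the character database. The sketch below only reads those classes through Python's standard `unicodedata` module; it does not implement the bidi algorithm itself, and `bidi_classes` is a made-up helper name.

```python
import unicodedata

def bidi_classes(text):
    """Pair each character with its bidirectional class (hypothetical helper)."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Latin letters, a space, two Hebrew letters, a space, two digits.
sample = "ab \u05d0\u05d1 12"
for ch, cls in bidi_classes(sample):
    # L = left-to-right, R = right-to-left, WS = whitespace, EN = European number
    print(repr(ch), cls)
```

The point is that the raw data is already published and shared; the reordering rules are layered on top of it.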

I am sure everyone on this list can think of their own favorite algorithms of this type, based on the part of Unicode that interests you the most. My point is that the raw information already in Unicode *does* suggest the next level of usage, and the repeated newbie questions that inspired this thread suggest the need for a comprehensive solution at a higher level than a character set provides. Maybe part of this means including or at least facilitating the description of low-level text handling algorithms.

>If I did, people would be scurrying around
>trying to figure out how to foment the duress.)

The accomplishments of the Unicode Consortium are really nothing less than spectacular. Period.

Soon, Unicode text handling at a raw byte level will be built into all important operating systems. This is probably the point at which early Unicoders may have predicted they would die and go to heaven if it ever happened. It was way out there beyond the conceivable horizon. Well, it is going to happen soon.

Then what?

What about the millions of people who are asking "What is this Unicode anyway?" They don't want to worry about the arcana; they want to understand how a feature of the OS affects their work, and how they can exploit it.

The major work ahead is no longer in the context of building a character standard. The time is fast approaching to decide whether to keep it small and apply a bit of polish, or to focus on the use of what is already there in Unicode by those who have never considered it before.

I think the former is necessary (the Standard is not finished) but to reach way out beyond the conceivable horizon again, the bulk of the effort should move towards the latter.

Barry Caplan
