More rambling about Han

From: Joel Rees (
Date: Wed Feb 21 2001 - 02:16:15 EST

Hi again.

The reason I fudge to March/April is that it doesn't do us any good if we
can't see what's in the Han section of extension B right now. (I just
checked this morning and the charts for the Han do not yet seem to be
available on the site.)

> . . .

> I dug it out of my own email archives, and append it below. I added
> Thomas Chan's second email address. You can take it up with him.

(Above and beyond the call of duty. Thanks.)

> . . .

Concerning where Kanji originated, there have been a number of archeological
finds in Japan that include what appear to be Kanji on items that predate
the historical influence of China, and may in fact predate the Han
characters in China. I don't know if any of this is on the web, but I have
read about it in newspapers like Kobe Shinbun or Nikkei Shinbun.

Newspapers here are no more accurate than in the USA, and archeological
dating techniques are known to have problems, but the standard
interpretation of history must always be taken with a measure of salt. Look
how much we have had to revise history in the USA, for all that there is
more than an order of magnitude less official messing with the official
version of facts there.

But that's beside the point. I would assume it really isn't in the UNICODE
Consortium's interest to try to determine who invented Kanji.

> . . .

A great many of the "hidden issues with Kanji" are just as you say, mystical
and ephemeral and not really subject to being stored in bits and bytes at
this point in time.

Here's a concrete example I am beginning to get a handle on: The average
Japanese will tell you that they do not write by radical. The only time they
even study the radicals per se is when they study calligraphy. (A fairly
large percentage of them do take up calligraphy, by the way).

But I watch them write. They may not realize it, but they do write and read
by radical. When they get stuck looking up a word the "easy" way (by
pronunciation) they don't hesitate to dig into the radical index. When they
need to specify over the telephone one character among several with the same
pronunciation, they name the radicals, generally in the order they write
them. It is the same thing as English speakers not recognizing how much they
depend on root words for spelling and cognition.

Or perhaps it is a mystical attitude toward computers, trying to figure out
why their most beautiful simplified shorthand script is too high a level for
the magic box to handle.

UNICODE does have all the radicals, and that is great. I still wonder why
the JIS committees did not bother to include at least the base
representation of all the radicals.

I admit that encoding the Kanji by radical does not seem to make sense, but
the present encoding by whole character hides certain character issues from
Japanese programmers. I picked up on this when I was trying to explain the
convenience of the ctype library to some co-workers at a previous company,
and then realized that the Japanese have simply never bothered to write a
ctype library that works with JIS and publish it. Too much trouble, and they
can't see the benefits. "You can't do that with Japanese" is the usual
response to any suggestions in that direction.

There are some companies that have something like the ctype library, but it
is viewed as not being necessary for ordinary programming use (and too
valuable for what they do use it for to release outside the company).
Contradiction on contradiction.
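
To make the ctype idea concrete, here is a minimal sketch of the kind of
character-class predicates I have in mind. It is only an illustration, under
some loud assumptions: it classifies by Unicode block ranges rather than by
JIS row/cell values, and the function names (is_hiragana, is_kanji, classify,
and so on) are made up for this example, not taken from any existing library.

```python
# Hypothetical ctype-style predicates for Japanese text.
# Assumption: classification by Unicode block ranges, not by JIS codes.

def is_hiragana(ch: str) -> bool:
    """True if ch falls in the Hiragana block (U+3040..U+309F)."""
    return 0x3040 <= ord(ch) <= 0x309F

def is_katakana(ch: str) -> bool:
    """True if ch falls in the Katakana block (U+30A0..U+30FF)."""
    return 0x30A0 <= ord(ch) <= 0x30FF

def is_kana(ch: str) -> bool:
    """True for either kana syllabary."""
    return is_hiragana(ch) or is_katakana(ch)

def is_kanji(ch: str) -> bool:
    """True for the unified Han blocks, including extensions A and B."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF          # CJK Unified Ideographs
            or 0x3400 <= cp <= 0x4DBF       # Extension A
            or 0x20000 <= cp <= 0x2A6DF)    # Extension B

def classify(text: str) -> list:
    """Label each character as 'hiragana', 'katakana', 'kanji' or 'other'."""
    labels = []
    for ch in text:
        if is_hiragana(ch):
            labels.append("hiragana")
        elif is_katakana(ch):
            labels.append("katakana")
        elif is_kanji(ch):
            labels.append("kanji")
        else:
            labels.append("other")
    return labels
```

Something like classify("\u6f22\u3042\u30aba") would label a Han character,
a hiragana, a katakana and a Latin letter in turn. A JIS-based version would
need the same predicates expressed over JIS code values instead.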

> . . .

> The Unicode Standard is *not* intended to put historians of Han characters
> out of business. It is not the ultimate, final catalog. It does not attempt
> to resolve all the scholastic questions that will continue to be of
> interest. Heck, Richard S. Cook recently wrote a 250 page monograph on
> The Etymology of Chinese Chen2 (the scorpion character). He lists 208
> bone exempla and 35 bronze exempla, and tracks the whole set of related
> forms through Shuowen and other documents.

(There's the guy who clued me in on extension B. Thanks, Richard, if you're
monitoring the list today.)

> But for global information interchange on computers, *somebody* had
> to put a stake in the ground for Han characters. The alternative was
> a dozen different stakes being moved by different committees from
> different points of view and in different directions. It already was
> chaotic, and the needs of the Internet are slowly pushing that kind
> of chaos aside, in favor of (relatively) simple, interoperable standards.

Yes, we need some basis for international communication. But five good fonts
with 90K+ characters each, and a lot of convoluted rendering and input rules
is more than seems reasonable to put in every desktop. Well, I suppose that
100G HDs and quad GHz processors are going to be standard in four more
years. What do you do with the keyboard? Oh, never mind, we can just use a
low-to-medium resolution LCD touch panel that changes the keycaps according
to whatever we want to type at the moment. I guess I am probably not being
sarcastic after all. Yikes.

Anyway, what I would have wished for, if you will permit a little
fantasizing, was about 8K (maybe 16K) characters in the common font, just
enough to produce something readable for ordinary business in each language.
Japanese, for instance, would have included only the bare kana, dakuten,
base radicals, and the education characters, for a little less than 1,500
total. No fancy rendering or direction rules, just enough to get each
included character on the screen in some recognizable form.

Characters outside the common set would have been assigned code points by
simple additive translation of existing standards into a 32 bit (or larger)
code space. Each country or other organization registering a code set would
have been assigned their own sub-space(s) in the international code space,
and would have been primarily responsible for tables or rules to transform
to the common set or to other sets. Ditto most of the really difficult
rendering issues.
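
A rough sketch of the "simple additive translation" I am fantasizing about
might look like the following. Everything here is hypothetical: the sub-space
bases, the 64K sub-space size, and the names SUBSPACE_BASE, to_international
and from_international are invented for illustration and do not correspond to
any actual standard or registry.

```python
# Hypothetical registry: each registered legacy code set gets its own
# sub-space in a 32 bit international code space. Bases are made up.
SUBSPACE_SIZE = 0x10000  # assumption: 64K code points per sub-space

SUBSPACE_BASE = {
    "JIS_X_0208": 0x00010000,
    "GB_2312":    0x00020000,
    "KS_C_5601":  0x00030000,
}

def to_international(code_set: str, legacy_code: int) -> int:
    """Additive translation: international code = sub-space base + legacy code."""
    if not 0 <= legacy_code < SUBSPACE_SIZE:
        raise ValueError("legacy code does not fit in the sub-space")
    return SUBSPACE_BASE[code_set] + legacy_code

def from_international(code_point: int) -> tuple:
    """Recover (code set name, legacy code) from an international code point."""
    for name, base in SUBSPACE_BASE.items():
        if base <= code_point < base + SUBSPACE_SIZE:
            return name, code_point - base
    raise ValueError("code point is not in any registered sub-space")
```

With these made-up bases, to_international("JIS_X_0208", 0x3441) gives
0x00013441, and from_international(0x00013441) gives back the pair
("JIS_X_0208", 0x3441). The tables for mapping into the common set would be a
separate matter, left to whoever registered the code set.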

(Did I hear that the ISO group originally planned for something like this,
but hadn't planned on a common set?)

This sort of idea requires that the standard include some way of inserting
rendering information for codes outside the common set. But since the bulk
of most transmissions would consist of characters from the common set, the
burden of putting the rendering data in with text would not be so great.
Rendering data would have been inserted at the top of a file, to keep it out
of the way of the text data. Since the 32 bit code point would also be part
of the text data, a computer containing its own fonts for a specific
language would be free to substitute, especially in plain text.

Tying this fantasy into my comments on radicals, the presence of the broader
list of fully rendered characters seems to me to discourage attempts at
encoding characters as lists of radicals. I note that UNICODE 3.1 contains
the ideographic description characters, not enough for rendition of course,
but apparently enough for search purposes and for doing _something_ with
those characters that get invented for various technical purposes each year.
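
As an illustration of the "enough for search purposes" part, here is a small
sketch of searching by component using ideographic description sequences. The
ideographic description characters occupy U+2FF0 through U+2FFB in this
version of the standard. The four-entry table below is hand-written for the
example, not taken from any published database, and the names IDS_TABLE,
components and find_by_component are hypothetical.

```python
# Ideographic description characters (IDCs) as of Unicode 3.x.
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB

# Hand-written sample data (an assumption, for illustration only):
# each character maps to a description sequence of one IDC plus components.
IDS_TABLE = {
    "\u660e": "\u2ff0\u65e5\u6708",  # bright = left-right(sun, moon)
    "\u597d": "\u2ff0\u5973\u5b50",  # like   = left-right(woman, child)
    "\u5b57": "\u2ff1\u5b80\u5b50",  # letter = top-bottom(roof, child)
    "\u6797": "\u2ff0\u6728\u6728",  # grove  = left-right(tree, tree)
}

def is_idc(ch: str) -> bool:
    """True if ch is one of the ideographic description characters."""
    return IDC_FIRST <= ord(ch) <= IDC_LAST

def components(ids: str) -> list:
    """Strip the layout operators, leaving only the component characters."""
    return [ch for ch in ids if not is_idc(ch)]

def find_by_component(component: str) -> list:
    """Return the characters whose description contains the component."""
    return sorted(ch for ch, ids in IDS_TABLE.items()
                  if component in components(ids))
```

With this sample table, find_by_component("\u5b50") (the "child" component)
returns the two characters U+597D and U+5B57. Nothing here says how to draw
the result, which is the point: the sequences describe, they do not render.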

> . . .

I'd better quit fantasizing in public and get back to work. Thanks again.

Joel Rees, Media Fusion KK
Amagasaki, Japan
