Re: Unicode, Cure-all or Kill-all?

From: Timothy Huang (timd_huang@mail.formac.com.tw)
Date: Mon Aug 12 1996 - 11:00:00 EDT


Dear Martin,

>Well, there are scholars that deny that Chinese characters have meaning,
>and they do this very fervently. But I don't agree with them, and one
>has not to go that far to see that using meaning as a principal guide
>to character encoding is doomed to fail. Whatever some scholars may
>say, if we don't get a solution that is practically usable, we shouldn't
>consider it. And distinguishing one and the same shape with different
>codepoints because that shape can have different meanings is absolutely
>not usable in practice. If you, as an expert very interested in character
>coding, have your problems with multiple codepoints for the same
>shape, how could an arbitrary user not have these problems? On the
>other hand, if I send somebody an email with the correct separate
>encoding of the character(s) for Taiwan, Typhoon, and Sir, what will
>the recipient gain? Will (s)he ever notice?
>Furtheron, there are many characters that have one and the same
>historic origin, but several meanings, now or throughout history.
>For example, should the character "to come" get a second code-
>point for its original historic meaning? And what with all the other
>characters that have a wide field of uses and meanings? How to
>divide this field of meaning into reasonable patches?
>Yet another example, what if somebody writes a text where the
>fact that the character can mean Taiwan, Typhoon, as well as Sir,
>in an artistic way? With the CCCII system, one would have to
>decide for one or the other meaning, and the pun would be
>lost.
>Other than an overestimation of "meaning" from an ivory-tower,
>I cannot understand the decision of CCCII scholars to choose
>such an unpractical solution.

Well, I see your point, and it is a good one. However, I was not
involved in the CCCII process, and I think the "gang of four" professors
had their reasons for adopting the scholars' opinion. Maybe this is one
area that CCCII should review. I wonder whether EACC "corrected" this or
not. I don't have any figures on how many such cases there are in CCCII.
Besides that, I think there must be some duplicates in there too. After
all, they are only human. And that's why field tests are so important.
Criticism and even ridicule are good for the health. We learn and grow
by knowing where we made mistakes or inappropriate choices. Like any
language, Chinese is evolving too. How much historical and linguistic
value should be applied to a given coding may vary significantly from
scholar to scholar. Mao and some (a lot of) scholars of his time thought
that Chinese characters were a burden and a hindrance to national
progress. After World War II, some Japanese scholars had very similar
thoughts too. They said that Kana should be enough and wanted to
eliminate the ideographs completely. However, I don't see that the
movement went very far. That JIS has more and more Kanji is proof of the
opposite. Ha.

I just found out: it was done for the sake of proper traditional Chinese
character usage. This is similar to the difference between 14th or 15th
century English and modern English. "How great thou art?" The CCAG just
wanted CCCII to be able to handle the classical usages. Modern users can
use whichever one they like.

>One has to be very careful with such numbers. Most of the "new"
>characters that turn up are simple mistakes.

Well, I did the calculation by using the number of characters from
Tz4Hei4 (of the Ming dynasty) and the number published by CCAG in 1989.
Now, on what you called "simple mistakes", let me use one example: a
character in the name of a friend of mine was registered wrongly when he
came to Taiwan in 1949. It should be U+5857 (three dot water radical).
However, the officer who registered him wrote this character with two
dots, and so created a "new" character. His father did not realize it at
the time, and when he found out and wanted to change it back, it was too
late. All his legal documents already used that "wrong" character. So he
was "dead wrong" to the last day of his life, and the Chinese language
got a new character. Can we remove or eliminate such a "simple mistake"?
If we do, the legal problems may not be that easy to solve. Furthermore,
during the Tang dynasty, in the period of the first empress, Wu,
Je-Tein, 14 "new" characters were created just to please her. In the
well over 1,000 years since then, do you know how many people have
wanted to eliminate these flattery characters? Did they succeed? No, not
at all. Without these fourteen 'terrible' characters, archaeologists and
historians would not be able to tell in what year a scroll newly
unearthed from a (Tang dynasty) grave was written. They became a vital
tool for identification. Unless the Chinese decide to cut themselves off
from their history completely, I don't see any way to ignore these
"simple mistakes". I personally don't like it either, but what can I do?
Any practical suggestions?

>With UTF-16, Unicode has a codespace of about 1000000 codepoints.
>That's enough for at least the next 500 years.

Questions: 1) When? 2) Is that ISO 4? 3) Is it 256 x 256 x 256 x 256? I
remember that not too long ago, some heavyweights in the microcomputer
industry said that nobody would ever need more than 640K of memory, and
that 2 bytes would be enough for the ideographs. Where are their voices
now? We all learn from our mistakes, don't we?
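
On question 3, the arithmetic can be sketched as follows. This is only
an illustration in Python; the constant and function names are
hypothetical and chosen for this note. It assumes the UTF-16 surrogate
ranges D800-DBFF and DC00-DFFF, and shows that the space reachable
through surrogate pairs is 1,024 x 1,024 code points on top of the
65,536 of the BMP, which is where the "about 1000000" figure comes from,
and that it is far smaller than a full 256 x 256 x 256 x 256 space.

```python
# Sizes of the code spaces under discussion (illustrative names only).
BMP_SIZE = 256 * 256                 # 65,536 two-byte values
HIGH_SURROGATES = 0xDC00 - 0xD800    # 1,024 values, D800..DBFF
LOW_SURROGATES = 0xE000 - 0xDC00     # 1,024 values, DC00..DFFF

SUPPLEMENTARY = HIGH_SURROGATES * LOW_SURROGATES  # reachable via pairs
UTF16_TOTAL = BMP_SIZE + SUPPLEMENTARY            # 0x110000 code points
FOUR_BYTE_SPACE = 256 ** 4                        # 256 x 256 x 256 x 256


def to_surrogates(code_point):
    """Split a code point above the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


if __name__ == "__main__":
    print("supplementary code points:", SUPPLEMENTARY)
    print("UTF-16 code space:", UTF16_TOTAL)
    print("four-byte space:", FOUR_BYTE_SPACE)
    high, low = to_surrogates(0x20000)
    print("U+20000 ->", hex(high), hex(low))
```

The code space total above still contains the 2,048 surrogate values
themselves, which cannot be assigned to characters, so the number of
usable positions is slightly smaller.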

>I don't know much about Taiwanese names, but in Japan, it usually
>turns out that most of these "missing characters" are character variants
>that somebody wants to see as a character of its own based on a lack
>of understanding of character history, typography, and caligraphy.

Even so, shouldn't such a person be respected and allowed the right to
do so (such a stupid thing)? Suppose, and it is a big suppose, that we
just don't like someone's name, say Jon: can we force him to change it
to John? I am not sure whether America has a law that allows that. Or
can we go through a national public vote to force him to change it?

>If new characters get created, a mechanism to deal with them before
>they are allocated official codepoints is necessary in any way. The
>problem currently is that such a mechanism is not well established
>or standardized; the easiest case currently is HTML, where you can
>use an inline GIF. Anyway, such mechanisms are needed, but if they
>are used extremely rarely because the basic set covers almost all
>cases, nobody will really be interested to develop such mechanisms
>and implement them. So, strange as it may sound, not having too
>large a basic set can actually help to have mechanisms that allow
>to include even very very rare characters easily into documents.

Good point. When will this be clearly spelled out?

>Chess and Japanese Emperors together are 17 codepoints. This is really
>a marginal number. Your set of 70000+ characters won't fit in the BMP
>anyway, and will fit without problems in the UTF-16 area, so there
>is no problem for you.

Actually, there are much, much more than that. So many glyphs were coded
as characters. Examples: 1/4, 1/2, 3/4, all sups and subs (2070 ~ 209F),
diacritics (20D0 ~ 20FF), ..., CJK Squared Words (3300 ~ 337F), ...
(I'm tired of counting them now). Do you know how many "characters"
there are in Unicode version 1.0 with the meaning "one"? The total
number of such characters is not as small as you said. They took up a
significant portion of the precious coding space.
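
One way to count the characters for "one" can be sketched as follows.
This is a rough illustration only: it uses the character database that
ships with a Python interpreter, which is a much later Unicode version
than 1.0, so the count it prints is not the Unicode 1.0 figure. The
function name chars_with_value is hypothetical.

```python
import sys
import unicodedata


def chars_with_value(target):
    """Return every character whose Unicode numeric value equals target."""
    found = []
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        # numeric() returns the default (None here) for non-numeric characters.
        if unicodedata.numeric(ch, None) == target:
            found.append(ch)
    return found


ones = chars_with_value(1)

if __name__ == "__main__":
    print(len(ones), "characters have the numeric value 1 in Unicode",
          unicodedata.unidata_version)
    # Show the first few, with their code points and names.
    for ch in ones[:10]:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

The list includes the plain digit, the superscript and subscript forms,
circled and parenthesized forms, Roman numerals and so on, which is the
kind of repetition I mean.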

Well, it's getting very late, and I have to go to bed now. However, I
appreciate having the chance to discuss some of my thoughts with you.

Smiles, tomorrow will be better.
Timothy Huang


