Re: Unicode, Cure-all or Kill-all?

From: Timothy Huang (timd_huang@mail.formac.com.tw)
Date: Mon Aug 12 1996 - 06:27:23 EDT


Dear Martin,

Thanks for your three very informative letters. Your point on multiple
codepoints is well taken. Sometimes I have this problem myself.
However, according to the scholars involved in the CCCII, these
characters should be coded separately, because they are different
characters. I think this is a key point in the definition of a
character: should the definition be shape-based or meaning-based? I
personally don't know, but I agree with these experts. Furthermore, if
this is the nature of the Chinese language, what else can we do except
accept it as a fact? On the other hand, every language is a living
organism and hence changes all the time. This may have changed already.
Again, I don't know.

Regarding the "new character" issue -- (1) New Chinese characters are
being generated all the time. Unlike alphabetic languages, the Chinese
character set is an open set. On average, since the Ming dynasty, a new
character has appeared about every three days. (2) Now, if the coding
space is closed (16-bit) or limited (such as Big-5), then accommodating
them will be difficult or impossible. (3) And if the number of
characters coded is small (as in GB, Big-5, or Unicode), then users are
forced to use the private zone heavily. That defeats the very purpose
of information interchange, because each user will have different code
assignments -- information cannot be interchanged. I don't consider
myself any different from the average user, but from my experience with
the Big-5 character set, I have to make about 2~3 'new' characters per
month, especially when maintaining a customer mailing address database.
For example, in my little customer database, containing only about 400
people, I am short about 10 characters for their names and addresses --
and I can NOT use any substitute characters for these. In English
computing you are very lucky; you usually don't have this problem. But
here it is a very annoying problem. Whenever I send an article to my
publisher, I usually have to send a "new character" file along;
otherwise, the article cannot be printed correctly.
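[A minimal sketch of the interchange problem described above -- my own
illustration, not from the original letter. It shows that a Unicode
private-use codepoint carries no standardized identity, so a sender and
a receiver can legitimately map the same code to different characters;
the mapping names below are hypothetical.]

```python
import unicodedata

pua_char = "\uE000"  # first codepoint of Unicode's Private Use Area

# The standard assigns PUA codepoints the category "Co" (private use)
# and gives them no character name in the Unicode Character Database.
print(unicodedata.category(pua_char))           # "Co"
print(unicodedata.name(pua_char, "<no name>"))  # "<no name>"

# Sender and receiver each keep their own private mapping for U+E000 --
# nothing in the interchanged text says which character was meant.
sender_map = {pua_char: "rare surname character used by customer A"}
receiver_map = {pua_char: "rare place-name character used locally"}
print(sender_map[pua_char] == receiver_map[pua_char])  # False
```

This is exactly why a "new character" file must travel with the text:
the codepoint alone identifies nothing outside the sender's system.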

Well, about MaJong: do you know that in the past it was considered an
evil gambling instrument and was banned by both Chinese governments? My
childhood neighbor was taken away by the police. My point is not MaJong
itself, or why chess was included. The same goes for the emperors'
names. These are not the main issues, just examples. The key question
is: are they "characters"? If not, why do they take up so much precious
coding space? If yes, then what is the definition of a character?
Unicode has a relatively good definition of a character, but during
implementation this principle was not held rigidly, and too many
non-characters slipped in. That creates confusion.

This brings up my personal view on the ideographic coding issue. I
think EACC/CCCII is already a better solution than the ongoing
Unicode/ISO 10646. EACC/CCCII has been field-tested for more than ten
years; (current) Unicode has not, and the new expanded Unicode is not
born yet. Why re-invent the wheel? Why couldn't we use that sound and
solid foundation and improve on it? Please note, I am not saying that
EACC/CCCII is perfect, a cure-all, etc. -- I am not that stupid yet. It
does not and cannot totally eliminate the private zone problem.
However, with such a big collection of characters, a user rarely needs
to use the private zone, except for truly new characters (such as new
chemical elements), and thus information interchange is enhanced with
fewer problems.

Smiles,
Timothy Huang



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT