Glen.Perkins@NativeGuide.com (Glen C. Perkins) wrote:
The minority of people who really *do* understand all of this (such as
Handa-san), but still object on the basis of a few features or decisions,
as you are saying, need to be dealt with from the other direction, I think:
what alternative would be more acceptable overall (not on one specific
point, but overall)?
We already have a framework of ISO 2022.
There aren't enough code points to include Chinese, Japanese, and Korean
without either: CJK unification; not unifying, but drastically reducing the
number of chars allotted to each language; not unifying, not reducing the
number of chars per language, but expanding beyond two bytes/char;
From the start, 16 bits have been too few for Han characters.
multiple, different encodings (mixture of national single and double byte
??? ISO-2022 can provide a single encoding method for a mixture of various
national standard character sets.
In reverse order: multiple encodings requires markup because it will be
completely unreadable without marking the switch from one encoding to
I don't understand what you are worrying about. ISO-2022 can
effectively handle multiple character sets. When a program reads a
multilingual text, it can attach some tag bits to each character code to
identify its character set. This is exactly what Mule (Multilingual
Enhancement to GNU Emacs) does.
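
As a rough illustration of the tagging idea (not Mule's actual code), the sketch below scans a 7-bit ISO-2022 byte stream and labels each character with the character set that was designated when it was read. The function name tag_stream and the table name DESIGNATIONS are made up for this example, and only the G0 designations used by ISO-2022-JP-2 are handled; shift functions, G2 sets, and line-end resets are ignored.

```python
ESC = 0x1B

# G0 designation escape sequences used by ISO-2022-JP-2 (7-bit subset only).
# Each maps to (charset label, bytes per character).
DESIGNATIONS = {
    b"\x1b(B": ("ASCII", 1),
    b"\x1b(J": ("JIS X 0201 Roman", 1),
    b"\x1b$@": ("JIS C 6226-1978", 2),
    b"\x1b$B": ("JIS X 0208", 2),
    b"\x1b$A": ("GB 2312", 2),
    b"\x1b$(C": ("KS C 5601", 2),
    b"\x1b$(D": ("JIS X 0212", 2),
}


def tag_stream(data: bytes):
    """Return a list of (charset, code_bytes) pairs, one per character.

    Hypothetical helper: it assumes well-formed input and does not
    handle SO/SI, single shifts, or control characters in 2-byte mode.
    """
    charset, width = "ASCII", 1  # ISO-2022-JP style streams start in ASCII
    tagged = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for seq, (name, w) in DESIGNATIONS.items():
                if data.startswith(seq, i):
                    charset, width = name, w
                    i += len(seq)
                    break
            else:
                raise ValueError(f"unsupported escape sequence at offset {i}")
            continue
        if i + width > len(data):
            raise ValueError("truncated multibyte character")
        tagged.append((charset, data[i:i + width]))
        i += width
    return tagged


# One Japanese character (0x467C in JIS X 0208), one Korean character
# (0x4751 in KS C 5601), then a return to ASCII.
sample = b"\x1b$BF|\x1b$(CGQ\x1b(B!"
tagged_sample = tag_stream(sample)
```

Once every character carries its own charset label, the switch from one national set to another no longer has to be guessed by the reader.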
It then requires duplicate fonts and/or tables for mapping parts
of various encodings to parts of various fonts. The complexity usually
results in systems being effectively monolingual, or at least monoscript.
What we need is to provide an appropriate font for each character set.
Even in Unicode, we need multiple fonts (at least for Japanese and
Chinese as far as I know).
If I were to send Handa-san a message asking, "what does X mean?" where X
was the Korean 'chik' char (the 'choku/zhi' char we've been discussing)
encoded in KSC entered via my Korean input system, then it would come up as
some kanji on his screen when interpreted as if it were shift-JIS (or EUC
or whatever), but heaven only knows which one! National standards are much
poorer at handling this example than unicode is.
This problem never happens if we use ISO-2022-KR (for Korean character
sets), ISO-2022-JP (for Japanese character sets), and ISO-2022-CN
(for Chinese character sets), or a mixture of them for multilingual text.
Actually, all Mule users are exchanging multilingual e-mail without
any difficulty (ISO-2022-CN is not yet supported by Mule because the
standard was decided after the release of the latest version of Mule).
So, what we really need is ISO-2022-INT.
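
The difference between a bare national encoding and an ISO-2022 stream can be tried with Python's standard codecs. This is only a sketch: it uses EUC-KR as a stand-in for "KSC as sent by a Korean input system" and EUC-JP as a stand-in for the receiver's default, which are assumptions made for the example rather than anything stated in this thread.

```python
TEXT = "直"  # the 'chik/choku/zhi' character under discussion


def bare_roundtrip(text: str):
    """Encode with a bare Korean encoding, then decode two ways."""
    raw = text.encode("euc_kr")
    # The bytes carry no indication of which national set they belong to.
    read_as_korean = raw.decode("euc_kr")
    # A Japanese reader guessing EUC-JP gets something else (or garbage).
    read_as_japanese = raw.decode("euc_jp", errors="replace")
    return read_as_korean, read_as_japanese


def iso2022_roundtrip(text: str):
    """Encode with ISO-2022-KR and ISO-2022-JP, then decode each stream."""
    kr_stream = text.encode("iso2022_kr")
    jp_stream = text.encode("iso2022_jp")
    # Each stream announces its character set with an escape sequence,
    # so a decoder does not have to guess.
    return (kr_stream, kr_stream.decode("iso2022_kr"),
            jp_stream, jp_stream.decode("iso2022_jp"))


as_korean, as_japanese = bare_roundtrip(TEXT)
kr_stream, kr_text, jp_stream, jp_text = iso2022_roundtrip(TEXT)
```

Here as_japanese differs from the character that was sent, while both ISO-2022 streams decode back to it because the designation travels with the text.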
Going beyond two bytes per char can be considered just another form of
markup, really, with additional information attached to each char rather
than to a sequence of chars. A third byte, for example, could be used to
expand the number of code points
Yes, this is what I wrote above.
Most people would object far more
strongly to increasing the number of bytes for every character for every
You should realize that most people don't need a true multilingual
environment, and the market for such software is also very small for the
moment. So it's not surprising that, for most people, it's more
important to keep the burden small than to truly solve the
I don't claim that Unicode is useless for localized software.
Actually, by using Unicode in Japanese localized software, we get many
more characters than just the combination of JIS X 0208 and JIS X 0212.
I only oppose those people who insist on using Unicode for
internationalized software or multilingual text especially in CJK
The only passable answer I've heard to this question is, "well, keep it to
two bytes, keep the CJK unification, but don't unify it quite so much.
Separate the chars (like 'zhi/choku/chik') which aren't 'correct.'"
I don't know enough to say that this is totally a bad idea. What I can say
is that a large percentage of simplified Chinese chars are likely to be
considered "wrong" by Handa-san's definition because they don't adhere to
the standard form of the traditional radicals, so they wouldn't pass his
"schoolboy" test. I think there are too many characters in this category to
disunify them all.
Too many for what? A 2-byte code? Why should we start from a 2-byte code?
It's neither impossible nor hard to handle 3-byte or 4-byte codes.
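
As a minimal sketch of what a 3-byte code could look like, the example below packs a one-byte character-set identifier together with a two-byte national code into a single integer. The identifier values and the names CHARSET_IDS, pack, and unpack are hypothetical and are not taken from Mule or from any standard.

```python
# Hypothetical charset identifiers; NOT Mule's actual internal values.
CHARSET_IDS = {"JIS X 0208": 1, "GB 2312": 2, "KS C 5601": 3}
ID_TO_CHARSET = {ident: name for name, ident in CHARSET_IDS.items()}


def pack(charset: str, code: bytes) -> int:
    """Combine a charset tag and a 2-byte code into one 24-bit value."""
    if len(code) != 2:
        raise ValueError("expected a 2-byte code")
    return (CHARSET_IDS[charset] << 16) | (code[0] << 8) | code[1]


def unpack(value: int):
    """Split a 24-bit value back into (charset, 2-byte code)."""
    charset = ID_TO_CHARSET[value >> 16]
    code = bytes([(value >> 8) & 0xFF, value & 0xFF])
    return charset, code


# The same two bytes stay distinct once the charset tag is attached.
japanese = pack("JIS X 0208", b"F|")
korean = pack("KS C 5601", b"F|")
```

With the tag in the top byte, each national set keeps its full 2-byte space and no characters have to be unified to make room.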
In this spirit, I would ask Handa-san and any other critic what *overall*
solution you would support more than the current *overall* solution.
I'm just claiming that the current one (Unicode) doesn't offer an
*overall* solution for a multilingual environment, and we should not
pretend that it does. And I believe that a good solution for
multilingual text handling is ISO-2022-INT or something similar.
(What if we changed the term from "CJK Unification" to "CJK Extension" and
told each country that it was an "extension" of their national standard.
;-) That's all it would probably take for some of the people I've talked
I do agree with your suggestion because then it becomes clear that
Unicode is only for localization, and no one would dream of using Unicode
--- Ken'ichi HANDA firstname.lastname@example.org