Re: Unicode CJK Language Myth

From: Kenichi Handa (handa@etl.go.jp)
Date: Fri May 31 1996 - 21:48:48 EDT


mduerst@ifi.unizh.ch writes:
> Ken'ichi Handa writes:
>> Why should I write again and again that the difference is beyond what
>> is allowed as font variations?

> This is not true. JIS 208, the basic Japanese standard, for example,
> does not disallow such a font variation in any way. The new edition,
> JIS X 0208-1996 (or more probably 1997), is even more specific
> about this.
> This is good because we don't want to artificially limit the creativity
> of font designers.

I hope my English is good enough to interpret the difficult Japanese
sentences used in the draft of JISX0208-1996.

First, in my sentence "the difference is beyond what is allowed as
font variations", the subject of "allow" is Japanese people, not
JISX0208. I know JISX0208 places no criterion on font variations.

Also, I don't like JISX0208 either, because of its excessive
unification, done mainly to cover or correct the ambiguity of the
previous version.

Have you read the "generalization criterion" ("housetsu kijun" in
Japanese) in Section 5.6.2 and Rationale 2.7 of the draft of JISX
0208-1996? Rationale 2.7 is especially important for understanding
the meaning of the "generalization criterion".

It is stated that the generalization criterion can be used to identify
which code point of JISX0208 an EXISTING Kanji character glyph (not
font/shape/style) corresponds to, and that it should not be used to
admit or create a NON-EXISTING glyph by using the criterion as a
deduction rule. Here I used (and have used in previous mails) the word
"glyph" as the translation of the Japanese word "jitai", a term which
is defined in JISX0208 as follows:
        JITAI: Abstract concept about a shape of graphic
        representation of graphic character.
"Jitai" is different from "shape" (the translation of the Japanese
word "jikei"), which is defined as follows:
        JIKEI: Concrete representation of JITAI as hand-written, printed,
        or displayed character.

Perhaps I should have used "abstract shape" (Unicode term) instead of
"glyph".

JISX0208 defines which "jitai (glyph)", together with its allowable
variants (by the criterion), of existing Kanji characters can be
regarded as the graphic character of a certain code point of the
standard. But it doesn't define which "jikei (shape)" should be used
for a "jitai". In addition, it doesn't define which variant is the
correct "jitai" of an existing character.

A glyph itself is an abstract concept and exists only in people's
minds, so we have to use common sense to get the glyph of a character
from the shapes which we actually see. Since JISX0208 allows wider
variants than what our (Japanese) common sense identifies as a
character glyph, it lists such variants in the standard.

Going back to the case of `choku', we can say that JISX0208 does not
contain a character which has the Chinese glyph because:
1) Such a character does not exist in Japan (i.e. it is not used as a
variant of the Japanese glyph `choku' in Japan), at least as far as I
know.
2) Even if the glyph has ever existed as a Japanese one somewhere in
Japan, the glyph can't be deduced from the base glyph for Japanese
`choku' by the generalization criterion.
3) Our common sense doesn't identify the Chinese glyph as a variant.

The Unicode Standard Vol.2 refers to "treatment in a source character
set" as one of the "character features". The treatments of "choku" in
GB2312 and JISX0208 are different.

>> I have no idea why mine can't be recognized as a valid objection.

> The problem is that your objection is not based on actual in-use
> observation, but only on theoretical what-would-happen-if
> argumentation,

Since I've not yet seen Unicode being used in any multilingual
software, all I can do is to infer what will happen if ... What's the
problem? I'm not assuming any unusual situation, though one may say
that a real necessity for multilingualization is itself very rare.

> and that you have seen the two variants side-by-side
> in a standard before seeing them in actual use.

I don't understand the intention of the above sentence. Do you think
we don't notice the difference between those variants? We can easily
notice it. Noticing a difference between shapes is one thing;
deciding whether they are variants of the same character is another.

> Assume a newspaper
> would print the "wrong" glyph variant in one of their articles. How
> many people would recognize the difference? How many people would
> have difficulties understanding the text? How many people would bother
> writing a letter to the editor, or mentioning that character to a
> friend? I am sure all these figures would be very low,

I agree with that, because most people will just think it's a simple
error happening occasionally, like many other typos. But if the
"wrong" glyph is always used, I'm sure someone will warn the editor.

>> And, no Japanese character set contains a character which allows
>> Chinese `choku' variant. In this sense, a character which allows
>> Chinese `choku' variant is different from the Japanese character which
>> doesn't allow the variant.

> This is not true. See above.

This is true if you read "allows Chinese `choku' variant" as "allows
Chinese `choku' glyph as a variant of Japanese `choku' glyph of
JISX0208".

Of course, a font of JISX0208 which has a shape like Chinese `choku'
shape does conform to JISX0208, in which case, we should regard that
the SHAPE is just a concrete graphic representation of Japanese
`choku' GLYPH.

>> You are saying something like that since ASCII does not contain some
>> greek character, ASCII does not distinguish the character `a' from the
>> greek character.
>>
> And while for Greek characters, the Greeks definitely think that
> an 'a' and an 'alpha' are different,

Here, the point is not whether Greeks know the difference or not, nor
whether the difference really exists or not. I objected to the idea
that a character set does not distinguish a character not contained
in it from the characters contained in it.

> for the character "choku" the
> Chinese think that both variants are the same character, and the
> Japanese would do so too if they knew the Chinese variant.

There's a difference between "Japanese know that Chinese people use
a different variant" and "Japanese think that they had better allow
the variant for the Japanese character".

k> It's nonsense to say that some character set does or does not
k> distinguish two characters if one of them is not included in the
k> set. If we dare to say something, a character set distinguishes
k> characters contained in the set from all characters not contained
k> in the set.

> There is no way to see the Chinese variant as a different character,
> only as a variant that might be more or less known, and more
> or less accepted in some typographic situations.

The above sentence is not a logically valid objection to my paragraph
(indented by "k>"). In addition, your claim is not based on "actual
in-use observation". I've never seen the Chinese variant of `choku'
used in a Japanese text in any situation. This is my "actual in-use
observation".

>> Very simple. Just use two character sets, Japanese JISX0208 and Chinese
>> GB2312 (or/and CNS11643), concurrently. There is no incompatibility
>> as long as we use internationalized encoding methods (ISO-2022-INT and
>> X's Compound Text are the examples) and internationalized internal
>> character representation (Mule's method and X.V11R5's Xsi method are
>> the examples).

> The main problem here is that this makes it difficult to find the same
> character in texts in different languages. That is an important aspect
> of multilingual computing,

What should be regarded as the same character depends on the
situation. I have no idea what Unicode can contribute to solving this
problem.

> but is highly impossible with a proliferation
> of national character sets. Implementers in multilingual information
> retrieval are very happy users of Unicode.

If the criterion for identifying characters concerns only the glyph,
Unicode also has a difficulty. If the criterion concerns not the glyph
but the meaning of ideographic characters, Unicode also has a
difficulty. Even Chinese people will want to search for some character
while ignoring the difference between simplified and traditional
glyphs.

How are implementers happy with Unicode on this problem? What
criterion do you have in your mind?

I believe there's no way to solve this problem without creating
databases for the various criteria.
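
As a rough illustration of what such per-criterion tables might look
like, here is a minimal Python sketch. The table names, the single
simplified/traditional pair, and the helper functions `fold` and
`find_all` are all hypothetical, invented only for this example; a
real table would need many entries and probably several more criteria.

```python
# Hypothetical equivalence tables, one per search criterion. Each table
# maps a variant character to the representative chosen for that criterion.
CRITERIA = {
    "exact": {},  # no folding: characters match only themselves
    "simplified_traditional": {
        "\u570b": "\u56fd",  # traditional "country" -> simplified "country"
    },
}


def fold(text, criterion):
    """Replace every character by its representative under the criterion."""
    table = CRITERIA[criterion]
    return "".join(table.get(ch, ch) for ch in text)


def find_all(text, needle, criterion="exact"):
    """Return the start offsets of needle in text under the criterion.

    Folding is one character to one character, so offsets in the folded
    text are also valid offsets in the original text.
    """
    folded_text = fold(text, criterion)
    folded_needle = fold(needle, criterion)
    hits = []
    start = folded_text.find(folded_needle)
    while start != -1:
        hits.append(start)
        start = folded_text.find(folded_needle, start + 1)
    return hits


sample = "\u4e2d\u56fd and \u4e2d\u570b"
exact_hits = find_all(sample, "\u56fd", "exact")
folded_hits = find_all(sample, "\u56fd", "simplified_traditional")
```

The point of the sketch is only that the notion of "same character" is
supplied by the table, not by the coded character set, so the same
search routine can serve different criteria.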

> For a Japanese computer user, working applications are the best
> argument. And I am sure they will appear in the near future, maybe
> even without the users noticing. If you do it right, there is nothing
> that could be noticed :-).

I'm also sure that many users won't notice the problem, but that is
just because many don't use a mixed Japanese/Chinese environment. This
doesn't mean that Unicode is doing the right thing for a multilingual
environment.

>> We (at least mule) have no technical difficulty in handling multiple
>> double-byte character sets with more-than-16-bit character code
>> internally.
> True, you invested a lot of work into these things. Did you ever
> count how much this was?

Actually, it didn't require that much work just to handle Mule's
internal character code. The difficulty of multilingualization lies
elsewhere, in such things as input methods, regular expressions, how
to recognize words, etc.
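
To give an idea of why an internal code that keeps the character set
is not much work, here is a minimal Python sketch. It is not Mule's
actual representation; `TaggedChar` and `tag_text` are hypothetical
names, and the EUC codecs are used only as a convenient way to obtain
the two-byte codes of each national standard.

```python
from typing import List, NamedTuple


class TaggedChar(NamedTuple):
    """One character stored together with the character set it came from."""
    charset: str  # e.g. "ASCII", "JISX0208", "GB2312"
    code: int     # code point within that character set


def tag_text(text: str, charset: str, euc_codec: str) -> List[TaggedChar]:
    """Convert text to tagged characters via an EUC codec (an assumption).

    EUC stores a double-byte code with the high bit set on both bytes,
    so masking with 0x7F recovers the 7-bit code of the national standard.
    """
    tagged = []
    for ch in text:
        raw = ch.encode(euc_codec)
        if len(raw) == 1 and raw[0] < 0x80:
            tagged.append(TaggedChar("ASCII", raw[0]))
        elif len(raw) == 2 and raw[0] >= 0xA1 and raw[1] >= 0xA1:
            code = ((raw[0] & 0x7F) << 8) | (raw[1] & 0x7F)
            tagged.append(TaggedChar(charset, code))
        else:
            raise ValueError(f"{ch!r} is outside {charset} in this sketch")
    return tagged


# Python strings are Unicode, so one input character stands for both
# `choku' characters here; after tagging they stay distinct.
CHOKU = "\u76f4"
japanese = tag_text(CHOKU, "JISX0208", "euc_jp")
chinese = tag_text(CHOKU, "GB2312", "gb2312")
same = japanese == chinese  # False: the charset tags differ
```

Comparing two tagged characters compares the character set as well as
the code, so the Japanese and Chinese `choku' are never confused
unless some higher-level criterion says they should be.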

> And what other nice things you
> could have implemented in that time, e.g. proportional rendering,
> real language tagging useful for any languages, and so on?

These have no meaning unless we can handle plain text correctly.

> I don't want to criticise you too much because at the time you started
> mule, Unicode was not yet available. But this does not mean that
> you should not try to see things from a neutral point of view.

I knew about the plan for Unicode at the time I started implementing
Mule. The code of Mule's beta version contained some macros which
might have been used for supporting Unicode in the future. But the
more I learned about the unification in Unicode, the more negative I
felt about it, and in the end I deleted them.

---
Kenichi HANDA
handa@etl.go.jp


