Re: Unicode CJK Language Myth

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Mon Jun 03 1996 - 10:43:42 EDT

Next message: Tom Fruchterman: "Illuminator: a free Unicode editor for Motif and Unix"
Previous message: Michael Everson: "Re: Khmer offline???"
Maybe in reply to: Mark Davis: "Unicode CJK Language Myth"

Ken'ichi Handa writes:

>mduerst@ifi.unizh.ch writes:
>>
>>> Why should I write again and again that the difference is beyond what
>>> is allowed as font variations?
>
>> This is not true. JIS 208, the basic Japanese standard, for example,
>> does not disallow such a font variation in any way. The new edition,
>> JIS X 0208-1996 (or more probably 1997), is even more specific
>> about this.
>> This is good because we don't want to artificially limit the creativity
>> of font designers.
>
>I hope my English is good enough to interpret the difficult Japanese
>sentences used in the draft of JISX0208-1996.

Well, to interpret Japanese, your Japanese has to be good enough,
and it certainly is :-).

>At first, in my sentence "the difference is beyond what is allowed as
>font variations", the subject of "allow" is Japanese, not JISX0208. I
>know JISX0208 puts no criterion on font variations.

Sentences or clauses without a subject are difficult.
Japanese uses them a lot, and has its own rules for them. English
has different rules. Omitting the subject in English does not
signal that the writer has a particular subject in mind (and can
assume the reader has the same one in mind); it signals
that the statement is true in some general sense. In the
case of allow, if you say "it is not allowed", then this usually means
that there is some kind of law or rule that disallows it.
So that is the way I understood it, and as we both agree that
there is no such law or rule, we can end this part of the discussion
here.

As for your claim that the difference (between the usual Chinese
and the Japanese glyph variant of "choku") is beyond what
Japanese allow as font variation, I think this very much depends
on the circumstances (see also below). In some cases, e.g.
if it appears in context, there is a rather high probability that
they would not even notice the difference. If it appears in a
fancy logo, they would probably not object either.

>And, I don't like JISX0208 also because of its excessive unification
>done mainly to cover/correct the ambiguity of the previous version.
>
>Have you read the "generalization criterion" ("housetsu kijun" in
>Japanese) in Section 5.6.2 and Rationale 2.7 of the draft of JISX
>0208-1996? Especially Rationale 2.7 is important to understand the
>meaning of the "generalization criterion".

Yes, I have read that. I specifically don't like point (8) of Section 5.6.2
because instead of trying to clear up what is left from the mess of
the 1978->1983 change, it messes things up even more, and produces
conflicts with JIS 212 and JIS 221. I have expressed this in a comment
to the JIS committee, though without a very satisfactory result (the main
response I got was that JIS 212 has not yet been checked as thoroughly
as JIS 208).

>It is stated that the generalization criterion can be used to identify
>which code point of JISX0208 an EXISTING Kanji character glyph (not
>font/shape/style) corresponds to, and it should not be used to admit
>or create NON-EXISTING glyph by using the criterion as deduction rule.

I agree that this is what the JIS 208 draft says, and also that new variants
should not be created by deduction. But what is important here is that
the standard does not prohibit the creation of new variants (even if they
would fall outside the "hosetsu kijun"), and that it also makes no sense
to prohibit it. Despite computer standards, font designers should still
have their creative freedom (which of course they should use with
care, depending on the task at hand).

>Here I used (and have used in previous mails) the word "glyph" as the
>translation of Japanese word "jitai" which term is defined in JISX0208
>as follows:

Terminology is a difficult area, but usually context helps even
if words are not used exactly according to a standard terminology.
But if you want to be faithful to ISO (and JIS) terminology, you should
not treat "glyph" and "jitai" as the same thing.

>Glyph itself is an abstract concept and exists only in people's minds,
>so we have to use common sense to get a glyph of a character from
>shapes which we actually see. Since JISX0208 allows wider variants
>than what our (Japanese) common sense identifies as a character glyph, it
>lists such variants in the standard.

The term "jitai", like other terms such as "jikei" and "ji" itself, is
used in very different ways in everyday (Japanese) usage. Every single
person has a different opinion, and, probably more important,
anyone not fully aware of all the issues gives different
answers depending on the situation. JIS 208 in general makes
quite reasonable assumptions about what the majority of the
people would expect for their texts. Whether what JIS 208 defines
as "jitai" and uses for unification is indeed what people on the
street call jitai or not is not that important.
Also, some of the unification that JIS 208 makes is due to
character history. Not all people that know the modern
characters also know their history.

The main reason that the new JIS 208 gives more information than
previous versions about the meaning of the standard is that the old
versions have been misinterpreted, too narrowly in most cases, mainly
by people that did not read all of the text and did not think about it.

>Going back to the case of `choku', we can say that JISX0208 does not
>contain a character which has the Chinese glyph because:
>1) Such a character does not exist in Japan (i.e. not being used as a
>variant of Japanese glyph `choku' in Japan), but of course as far as I
>know.
>2) Even if the glyph has ever existed as Japanese somewhere in Japan,
>the glyph can't be deduced from the base glyph for Japanese `choku' by
>the generalization criterion.
>3) Our common sense doesn't identify the Chinese glyph as the variant.

JIS 208 does not define any glyph shapes. The generalization criterion
is designed to help identify any existing glyph shapes you find
"on the street" with a character in the standard. This "help" is rather
complete as far as the standard Mincho fonts and present-day
typographic practice go. It does not cover other types of fonts,
especially not fancy fonts, and of course it cannot foresee the future.
If in the future a font designer decides that the "Chinese" form of
"choku" is what (s)he wants in a new design, JIS 208 has absolutely
no problem with such a decision.

>>> I have no idea why mine can't be recognized as a valid objection.
>
>> The problem is that your objection is not based on actual in-use
>> observation, but only on theoretical what-would-happen-if
>> argumentation,
>
>Since I've not yet seen Unicode being used in any multilingual
>software, all I can do is to infer what will happen if ... What's the
>problem? I'm not assuming any unusual situation, though one may say
>that real necessity of multilingualization itself is very rare.

What you are ignoring is not the rarity of true multilingual text
processing, but the fact that, within the field of multilingual text
processing, the problems that you are assuming
won't come up that often, and won't "hurt" as much as you are
assuming.

>> and that you have seen the two variants side-by-side
>> in a standard before seeing them in actual use.
>
>I don't understand the intention of the above sentence. Do you think
>we don't notice the difference of those variants? We can easily
>notice it. Noticing the difference of shapes and identifying them as
>variants of the same character or not is different.

If you show the two variants to somebody in Japan side by side,
they will immediately point out the difference. But if you "hide"
a "wrong" shape in a text, most people will read the text and
completely ignore the wrong shape.
Even more interestingly, most would be very surprised that
they had not seen the difference after it was pointed out to
them. But this is not a sign that they don't know the characters
well (as I am sure some of them would be afraid to admit), but
actually a sign that they know the characters, and Japanese in
general, very well and read very efficiently.
Also, of those that actually noticed the difference, most
would not notice at first sight. They might get a little uneasy
after having read over it, then they would go back to check
again, and only after some deliberation would they decide
that something is different from usual.

>> Assume a newspaper
>> would print the "wrong" glyph variant in one of their articles. How
>> many people would recognize the difference? How many people would
>> have difficulties understanding the text? How many people would bother
>> writing a letter to the editor, or mentioning that character to a
>> friend? I am sure all these figures would be very low,
>
>I agree with it because most people will just think it's a simple
>error happening occasionally like many other typos.

What is interesting here happens much earlier: as said above,
most of the readers would never recognize the different shape!

>But, if the
>"wrong" glyph is used always, I'm sure someone will warn the editor.

The main reason for this is that if it's used frequently, the probability
that somebody actually notices the difference is higher.

>>> And, no Japanese character set contain a character which allows
>>> Chinese `choku' variant. In this sense, a character which allows
>>> Chinese `choku' variant is different from the Japanese character which
>>> doesn't allow the variant.
>
>> This is not true. See above.
>
>This is true if you read "allows Chinese `choku' variant" as "allows
>Chinese `choku' glyph as a variant of Japanese `choku' glyph of
>JISX0208".
>
>Of course, a font of JISX0208 which has a shape like Chinese `choku'
>shape does conform to JISX0208, in which case, we should regard that
>the SHAPE is just a concrete graphic representation of Japanese
>`choku' GLYPH.

This is just playing around with words. The fact is that if a font
designer decides to choose the Chinese variant for the JIS 208
code point "choku", this conforms to JIS 208.

>> And while for Greek characters, the Greeks definitely think that
>> an 'a' and an 'alpha' are different,
>
>Here, the point is not whether Greeks know the difference or not, nor
>whether the difference really exists or not. I objected to the
>idea that a character set does not distinguish a character not
>contained in it from characters contained in it.

A character set definition tries to do this as well as possible or
as well as necessary. For Greek, it's rather straightforward, at
least with respect to Latin and 'a' and 'alpha'. For other things,
such as Coptic, it might be more difficult.
For kanji, the current version of JIS 208 was thought to do enough,
and the new version is trying to do more. Nevertheless, the
nature of writing makes it impossible to know exactly and
absolutely, for all unconsidered and possibly future variants,
what is subsumed by the codepoints of a standard and what is not.

>> for the character "choku" the
>> Chinese think that both variants are the same character, and the
>> Japanese would do so too if they knew the Chinese variant.
>
>There's a difference between "Japanese know that Chinese people use
>the different variant" and "Japanese think that they had better allow
>the variant for Japanese character".

Nobody says that Japanese have to "allow" the Chinese variant in the
sense that they have to put up with a font that contains this if they
don't like it. What I say is just that if the Chinese variant happened
to appear on a Japanese monitor, there are only two answers:

- (by those who know the Chinese variant): This is "choku", in the
        variant that is very common in China.
- (by those who don't know the Chinese variant): I don't know
        what character this is.

What is excluded is a third answer, something like:

- (nobody actually answers that way): This is a different character,
        definitely not "choku", as it has some completely different
        meaning.

>If a criterion for identifying characters concerns only a glyph,
>Unicode also has a difficulty. If a criterion concerns not a glyph
>but meaning of ideographic characters, Unicode also has a difficulty.
>Even Chinese people will want to search some character while ignoring
>the difference of simplified and traditional glyphs.

Yes, usually you want to ignore more differences than the standard
does. The main advantage for Unicode here is not that you don't
have to have some tables, but that you don't have to care about
whether two values in memory can be considered equivalent or
not. You don't have to do on-the-fly conversion between e.g. GB
and Big-5 all the time, and you have a predefined way of representing
things that appear in both of them, and of representing things
that appear just in either of them.
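
A minimal sketch of this point, in present-day Python. The codec names
(euc_jp, gb2312, big5) are Python's own, and the sample characters and
the helper can_encode are chosen here for illustration; none of this
comes from the original exchange. It shows that "choku" ends up at the
single code point U+76F4 whichever legacy encoding it came from, and
that a character found in only one of GB and Big-5 still has one
predefined representation.

```python
# Sketch only: codec names are Python's; sample characters are illustrative.
CHOKU = "\u76f4"  # the unified code point for "choku"

LEGACY_CODECS = ["euc_jp", "gb2312", "big5"]


def can_encode(ch, codec):
    """Return True if the legacy character set behind `codec` contains `ch`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Each legacy encoding has its own byte sequence for "choku" ...
encoded = {codec: CHOKU.encode(codec) for codec in LEGACY_CODECS}

# ... but decoding any of them yields the same value in memory,
# so no pairwise GB <-> Big-5 <-> JIS comparison tables are needed.
decoded = {codec: raw.decode(codec) for codec, raw in encoded.items()}

for codec, raw in encoded.items():
    print(f"{codec:8} {raw.hex()} -> U+{ord(decoded[codec]):04X}")

# Characters present in only one of GB 2312 and Big-5 still have
# exactly one representation each.
SIMPLIFIED_MEN = "\u4eec"   # in GB 2312, not in Big-5
TRADITIONAL_MEN = "\u5011"  # in Big-5, not in GB 2312

for ch in (SIMPLIFIED_MEN, TRADITIONAL_MEN):
    print(f"U+{ord(ch):04X}",
          "gb2312:", can_encode(ch, "gb2312"),
          "big5:", can_encode(ch, "big5"))
```

A table for ignoring simplified/traditional differences in a search
would still have to be added on top of this; the sketch only covers
the representation question.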

>> And what other nice things you
>> could have implemented in that time, e.g. proportional rendering,
>> real language tagging useful for any languages, and so on?
>
>These have no meaning without that we can handle plain text correctly.

Yes. But there are different opinions about plain text. And a Unicode-based
system is definitely closer to what the average user understands by
plain text than Mule. In Mule, you sacrificed thousands of character
equivalences, e.g. in all the Latin-X extensions, and of course in CJK,
that an average user would think of as being "the same", for the
sake of extremely few cases where one could also argue that they were
"over-unified". (Note that this does not mean that I think that they have
been over-unified.)
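
To illustrate the Latin-X case, here is a rough sketch in Python. The
tagged helper is a hypothetical, much simplified model of a
representation that keeps a charset label with every character; it is
not Mule's actual internal format, and the charset labels are used
only as illustrative strings.

```python
# Sketch only: `tagged` is a hypothetical model, not Mule's real format.
E_ACUTE_BYTE = b"\xe9"  # "e with acute" in both ISO 8859-1 and ISO 8859-2


def tagged(charset, raw):
    """Model a charset-tagged text: one (charset, byte) pair per character."""
    return [(charset, byte) for byte in raw]


# Unicode view: both legacy sources decode to the same code point.
from_latin1 = E_ACUTE_BYTE.decode("iso8859_1")
from_latin2 = E_ACUTE_BYTE.decode("iso8859_2")
print(from_latin1 == from_latin2)  # compares the two decoded strings

# Charset-tagged view: the same letter compares as different unless
# an extra equivalence table is consulted.
tagged_latin1 = tagged("latin-iso8859-1", E_ACUTE_BYTE)
tagged_latin2 = tagged("latin-iso8859-2", E_ACUTE_BYTE)
print(tagged_latin1 == tagged_latin2)  # compares the two tagged lists
```

In the tagged model a search for the letter typed in one Latin-X
charset would miss the same letter entered in another, which is the
kind of equivalence an average user expects to get for free.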

Also, even if it is correct that plain text comes first, there is no reason
to not think about what comes later, and to integrate these things
into your system design.

Regards, Martin.


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:31 EDT