Re: unified or over-unified (was Unicode CJK Language Myth)

From: Martin J Duerst (mduerst@ifi.unizh.ch)
Date: Sun Jun 09 1996 - 15:05:34 EDT


Ken'ichi Handa writes:

>mtoyo@Lit.hokudai.ac.jp writes:
>> It exists.
>> This GB-looking `choku' (ie. without vertical stroke at left) was fairly
>> common in Japan before Meiji era(1868-). Even in Meiji era, there exist
>> typefaces such as `GYOSHO-TAI' (hand-written style) in `TSUKIJI-KATSUJI'
>> (one of the major printing offices existed in Tokyo) movable type set.
>> They were used in printing Japanese.
>> ...
>
>Thank you very much for the quite surprising (at least for me)
>information. I'm ashamed of not knowing this fact, and I express my
>apology to those who claimed that Unicode doesn't make "choku" case
>worse.

There is no reason to be ashamed about the fact that you didn't know
this. But what we all should learn for the future is that with respect
to kanji, we never know everything!

>So the actual case for "choku" is not:
> (1) Japanese people regard them as of different JITAI
> (abstract shape), therefore Japanese think they don't have
> a character of Chinese JITAI.
> Chinese people regard them as of the same JITAI.
>but:
> (2) Both people regard them as of the same JITAI.

The terms "Japanese people" and "Chinese people" are
much too general. Every individual may have a different view
of that character. What is important is that Japanese readers either
don't know the shape or, if they know its historical
use, can agree that it is primarily a font difference.

>But I still have the following question to Unicode:
> Can all the unified characters in Unicode be regarded as the
> (2) case? Were all characters unified after checking it?

I haven't been on the committee, but I know that these characters
were checked and rechecked quite extensively.

>If Unicode assures that all unified characters are in the case (2),
>then I'll be satisfied with Unicode that it is not worse than
>JISX0208.

If there were two or three (or maybe even ten or twenty) cases
where for the huge benefit of unification, some compromise
had to be made, would that be so bad?

>But, as far as I know, the book The Unicode Standard (Vol.1&2) does
>not mention about such a principle. The book says only that
>difference in "treatment of a source character set" leads to code
>separation. How much the "treatment" is considered is still doubtful.

I think this is very clear. If two shapes have different codepoints
in a given source (which in case of Japan consists of JIS 208 and 212,
apart from the 7-bit JIS-Roman and half-width kana), then these
shapes have different codepoints in Unicode.
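
A minimal sketch of this rule as I read it, not the committee's actual
procedure. The function name, the shape labels, and the code values
("J-1", "G-1", ...) are hypothetical placeholders, and JIS 208 and 212
are treated together as one Japanese source, as described above.

```python
def must_separate(shape_a, shape_b, sources):
    """Return True if any single source gives the two shapes different codepoints."""
    for table in sources.values():
        code_a = table.get(shape_a)
        code_b = table.get(shape_b)
        # Only a source that encodes BOTH shapes, at different places, forces a split.
        if code_a is not None and code_b is not None and code_a != code_b:
            return True
    return False


# Hypothetical, heavily simplified source tables (placeholder code values).
SOURCES = {
    "Japan (JIS 208 + JIS 212)": {
        "U+63B4 shape": "J-1",
        "U+6451 shape": "J-2",
        "choku with left vertical stroke": "J-3",
    },
    "China (GB)": {
        "choku without left vertical stroke": "G-1",
    },
}

print(must_separate("U+63B4 shape", "U+6451 shape", SOURCES))
print(must_separate("choku with left vertical stroke",
                    "choku without left vertical stroke", SOURCES))
```

In this toy data the first pair is kept apart because the Japanese source
encodes both shapes, while the two "choku" shapes never meet inside one
source and so can be unified.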

>For instance, Unicode puts different points for U+6384 and U+6451.
>Those are unified in JISX0208 but JISX0212 contains U+6451. How can
>"source set separation" rule treat this kind of situation?

First, U+6384 should be U+63B4. The source set separation rule
cannot deal with this kind of situation, but up to now, this situation
does not exist. The confusion is due to the very bad handling of
the 1983 revision of what is now JIS 208. In that revision, several
codepoints were exchanged, and several shapes were changed
otherwise. Those changes are frequently all put into the same
pot, but there are two classes that should be clearly distinguished:
- Small changes that can be attributed to changes in the font used
        to print the standard. There are such changes even between
        several printings of the same edition of the standard. These
        are not relevant for the definition of the characters themselves,
        but some people have thought that they were, and therefore
        the variability of Japanese fonts has been very unfavorably
        reduced.
- Bigger changes, very much equivalent to the exchanges mentioned
        above, just that the "other" shape was outside the code table.
U+63B4 is definitely such a case. If you eliminate the "hand" radical,
you also have both the traditional and the simplified variant in
JIS 208, because the difference in stroke count and shape is too
big to just unify them.
Now some people were not happy with these changes, and some even
claim that the newly introduced shapes never existed in actual
use before the 83 version of the standard was introduced. And so
some companies continued to use the old JIS 78 shapes even if
otherwise, they might have adhered more than necessary to JIS 83,
and even if they claimed that their fonts conformed to JIS 83.
So in 1990, when the extension of JIS 208 by JIS 212 was discussed,
it was one possible solution to include the old variants from JIS 78
into JIS 212, to try to settle the matter clearly. JIS 208 contained
U+63B4, and JIS 212 contained U+6451. Although it is not written
explicitly in that way, it is very clear that in 1990, this was how
the standard tried to settle this matter. This was also the version
of the standards that was submitted by Japan to the IRG, and
thus it is very clear and obvious how these characters appear in
Unicode, and to which Japanese codepoints they are mapped.
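
One way to inspect this mapping today is a small Python sketch. It assumes
that Python's bundled "euc_jp" codec reflects the JIS 208 / JIS 212 tables
discussed here; the helper name describe() is mine, not part of any
standard. In EUC-JP, a JIS 212 character is introduced by the byte 0x8F,
and a JIS 208 character is a plain two-byte sequence.

```python
def describe(ch: str) -> str:
    """Say which EUC-JP code set Python's 'euc_jp' codec uses for one character."""
    try:
        raw = ch.encode("euc_jp")
    except UnicodeEncodeError:
        return "not encodable"
    # A leading 0x8F byte (single shift 3) introduces a JIS X 0212 character.
    if len(raw) == 3 and raw[0] == 0x8F:
        return "JIS X 0212"
    # Two bytes, both in the 0xA1..0xFE range, form a JIS X 0208 character.
    if len(raw) == 2 and raw[0] >= 0xA1:
        return "JIS X 0208"
    return "other"


for cp in (0x63B4, 0x6451):
    print(f"U+{cp:04X}", describe(chr(cp)))
```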

Now comes the 1996 (or most probably 1997) version of JIS 208.
This version is done by a very knowledgeable committee with
high ambitions, and they work very hard on documenting
the standard very well and making up for omissions of the past.
And they feel they have to do something about the problem
of those shapes swapped out of the standard in 1983 and put
into 212 in 1990. They are not satisfied with the 1990 solution
because quite a few fonts sold as 1990-compatible still contain
the pre-1983 shapes. They are sceptical about JIS 212, because
it is not yet available on that many systems, and they even
have the idea of abolishing it and replacing it by something vastly
different.
[This seems currently only an idea, and given that JIS 212 is indeed
not available on that many systems, it might even look attractive.
But I am sure that when JIS 212 is abolished, many people will
heavily complain. One should finally understand (the change from
JIS 78 to JIS 83 should be lesson enough) that character standards
should NOT be changed, even if one thinks one has good reasons
for it.]
Also, the people currently working on the JIS 208 revision have
decided that they would not add new codepoints to the standard.
So they decide to just accept current practice, even if that was
not the intent in 1990, and even if other solutions are available.
So they just define that for a very restricted number (22) of codepoints,
not only the usual shape/font variation is allowed, but also some
change that is very clearly above such small variations.

Unicode is based on the 1990 JIS version. When the CJK unification
was made, the committee was safe in the assumption that the
old problems from 1983 had been solved. If the JIS standard now
changes because the committee thinks it has to try yet another
solution for the mess produced in 1983, you cannot hold Unicode
responsible for it.
In my opinion, the solution to this problem would have been to
add the necessary 22 codepoints to JIS 208, so that those font
makers and users that want the 78 forms at all cost can use them,
and there is no conflict with JIS 212 or Unicode. The idea to have
a codepoint representing (A or B) and another only representing
B is an interesting idea, but character encoding up to now does
not work that way.
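
To make the difficulty concrete, here is a toy sketch; the names code_X,
code_Y, and the shapes "A" and "B" are hypothetical placeholders, not real
codepoints. If one codepoint allows shape A or B and another allows only
B, then shape B no longer identifies a single codepoint.

```python
# Hypothetical table: which shapes each codepoint is allowed to display.
ALLOWED_SHAPES = {
    "code_X": {"A", "B"},  # tolerates either shape
    "code_Y": {"B"},       # reserved for shape B only
}


def codepoints_for(shape):
    """List every codepoint whose allowed shapes include the given shape."""
    return sorted(code for code, shapes in ALLOWED_SHAPES.items() if shape in shapes)


print(codepoints_for("A"))
print(codepoints_for("B"))
```

Shape A maps back to one codepoint, but shape B maps back to two, so a
reader (or a converter) who sees B cannot tell which code was intended.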

>For instance, Unicode unifies characters in U+80B2, one of them is in
>CNS11643-1 (4B3F) and the other is in CNS11643-6 (2D69). I'd like to
>ask to Taiwanese people how they will treat Unicode (or ISO10646)
>along with CNS series.

Again, Unicode is based on the version of CNS available at the time
the unification was done, and plane 6, which contains CNS11643-6 (2D69),
was not available at that time.

Regards, Martin.


