Re: Unicode CJK Language Myth

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue May 14 1996 - 19:15:21 EDT


>
> >To quote Lee Collins: " .................... Unless it is a mistake, there
> >is no unification in Unicode that should cause a reader to think that one
> >character is another or cause a reader to fail to identify a character if it
> >appears in a different font."
>
> I was unaware that both `choku' in "choku-setsu (in Kanji)" and
> `zhi' in "yi-zhi (in Simplified Hanzi)" derived from the same
> character (U+76F4 in Traditional Hanzi) until I saw the word "choku-setsu"
> written in a Chinese-first Unicode font because their glyphs are quite
> different. A Japanese who doesn't know Chinese could fail to identify
> the character if it is written in a Chinese font.
>

Mori-san has come up with an excellent example. For those of you following
this discussion who may not be familiar with CJK fonts, the two glyphs
in question appear roughly as follows. Sorry I can't show the fine points
of corners and stroke terminations with a bunch of asterisks in an ASCII
text file, but you should get the idea:
 
      *                             *
*************                 *************
      *                             *
  *********                     *********
  *       *                     *       *
  *********                     *********
  *       *     versus          *       *
  *********                     *********
  *       *                   * *       *
  *********                   * *********
  *       *                   *
**************                **************

Both of these glyphs represent U+76F4. The glyph on the left is what
is typically seen in a Chinese font. (The "zhi" in "yi-zhi" in Chinese.)
The glyph on the right is *always* used in Japanese fonts -- at least
in all examples I have seen. It is the "choku" in "choku-setsu". For
those of you with JIS charts, this is JIS X 0208-1990 character 3630
(Ward 36, Point 30).
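
For anyone who wants to check the ward/point number against the Unicode
value, here is a minimal Python sketch. It assumes Python's built-in
"euc_jp" codec, which places a JIS X 0208 character at the two bytes
0xA0 + ward and 0xA0 + point. The helper name kuten_to_unicode is made
up for this illustration and is not part of any standard or library.

```python
# Sketch: map a JIS X 0208 ward/point ("kuten") pair to a Unicode character.
# kuten_to_unicode is a hypothetical helper name used only for illustration.

def kuten_to_unicode(ward: int, point: int) -> str:
    """Return the Unicode character for a JIS X 0208 ward/point pair."""
    if not (1 <= ward <= 94 and 1 <= point <= 94):
        raise ValueError("ward and point must each be in 1..94")
    # EUC-JP stores a JIS X 0208 character as two bytes:
    # 0xA0 + ward followed by 0xA0 + point.
    euc_bytes = bytes([0xA0 + ward, 0xA0 + point])
    return euc_bytes.decode("euc_jp")


ch = kuten_to_unicode(36, 30)
print(f"U+{ord(ch):04X}")
```

For ward 36, point 30 this should report U+76F4, the character under
discussion.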

The Unicode Standard, Version 1.0, Volume 2 was printed (by necessity)
with a mixture of fonts from different sources, because no complete
Unicode font existed at that time. The glyph for U+76F4 printed in
that book (on p. 276) came from a Japanese-designed font, and so
appears as the glyph on the right.

Now in 1996, more than one complete font of the 20,902 Unified Han
ideographic characters exists -- some designed by Chinese font
foundries and some by Japanese font foundries -- so that it is
possible to print the entire set in nationally appropriate glyphic
styles.

Thus, for example, when GB 13000.1-93 (the Chinese national standard
corresponding to ISO/IEC 10646-1) is printed, U+76F4 is shown
with the glyph on the left. When JIS X 0221-1995 (the Japanese national
standard corresponding to ISO/IEC 10646-1) is printed, U+76F4 is
shown with the glyph on the right.
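
The point that the code point is shared while the glyph is a matter of
the font can be illustrated with a small Python sketch. It uses the
older national character sets (the "gb2312" and "euc_jp" codecs that
ship with Python) as stand-ins, since GB 13000.1 and JIS X 0221 are
not available as separate codecs; that substitution is an assumption
of this example, not something the standards themselves describe.

```python
# Sketch: one Unicode code point, reached from two national encodings.
# The variable names below are illustrative only.

ZHI_CHOKU = "\u76f4"  # the character discussed in this thread

gb_bytes = ZHI_CHOKU.encode("gb2312")   # Chinese legacy encoding
jis_bytes = ZHI_CHOKU.encode("euc_jp")  # Japanese legacy encoding

# The byte sequences differ between the two national encodings ...
print(gb_bytes.hex(), jis_bytes.hex())

# ... but both decode to the same single code point.
from_chinese = gb_bytes.decode("gb2312")
from_japanese = jis_bytes.decode("euc_jp")
print(from_chinese == from_japanese)
print(f"U+{ord(from_chinese):04X}")
```

Nothing in the decoded text records which glyph shape to draw; that
choice is left to whatever font is used to render it.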

As regards legibility, I think Mori-san is quite correct that if
presented with the glyph on the left *in isolation* or embedded in
otherwise unfamiliar Chinese text, a Japanese person could fail to
identify the character as being identical to the "choku" of "choku-setsu".
There are enough differences in the glyphs that in principle it
could represent a different character -- though in fact it does not --
and since Japanese fonts always show the glyph on the right, the glyph
on the left will look completely unfamiliar.

However, I also believe it to be the case that if a mailer or
some other process substituted a Chinese font containing the glyph
on the left for the expected Japanese font containing the glyph
on the right, Japanese text would still be perfectly legible to
a Japanese speaker. For example, if the following sentence:

sore ga kare no shinpai no chokusetsugenin de atta.
("That was the immediate cause of his apprehension.")

were rendered with a Chinese style font (using the glyph on the left)
that otherwise had appropriate representations for the Hiragana
and for the kanji used for "kare", "shinpai", and "genin", a
Japanese reader would quite likely read that straight off without
even noticing the non-standard glyph used for "choku" -- or
would, at most, note that the characters "look strange", but
still have no trouble reading the sentence.

The important measure of legibility is legibility in context --
not identification of isolated glyphs out of context.

The good news is that it would be quite easy to do objective,
reproducible testing to measure legibility of this sort -- if
anyone would care to set up the appropriate statistical controls
and design the tests.

--Ken Whistler
Technical Director, Unicode, Inc.


