Unicode CJK Language Myth

From: Mark Davis (mark_davis@taligent.com)
Date: Wed May 08 1996 - 19:49:45 EDT


Subject: Unicode CJK Language Myth
Time: 14:27
Date: 05/08/96

The Unicode CJK Language Myth

A misunderstanding about Unicode Han Unification seems to be prevalent in
certain circles. For example, a recent IAB character set document contained
the following quote, which implies that Unicode is not usable for CJK without
additional language information.

"Although the workshop decided not to explicitly address the so-called "CJK
rathole", a few members felt it was necessary to have some mechanism to
address the problem of correct Han character display in the ISO-10646 issue,
and that saying that it was a "font issue" did not suffice.

The "CJK rathole" refers to the extended discussion about "Han unification",
the use of a single ISO-10646 codepoint to represent multiple national
variants of a Chinese (Han) character. ISO-10646 can map uniquely to any
single CJK national character set, but in the absence of additional
information an application can not display an ISO-10646 text using the proper
national variants for that text."

This implication is incorrect. The goal and methods of Han Unification were to
ensure that the text remained legible. While font, size, width and other
format specifications would need to be added to produce precisely the same
appearance on the source and target machines, in the absence of these the
plain text will still be legible.

To quote Lee Collins: "There is never any confusion in Unicode since the
distinctions between the unified characters are all within the range of
stylistic variations that exist in each country. Unless it is a mistake, there
is no unification in Unicode that should cause a reader to think that one
character is another or cause a reader to fail to identify a character if it
appears in a different font."

Some Practical Scenarios

Before rushing to conclusions about the requirements for adding language
information, consider the typical scenarios for mail messages. In all of these
scenarios, language information is NOT vital.

Scenario 1. Japanese user sends out unmarked Japanese text. Readers are
Japanese (with Japanese fonts). Readers see the text in a Japanese font, and
it is legible.

Scenario 2. Japanese user sends out unmarked mixture of Japanese and Chinese
text. Readers are Japanese (with Japanese fonts) and Chinese (with Chinese
fonts). Each reader sees the mixed text in a single font; however, the text
is still legible. Readers recognize the difference between the languages by
the content.

Scenario 3. Japanese user sends out mixture of Japanese and Chinese text. Text
is marked with font, size, width, and so on because the exact format is
important. Readers have the fonts and other display support. Readers see the
mixed text with different fonts for different languages. Readers both
recognize the difference between the languages by the content and see the
text with glyphs that are more typical for the particular language.

Relevant Anecdotes

After sending out a (shorter) message on this same subject, I got the
following replies in the same vein.

Anecdote 1. "Yes, yes. I once took our NeXT Japanese product home, on my home
machine, and showed it to my wife. I put up a hunk of JIS code chart on the
screen preliminary to showing her how the Japanese input system works. The
first words out of her mouth were: "Naaaaani sore? Chugokugo?" which, roughly
translated, means: "What's *that*? Chinese?"

I explained that it's the Japanese national standard (rendered in a very
finely designed Morisawa Mincho font)... and she said she's never seen MOST of
those characters..."

Anecdote 2. "To add another piece here and a bit of humour, I want to tell a
story that happened a few months ago in Redmond. We have had a large group of
Japanese engineers from a famous and very large Japanese computer company.
Anyway, we were doing some experiment of Ideographic font legibility on a TV
screen. So I figured I would prepare a screen shot of characters, put that on
a TV and ask some Japanese guys what they thought.

So now I took a page of what is the output of the ancestor of the now famous
gridfontformatter, selected a Gothic font (made by the Ricoh company) from our
Japanese OS and put that on a screen. And to my big surprise, my Japanese
colleagues didn't comment on the font legibility at all, they flatly denied
that the font could be Japanese at all! It had to be Chinese for sure.

After some head scratching I tried to understand what happened there:

1) the font was shown in Unicode order (that is fairly random for a Japanese
user)

2) as the order was random, it is likely that a good 2/3 of the characters
were unknown to the viewers

So even a Japanese designed font can look absolutely Chinese to the casual
Japanese user, given an appropriate setting."

Real Evidence Needed

The CJK Language proponents need to come up with actual printed documents and
scenarios that illustrate the alleged problems. Failing such evidence,
software programmers and standards designers need not let these purported
problems impede their progress towards adoption of Unicode as the worldwide
character encoding.

Mark Davis

P.S. Any other anecdotes are appreciated!


