CJK Unification (was: A basic question on encoding Latin characters)

From: Otfried Cheong (otfried@cs.ust.hk)
Date: Sat Sep 25 1999 - 05:15:14 EDT


> There are a whole lot of Unihan characters that need multiple
> glyphs (according to choice of simplified Ch, trad. Ch,
> Japanese or Korean).

There seems to be a lot of confusion, even among Unicode enthusiasts,
about Han unification. Perhaps future versions of the standard should
add more explanations of this controversial issue. Let me explain my
view of the affair. I am not an expert at all, so please correct me
where I am wrong!

First of all, there has NOT been a single unification between a (PRC)
simplified and a traditional Chinese character. Similarly, I am not
aware of a single unification between a simplified Japanese and a
traditional character. (That could not happen, since many
non-simplified versions are still in use in Japan and encoded in
JIS X 0212, so unifying them would break round-trip compatibility for
documents encoded in a mixture of JIS X 0208 and JIS X 0212.)
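
To make the round-trip argument concrete, here is a minimal Python
sketch. It assumes Python's built-in euc_jp codec (which, as far as I
know, covers both JIS X 0208 and JIS X 0212) as a stand-in for the JIS
encodings; the helper name round_trips is made up for illustration.
It checks that a simplified form and its traditional counterpart get
different byte sequences and that both survive a trip through Unicode.

```python
def round_trips(text: str, codec: str = "euc_jp") -> bool:
    """Return True if text survives str -> legacy bytes -> str unchanged."""
    try:
        return text.encode(codec).decode(codec) == text
    except UnicodeError:
        # A character missing from the legacy set cannot round-trip at all.
        return False


simplified = "\u56fd"   # simplified form of "country"
traditional = "\u570b"  # traditional form of the same character

# Each form keeps its own code point, so neither is lost in conversion.
print(round_trips(simplified), round_trips(traditional))
print(simplified.encode("euc_jp") != traditional.encode("euc_jp"))
```

Had the two forms been unified into one code point, the second check
could not hold after a conversion to Unicode and back.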

So what has been unified? Roughly speaking, characters that have the
same meaning and shape. One has to understand that there are
variations of many of the primitive elements that appear in Chinese
characters, and writers have traditionally been free to write as they
preferred---the more elaborate form when carving a giant character in
stone, the simpler form when writing a shopping list... Different
Chinese fonts have used different variants of some characters before
Unicode even existed.

My last name, for instance, is U+912d:

  * *
   * * *****
*********** *
   * * * *
 ******** * *
 * * * * * *
 ** *** * *
 * * * *
 ******** * *
 * * * *
 ******** * *
    * * *
********* * *
   * * * **
  * ** *
** * *

The two dots in the top left can be written either like this /\ or
like this \/, with no difference in meaning. Everybody can easily
recognize both variants. Different fonts will show them in different
ways, even within the same locale, and so the characters have been
unified in Unicode. (There is, by the way, a PRC-simplified version,
which is encoded as U+90d1---no unification there!)
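
A small Python sketch of the same point, using only the standard
unicodedata module (the helper name describe is an invention for this
example). It prints the two code points mentioned above with their
formal names and UTF-8 bytes. The /\ versus \/ distinction does not
show up anywhere at this level, since both shapes are U+912D; only the
PRC-simplified form has a code point of its own.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Format one code point as 'U+XXXX NAME utf8=<hex bytes>'."""
    ch = chr(code_point)
    utf8 = ch.encode("utf-8").hex(" ")
    return f"U+{code_point:04X} {unicodedata.name(ch)} utf8={utf8}"


# Traditional form (both dot variants) and the PRC-simplified form.
for cp in (0x912D, 0x90D1):
    print(describe(cp))
```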

Another example of a character element that appears in two variants is
the "black" radical. It can be written either in its traditional form
as U+9ed1 (with two little dots), or in its simplified form as U+9ed2
(where the two dots have been replaced by a single stroke).
"Simplified", by the way, doesn't refer to a political process here,
it is simply a shortcut people have used for centuries to write
faster, just as the syllable "un" in handwritten English looks like
a single wiggled line.

I'm not sure why these two characters have not been unified---other
characters that contain this element have been. For instance, you will
find U+9edb with both variants of the radical in different fonts, and
nobody will have any difficulty recognizing them as variants of the
same character. (But compare the distinction U+9ed8 versus U+9ed9,
which are again variants of the same character. They are not unified
since there is a structural change in the makeup of the character.)
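
One practical consequence of "not unified" can be sketched in Python:
the pairs above are separate code points, and Unicode normalization
does not fold them together, so software that wants to treat them as
the same character has to bring its own variant table. The function
name merged_by_normalization is hypothetical; unicodedata.normalize is
the standard library call.

```python
import unicodedata

# The two non-unified pairs discussed in the text.
PAIRS = [(0x9ED1, 0x9ED2), (0x9ED8, 0x9ED9)]


def merged_by_normalization(a: int, b: int, form: str = "NFKC") -> bool:
    """True if the two code points become identical after normalization."""
    return unicodedata.normalize(form, chr(a)) == unicodedata.normalize(form, chr(b))


for a, b in PAIRS:
    print(f"U+{a:04X} vs U+{b:04X}: merged={merged_by_normalization(a, b)}")
```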

A less obvious example is U+76f4, which has two variants:

        ;                      ;
  '''''';''''''         ''''''';'''''''
    ;''''''';              ;''''''';
    ;,,,,,,,;           ;  ;,,,,,,,;
    ;       ;  versus   ;  ;       ;
    ;''''''';           ;  ;''''''';
    ;,,,,,,,;           ;  ;,,,,,,,;
    ;       ;           ;
 '''''''''''''''        '''''''''''''''

Here the difference is really one of locale: you'll find the
right-hand glyph in Japanese or Korean fonts, the left-hand one in
Chinese fonts. Most Japanese won't recognize them as the same
character, while I believe Chinese and Koreans would (but I didn't).
(Japanese won't mind seeing the vertical crossing stroke drawn
slanted, and in fact that variant can be found in Japanese fonts.)

Is it okay to unify character variants that ordinary people wouldn't
recognize? Yes, of course, just as Suetterlin script should be
encoded with the Latin-1 repertoire, even though most non-German
speakers (and even most young German speakers) cannot read it easily.

So what is all the controversy about?

There are several issues here:

(1) "This is not my name"

People can be attached to particular shapes of particular characters.
Japanese would be unhappy if you wrote their name using a variant
different from what they consider "their name". (Interestingly, they
do not mind at all if you don't know how to pronounce the name.)
There is a market in Japan for font software that allows you to modify
glyphs to be able to typeset exactly the variant you have in mind!

I believe the JIS standard actually prescribes the shapes of the
glyphs for each character, and this is perhaps exactly the grievance
that Japanese have with Unicode. If you are used to thinking of a
codepoint as being associated with a well-defined shape, the loose
view that Unicode takes seems rather careless.

Chinese seem much less fixed on a particular variant, and are exposed
to more variations in daily life. The difficulty is thus that what is
a negligible font variation to a Chinese is a major shape change to a
Japanese observer.

(2) "Preserving our legacy"

Chinese, Koreans, and Japanese have been keeping records of their life
for between one and two thousand years, and there is a large amount of
literature that is slowly but steadily being transcribed into digital
form. These documents contain archaic characters that are no longer
used---so these have to be encoded---and archaic variants of
characters still in use. Should these be replaced by the modern
"unified" variant? Not if we want to faithfully preserve the
document, right? The CNS character set defined in Taiwan, where a
huge effort to digitize the classics is under way, now
contains about 50000 characters, for exactly this reason. The CCCII
character set has a 94 x 94 x 94 code space arranged in multiple
layers that contain variants of the same characters.
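
For a sense of scale, the arithmetic of a 94 x 94 x 94 code space can
be sketched in a few lines of Python. The coordinate names plane, row,
and cell and the linear numbering are assumptions made for this
example; they are not meant to reproduce CCCII's actual byte values or
its layer structure.

```python
CELLS = 94  # number of values per coordinate in a 94 x 94 x 94 space


def total_positions() -> int:
    """Number of code positions in the whole space."""
    return CELLS ** 3


def split_index(index: int) -> tuple[int, int, int]:
    """Split a linear index into three 0-based coordinates (plane, row, cell)."""
    if not 0 <= index < total_positions():
        raise ValueError("index outside the code space")
    plane, rest = divmod(index, CELLS * CELLS)
    row, cell = divmod(rest, CELLS)
    return plane, row, cell


print(total_positions())                     # 830584
print(split_index(total_positions() - 1))    # (93, 93, 93)
```

That is 830584 positions, far more than the roughly 50000 characters
of CNS, which is what leaves room for whole layers of variants.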

I am not sure what direction Unicode is taking with respect to this
issue. Unicode 3.0 improves support for CNS, but gives up the source
separation rule, so that CNS/Unicode roundtrip compatibility is no
longer possible.

(3) "What do these Chinese characters do in my letter?"

People are afraid to see alien fonts due to unification. This seems
to be mostly a misunderstanding and is actually quite independent of
the shape variations. A typical mainland Chinese font is easily
recognizable as such by its style, even if there is not a single
simplified character in a sentence. A Japanese, a Korean, or even a
Hong Konger would be quite displeased to see characters in this font
style appear in their letters. Conversely, the Japanese gothic style
isn't appreciated very much in Hong Kong (although there are so many
Japanese articles for sale here in original packing that one would
hardly notice).

You'll say "of course, everybody knows that there is no such thing as
a Unicode font". Well, actually that's not quite true either. For
instance, the style called Mincho in Japan originated in Ming-dynasty
China and is universally acceptable in the CJK countries. The
Japanese fontmaker Typebank has a Chinese Mincho font with
PRC-simplified characters approved by the Chinese government that
shares glyphs with their Japanese Mincho font, and it is quite
feasible to make a Mincho font that will serve the mainland
Chinese/Korean/Japanese users decently.

The main difficulty in making a "Unicode font" is not the style, but
the character variations discussed above. Since Japanese are most
concerned about using a particular variant, and since Chinese will
recognize most variations, one can indeed make a kind of Unicode
ideographic font that uses Mincho style and the Japanese variant where
applicable. This is exactly what the "CJK Dictionary Publishing
Society" (http://www.cjk.org/) is doing for their "Dictionary of
Unified CJK Characters" that shows a single glyph for each character
(the font being made by Dynalab in Taiwan), and I believe this is also
how Bitstream made their Cyberbit font. I'm not claiming that this is
the perfect solution---certainly not for an application for a specific
market---but the font will be readable and acceptable for all CJK
users, even though people may be surprised by some unfamiliar glyph
shapes.

Despite all the apparent differences, CJK cultures have a common
heritage, and Chinese characters form a strong part of that. Despite
differences in writing style and changes in meaning, a Japanese can
travel through mainland China without speaking a word of Chinese,
communicating by notes in Chinese characters. I don't make a mental
distinction between a character I see in Japan and one I see in Hong
Kong, and Unicode Han unification reflects this. I think this is a
good thing.

Otfried Cheong


