L2/99-322 Comments on JCS compatibility characters in L2/99-310 through L2/99-313 by Lee Collins. 1999-10-11

============================================================================

Sato-san,

Here is my personal response to the questions you raised concerning N2095. I am no longer a Unicode officer or even Apple representative to Unicode, so this merely reflects the personal opinion of a participant in the original CJK-JRG and IRG.

Lee

============================================================================

Japan JSC2, JCS is requesting the addition of new compatibility characters to 10646 / Unicode. These are called "compatibility" characters because they are variants of other characters already in Unicode, which a strict interpretation of the character-glyph model would not permit to be encoded as distinct characters. Japan is also asking the UTC to consider, in light of its own request and the requests of other Asian countries, whether the Han unification model (originally developed by Japan) should be modified or even abandoned altogether.

Background. In the first version of Unicode, we allowed certain characters to be exempted from the requirements of the Unicode character-glyph model for the specific purpose of enabling a loss-less mapping between Unicode and standards that pre-dated Unicode. These exceptions were allowed in many scripts, including Roman ligatures, Arabic contextual forms, and variant CJK characters used in East Asia. The reasoning was that a great deal of data existed in older encodings that did not have a clear character-glyph model, and one-to-one mapping would facilitate migration of those older standards to Unicode.

The request to have these 56 particular characters encoded in 10646 / Unicode is not new. Because JIS had a well-defined character-glyph model from the beginning, most variants, including the 56 characters proposed in N2095, were not encoded in the JIS standards. This is why these characters were not in the original source sets from which candidates for the Unified Repertoire and Ordering (URO, the original set of unified ideographic characters for 10646) were initially drawn. However, at the first CJK-JRG meeting in Tokyo, July 1991, Unicode proposed that all but two of these characters (FA3C, a variant of U+5C6E, and FA67, which appears identical to U+9B2C) be considered among the source characters. Unicode's argument was that these were legally recognized old-style kanji permitted for use in people's names, cited in the Japanese Cabinet pronouncement "Zhouyou kanzi hyou gendai kanazukai fuku zinmei-you kanzi" (Ministry of Finance, Tokyo, 1987). While these characters were not encoded in JIS, they did make up a well-defined set of variant kanji in common use. Also, most of the common-use kanji new/old distinctions listed in that document were already reflected as distinct characters in the URO, owing to application of the source-separation rule to other source sets, CNS in particular. The addition of this small set of name kanji would have merely completed the set of old and new forms required for everyday use in Japan. Japan rejected these characters in 1991. Now it appears that Japan has changed its mind and encoded them in a JIS standard. Unicode was willing to grandfather them into the URO in 1991, but in 1999 the cost of adding them is much greater, especially if it results in re-mapping of source sets to Unicode.
Considering the cost, Japan or any organization that proposes the addition of such "compatibility characters" should provide more justification than is given in N2095. Here are some of the questions that need to be answered.

The legal status of these characters does not appear to have changed since they were first proposed by Unicode and rejected by Japan in 1991. What has changed so that they now need to be encoded? Why has Japan abandoned its own strict encoding principles and encoded these variants in JIS X0213? Market pressure is not convincing, since the same pressure existed in 1991.

Regarding the argument based on round-trip mapping, the original goal was forward migration of old standards to Unicode. It was never intended that the encoding of such "compatibility characters" should apply to standards created after Unicode 1.0. Unfortunately, the round-trip rule has been much abused since then. There have even been cases of national standards being created solely for the purpose of making the "compatibility" argument to get characters added that should otherwise not have been encoded.

Layout technologies have improved since these characters were first considered. In 1991, layout technologies that can handle variant characters, such as QuickDraw GX, were only on the verge of commercialization and were limited to one platform. Now all platforms have layout technologies (Apple's ATSUI, Windows' OpenType, Java 2D) that eliminate the need to encode variants as separate characters. At the least, we need to understand why handling these variants at the layout level would not work, and why it would not be better to spend the effort on establishing a standard for interchanging information that describes glyph variants across platforms.

Whether the unification principles should be modified or even abandoned altogether is a big question that the UTC alone cannot address. Any discussion should include the original proponents of the current model (especially Mr. Akira Miyazawa of NACSIS) and current members of the IRG. The compatibility characters in the original URO already made application of the current unification model barely comprehensible without reference to the source-separation rule. It is not clear that 56 more compatibility characters would make things any worse in this respect. The addition of thousands more might cause us to reconsider the model, but first we need to hear convincing arguments for allowing thousands of compatibility characters in the first place.

What has been the experience of actually using a unified set of Han characters in various applications, environments, and locales? Are there situations where the current Han unification model has actually caused data loss, produced incomprehensible output, or created other problems? These problems should be looked into, and the UTC should determine whether current layout technologies and data-interchange standards are actually sufficient to address them. If we do not have sufficient mechanisms for handling character variants, we should consider modifying the model to allow looser encoding principles.

My personal opinion is that current technologies are more than sufficient to handle variant characters. However, we do need to address the issue of how to interchange information about variants across platforms, in web documents, etc. This can best be done by following through with proposals already made by various members and by working with some of the relevant standards: SVG, PDF, etc.

Lee Collins