L2/99-322 Comments on JCS compatibility characters in L2/99-310 through L2/99-313 by Lee Collins. 1999-10-11

============================================================================

Sato-san,

Here is my personal response to the questions you raised concerning N2095. I am no longer a Unicode officer or even Apple representative to Unicode, so this merely reflects the personal opinion of a participant in the original CJK-JRG and IRG.

Lee

============================================================================

Japan JSC2, JCS is requesting the addition of new compatibility characters to 10646 / Unicode. These are called "compatibility" characters because they are variants of other characters already in Unicode, which a strict interpretation of the character-glyph model would not permit to be encoded as distinct characters. Japan is also asking the UTC to consider, in light of its own request and the requests of other Asian countries, whether the Han unification model (originally developed by Japan) should be modified or even abandoned altogether.

Background. In the first version of Unicode, we allowed certain characters to be exempted from the requirements of the Unicode character-glyph model for the specific purpose of enabling a loss-less mapping between Unicode and standards that pre-dated Unicode. These exceptions were allowed in many scripts, including Roman ligatures, Arabic contextual forms, and variant CJK characters used in East Asia. The reasoning was that a great deal of data existed in older encodings that did not have a clear character-glyph model, and one-to-one mapping would facilitate migration of those older standards to Unicode.

The request to have these 56 particular characters encoded in 10646 / Unicode is not new. Because JIS had a well-defined character-glyph model from the beginning, most variants, including the 56 characters proposed in N2095, were not encoded in the JIS standards. This is why these characters were not in the original source sets from which candidates for the Unified Repertoire and Ordering (URO, the original set of unified ideographic characters for 10646) were initially drawn. However, at the first CJK-JRG meeting in Tokyo, July 1991, Unicode proposed that all but two of these characters (FA3C, a variant of U+5C6E, and FA67, which appears identical to U+9B2C) be considered among the source characters. Unicode's argument was that these were legally recognized old-style kanji permitted for use in people's names, cited in the Japanese Cabinet pronouncement "Zhouyou kanzi hyou gendai kanazukai fuku zinmei-you kanzi" (Ministry of Finance, Tokyo, 1987). While these characters were not encoded in JIS, they did make up a well-defined set of variant kanji in common use. Also, most of the common-use kanji new/old distinctions listed in that document were already reflected as distinct characters in the URO, owing to application of the source-separation rule to other source sets, CNS in particular. The addition of this small set of name kanji would have merely completed the set of old and new forms required for everyday use in Japan. Japan rejected these characters in 1991. Now it appears that Japan has changed its mind and encoded them in a JIS standard. Unicode was willing to grandfather them into the URO in 1991, but in 1999 the cost of adding them is much greater, especially if it results in re-mapping of source sets to Unicode.
Considering the cost, Japan or any organization that proposes the addition of such "compatibility characters" should provide more justification than is given in N2095. Here are some of the questions that need to be answered.

The legal status of these characters does not appear to have changed since they were first proposed by Unicode and rejected by Japan in 1991. What has changed so that they now need to be encoded? Why has Japan abandoned its own strict encoding principles and encoded these variants in JIS X0213? Market pressure is not convincing, since the same pressure existed in 1991.

Regarding the argument based on round-trip mapping, the original goal was forward migration of old standards to Unicode. It was never intended that the encoding of such "compatibility characters" should apply to standards created after Unicode 1.0. Unfortunately, the round-trip rule has been much abused since then. There have even been cases of national standards being created solely for the purpose of making the "compatibility" argument to get characters added that should otherwise not have been encoded.

Layout technologies have improved since these characters were first considered. In 1991, layout technologies that can handle variant characters, such as QuickDraw GX, were only on the verge of commercialization and were limited to one platform. Now all platforms have layout technologies (Apple's ATSUI, Windows' OpenType, Java 2D) that eliminate the need to encode variants as separate characters. At the least, we need to understand why handling these variants at the layout level would not work, and why it would not be better to spend the effort on establishing a standard for interchanging information that describes glyph variants across platforms.

Whether the unification principles should be modified or even abandoned altogether is a big question that the UTC alone cannot address. Any discussion should include the original proponents of the current model (especially Mr. Akira Miyazawa of NACSIS) and current members of the IRG. The compatibility characters in the original URO already made application of the current unification model barely comprehensible without reference to the source-separation rule. It is not clear that 56 more compatibility characters would make things any worse in this respect. The addition of thousands more might cause us to reconsider the model, but first we need to hear convincing arguments for allowing thousands of compatibility characters in the first place.

What has been the experience of actually using a unified set of Han characters in various applications, environments, and locales? Are there situations where the current Han unification model has actually caused data loss, produced incomprehensible output, or created other problems? These problems should be looked into, and the UTC should determine whether current layout technologies and data-interchange standards are actually sufficient to address them. If we do not have sufficient mechanisms for handling character variants, we should consider modifying the model to allow looser encoding principles.

My personal opinion is that current technologies are more than sufficient to handle variant characters. However, we do need to address the issue of how to interchange information about variants across platforms, in web documents, etc. This can best be done by following through with proposals already made by various members and by working with some of the relevant standards: SVG, PDF, etc.

Lee Collins