L2/07-020

Date: Mon, 15 Jan 2007
Source: Kenneth Whistler
Title: Feedback Re Proposal to disunify U+4039, L2/07-010

In L2/07-010, Andrew West and John Jenkins propose to disunify one of
the CJK unified ideographs (in Extension A).

I think the lexical evidence they have provided is convincing. It
appears clear that a unification error has been made in this case, and
that two distinct characters (with very closely similar shapes) have
been mistakenly unified. I can confirm this from my own checking in
Cihai, which confirms what John and Andrew cite from Ciyuan, Kangxi
Zidian, and Hanyu Dacidian.

However, I am concerned that while the conclusion that there is an
error in unification is correct, the proposed solution in the document
as stated is not actionable. There are a couple of problems.

First is the canonical equivalence they cite for U+FAD4 to U+4039.
That equivalence is, of course, immutable, but the *identity* of both
of the characters is now in question, and it isn't obvious which way
the interpretation should go. In fact, *either* solution is possible,
as sketched out below.

The issue is further complicated by the fact that there is a *second*
compatibility CJK character, U+2F949, which is also canonically
equivalent to U+4039. The proposers apparently overlooked that one and
didn't take it into account.

To avoid trying to actually use any characters here, I'll introduce
some terms.

Etymon-1 = eyelashes [jie2, jia2, zha3]
Etymon-2 = blink; twinkle [shan3]

Glyph-1 = eye-radical + jia1 phonetic
Glyph-2 = eye-radical + shan3 phonetic

CURRENT

U+4039 = (Etymon-1 + Etymon-2) unified; { Glyph-1 }
U+FAD4 ==> U+4039; { Glyph-2 }
U+2F949 ==> U+4039; { Glyph-1 }

(In other words, the two lexical etyma were unified, but we show
Glyph-1 for U+4039, show Glyph-2 for U+FAD4, which is canonically
equivalent to U+4039, and show Glyph-1 for U+2F949, which is also
canonically equivalent to U+4039.)
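
The two canonical equivalences in the CURRENT picture can be checked
against the character database directly. What follows is only a
minimal sketch, using Python's standard unicodedata module. The helper
name canonical_target is made up for this note, and the check assumes
a Python build whose Unicode data includes U+FAD4 (Unicode 4.1 or
later).

```python
import unicodedata

UNIFIED = "\u4039"
COMPATIBILITY = ["\uFAD4", "\U0002F949"]


def canonical_target(ch):
    """Return what `ch` becomes under canonical normalization (NFC).

    CJK compatibility ideographs have singleton canonical
    decompositions, so NFC replaces them with the unified ideograph.
    (Hypothetical helper, named for this illustration only.)
    """
    return unicodedata.normalize("NFC", ch)


for ch in COMPATIBILITY:
    target = canonical_target(ch)
    # Prints "U+FAD4 ==> U+4039" and then "U+2F949 ==> U+4039".
    print(f"U+{ord(ch):04X} ==> U+{ord(target):04X}")
```

Because the equivalences are immutable, any normalizing process will
keep folding both compatibility characters into U+4039, whichever
interpretation U+4039 ends up with.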
WEST/JENKINS SOLUTION

U+4039 = (Etymon-1); { Glyph-1 }
U+FAD4 ==> U+4039; { Glyph-2 }
U+XXXX = (Etymon-2); { Glyph-2 }
U+2F949 ==> U+4039; { Glyph-1 }

ALTERNATIVE SOLUTION

U+4039 = (Etymon-2); { Glyph-2 }
U+FAD4 ==> U+4039; { Glyph-2 }
U+XXXX = (Etymon-1); { Glyph-1 }
U+2F949 ==> U+4039; { Glyph-1 }

The advantage of the West/Jenkins solution is that it leaves U+4039
with the interpretation of the more common etymon and with what would
be the glyph more likely to show up in fonts. The disadvantage is that
it would leave U+FAD4 canonically equivalent to U+4039 (which it
must), but with the Glyph-2 (which was the reason for encoding U+FAD4
in the first place) the same as the *new* character, which it wouldn't
be equivalent to.

The alternative solution simply moves the confusion to a different
place. The disadvantage is that it changes the glyph for U+4039. The
advantage is that U+FAD4 then makes sense. But the confusion would
then be for U+2F949, which would have a glyph different from the
character it is canonically equivalent to, but the same as the new
character U+XXXX, which it wouldn't be canonically equivalent to.

Given the tradeoff, I think West/Jenkins are probably correct about
this, but I think the pros and cons of the solution need to be argued
more completely in the proposal document before the UTC could decide
this, as deciding to do a disunification like this, particularly
involving a compatibility CJK ideograph in the mix as well, sets
important precedents. Furthermore, this whole thing would then have to
be argued through the IRG and WG2 as well, for the proposal to fly.

The tragedy here, of course, is that the correct answer is that the
standard should have *2* encoded characters.
But with the mistaken unification and the resulting mistaken addition
of two more compatibility characters, we are in a situation where if
we then disunify the *unified* ideographs, we are going to end up with
*4* encoded characters in the standard for what is actually only 2
characters with 2 glyphs. And furthermore, because of the immutable
canonical equivalences, there is no way the set of 4 will *ever* be
completely consistent. Sad, indeed.

In fact, this is, to my mind, a severe enough problem that I think it
might merit following up any disunification proposal with a formal
*deprecation* of U+FAD4 and U+2F949. This is because their use would
no longer be necessary to make either the lexical or the glyph
distinction for the two disunified characters, and because any use of
them for mapping into other standards could only produce bogus results
somewhere, and eventually data corruption.

The *second* problem is somewhat related to the first. The proposal is
completely quiet about what would have to be done to the database
mappings. Since a disunification like this involves, among other
things, remapping normative IRG source mappings, this is a Big
Deal(tm).

U+4039 originally got into Extension-A as a unification of the
following two sources:

U+4039  kIRG_GSource  3-5952  (shown with Glyph-1)
U+4039  kIRG_TSource  4-3946  (shown with Glyph-2)

The unification was apparently done on the assumption that Glyph-1 and
Glyph-2 were not distinct, and that this was simply another instance
of simplified versus traditional font design differences. Apparently
that was wrong.

What we are left with, however, is the following Unihan legacy, after
all these years:

U+4039  kCangjie             BUKOO
U+4039  kCantonese           gap6 sip3 zip3
U+4039  kCheungBauer         109/07;BUKOO;gap6,sip3
U+4039  kCheungBauerIndex    443.11 444.01
U+4039  kCihaiT              951.404
U+4039  kDefinition          (same as U+7728) to wink; (same as U+776B) eyelashes, having one eye smller than the other, joke; witticism; pleasantry; jest; fun; (Cant.) to peep at; to blink, wink
U+4039  kFennIndex           57.01
U+4039  kHKSCS               98E6
U+4039  kHanYu               42490.020 42490.030
U+4039  kIRGHanyuDaZidian    42490.020
U+4039  kIRGKangXi           0809.030
U+4039  kIRG_GSource         3-5952
U+4039  kIRG_HSource         98E6
U+4039  kIRG_JSource         4-7222
U+4039  kIRG_KPSource        KP1-5E34
U+4039  kIRG_TSource         4-3946
U+4039  kJIS0213             2,82,02
U+4039  kKPS1                5E34
U+4039  kMandarin            JIA2 SHE4
U+4039  kMatthews            790
U+4039  kPhonetic            550
U+4039  kRSAdobe_Japan1_6    C+18191+109.5.7
U+4039  kRSUnicode           109.7
U+4039  kSBGY                538.34
U+4039  kSemanticVariant     U+776B