L2/07-020
Date: Mon, 15 Jan 2007
Source: Kenneth Whistler
Title: Feedback Re Proposal to disunify U+4039, L2/07-010
In L2/07-010, Andrew West and John Jenkins propose to disunify
one of the CJK unified ideographs (in Extension A).
I think the lexical evidence they have provided is convincing.
It appears clear that a unification error has been made in
this case, and that two distinct characters (with very
closely similar shapes) have been mistakenly unified.
I can confirm this from my own checking in Cihai, which agrees with
what John and Andrew cite from Ciyuan, Kangxi Zidian, and
Hanyu Dacidian.
However, I am concerned that while the conclusion that there
is an error in unification is correct, the proposed solution
in the document as stated is not actionable.
There are a couple of problems. First is the canonical
equivalence they cite for U+FAD4 to U+4039. That equivalence
is, of course, immutable, but the *identity* of both of
the characters is now in question, and it isn't obvious
which way the interpretation should go. In fact, *either*
solution is possible, sketched out as below. The issue
is further complicated by the fact that there is a *second*
compatibility CJK character, U+2F949, which is also
canonically equivalent to U+4039. The proposers apparently
overlooked that one and didn't take it into account.
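As a quick check of the two canonical equivalences mentioned above,
here is a minimal sketch using Python's standard unicodedata module.
The helper name canonical_target is hypothetical, and the result
depends on the Unicode data shipped with the runtime (U+FAD4 needs
Unicode 4.1 or later). Both compatibility characters should
decompose to U+4039.

```python
import unicodedata

# The two compatibility ideographs discussed in the memo.
COMPAT = ["\uFAD4", "\U0002F949"]

def canonical_target(ch):
    """Return the NFD form of ch as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(c) for c in unicodedata.normalize("NFD", ch)]

for ch in COMPAT:
    # Each line shows a compatibility character and what it normalizes to.
    print("U+%04X" % ord(ch), "==>", canonical_target(ch))
```

Because normalization replaces both characters with U+4039, any
distinction carried only by U+FAD4 or U+2F949 is lost as soon as
text is normalized.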
To avoid trying to actually use any characters here, I'll
introduce some terms.
Etymon-1 = eyelashes [jie2, jia2, zha3]
Etymon-2 = blink; twinkle [shan3]
Glyph-1 = eye-radical + jia1 phonetic
Glyph-2 = eye-radical + shan3 phonetic
CURRENT
U+4039 = (Etymon-1 + Etymon-2) unified; { Glyph-1 }
U+FAD4 ==> U+4039; { Glyph-2 }
U+2F949 ==> U+4039; { Glyph-1 }
(In other words, the two lexical etyma were unified, but we
show Glyph-1 for U+4039, show Glyph-2 for U+FAD4, which
is canonically equivalent to U+4039, and show Glyph-1 for
U+2F949, which is also canonically equivalent to U+4039.)
WEST/JENKINS SOLUTION
U+4039 = (Etymon-1); { Glyph-1 }
U+FAD4 ==> U+4039; { Glyph-2 }
U+XXXX = (Etymon-2); { Glyph-2 }
U+2F949 ==> U+4039; { Glyph-1 }
ALTERNATIVE SOLUTION
U+4039 = (Etymon-2); { Glyph-2 }
U+FAD4 ==> U+4039; { Glyph-2 }
U+XXXX = (Etymon-1); { Glyph-1 }
U+2F949 ==> U+4039; { Glyph-1 }
The advantage of the West/Jenkins solution is that it leaves
U+4039 with the interpretation of the more common etymon and
with what would be the glyph more likely to show up in fonts.
The disadvantage is that it would leave U+FAD4 canonically
equivalent to U+4039 (as it must be), while its glyph, Glyph-2
(the reason for encoding U+FAD4 in the first place), would be
the same as that of the *new* character, to which it would not
be equivalent.
The alternative solution simply moves the confusion to
a different place. The disadvantage is that it changes
the glyph for U+4039. The advantage is that U+FAD4 then
makes sense. But the confusion would then be for U+2F949,
which would have a glyph different from the character it
is canonically equivalent to, but the same as the new
character U+XXXX, which it wouldn't be canonically
equivalent to.
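The tradeoff between the three states can be sketched mechanically.
The following is an illustration only, not part of the proposal:
the table layout and the function names mismatches and lookalikes
are assumptions made for this sketch, and "U+XXXX" is the same
placeholder used above for the not-yet-assigned character.

```python
# Glyph shown for each unified ideograph, per scenario.
SCENARIOS = {
    "current": {"U+4039": "Glyph-1"},
    "west_jenkins": {"U+4039": "Glyph-1", "U+XXXX": "Glyph-2"},
    "alternative": {"U+4039": "Glyph-2", "U+XXXX": "Glyph-1"},
}

# Compatibility ideographs: (canonical target, own glyph).
# These are the same in every scenario, since the equivalences are immutable.
COMPAT = {
    "U+FAD4": ("U+4039", "Glyph-2"),
    "U+2F949": ("U+4039", "Glyph-1"),
}

def mismatches(unified):
    """Compat characters whose glyph differs from their canonical target's."""
    return sorted(c for c, (target, glyph) in COMPAT.items()
                  if unified[target] != glyph)

def lookalikes(unified):
    """(compat, unified) pairs that share a glyph but are not equivalent."""
    return sorted((c, u)
                  for c, (target, glyph) in COMPAT.items()
                  for u, g in unified.items()
                  if u != target and g == glyph)

for name, unified in SCENARIOS.items():
    print(name, mismatches(unified), lookalikes(unified))
```

Under these assumptions, every scenario leaves exactly one
compatibility character mismatched with U+4039: U+FAD4 in the
current state and in the West/Jenkins solution, U+2F949 in the
alternative. Either disunification also makes that character a
lookalike of U+XXXX.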
Given the tradeoff, I think West/Jenkins are probably
correct about this, but the pros and cons of the solution
need to be argued more completely in the proposal document
before the UTC could decide this. Deciding to do a
disunification like this, particularly with a compatibility
CJK ideograph in the mix as well, sets important precedents.
Furthermore, this whole thing would then have to be argued
through the IRG and WG2 as well, for the proposal to fly.
The tragedy here, of course, is that the correct
answer is that the standard should have *2* encoded
characters. But with the mistaken unification and
the resulting mistaken addition of two more compatibility
characters, we are in a situation where if we then
disunify the *unified* ideographs, we are going to
end up with *4* encoded characters in the standard for
what is actually only 2 characters with 2 glyphs.
And furthermore, because of the immutable canonical
equivalences, there is no way the set of 4 will *ever*
be completely consistent. Sad, indeed.
In fact, this is, to my mind, a severe enough problem
that I think it might merit following up any disunification
proposal with a formal *deprecation* of U+FAD4 and
U+2F949. This is because their use would no longer be
necessary to make either the lexical or the glyph distinction
for the two disunified characters, and because any use
of them for mapping into other standards could only produce
bogus results somewhere, and eventually data corruption.
The *second* problem is somewhat related to the first.
The proposal is completely silent about what would have to
be done to the database mappings. Since a disunification like
this involves, among other things, remapping normative
IRG source mappings, this is a Big Deal(tm).
U+4039 originally got into Extension-A as a unification of
the following two sources:
U+4039 kIRG_GSource 3-5952 (shown with Glyph-1)
U+4039 kIRG_TSource 4-3946 (shown with Glyph-2)
The unification was apparently done on the assumption that
Glyph-1 and Glyph-2 were not distinct, and that this was simply
another instance of simplified versus traditional font
design differences. That assumption was apparently wrong.
What we are left with, after all these years, is the following
Unihan legacy:
U+4039 kCangjie BUKOO
U+4039 kCantonese gap6 sip3 zip3
U+4039 kCheungBauer 109/07;BUKOO;gap6,sip3
U+4039 kCheungBauerIndex 443.11 444.01
U+4039 kCihaiT 951.404
U+4039 kDefinition (same as U+7728) to wink; (same as U+776B) eyelashes,
having one eye smaller than the other, joke; witticism; pleasantry; jest; fun;
(Cant.) to peep at; to blink, wink
U+4039 kFennIndex 57.01
U+4039 kHKSCS 98E6
U+4039 kHanYu 42490.020 42490.030
U+4039 kIRGHanyuDaZidian 42490.020
U+4039 kIRGKangXi 0809.030
U+4039 kIRG_GSource 3-5952
U+4039 kIRG_HSource 98E6
U+4039 kIRG_JSource 4-7222
U+4039 kIRG_KPSource KP1-5E34
U+4039 kIRG_TSource 4-3946
U+4039 kJIS0213 2,82,02
U+4039 kKPS1 5E34
U+4039 kMandarin JIA2 SHE4
U+4039 kMatthews 790
U+4039 kPhonetic 550
U+4039 kRSAdobe_Japan1_6 C+18191+109.5.7
U+4039 kRSUnicode 109.7
U+4039 kSBGY 538.34
U+4039 kSemanticVariant U+776B
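To give a sense of what reviewing these mappings involves, here is
a minimal sketch that parses records in the form quoted above and
picks out the IRG source fields. It assumes whitespace-separated
"code point, field, value" lines as they appear in this memo, one
record per line (the wrapped kDefinition record would need to be
rejoined first). The names SAMPLE, parse_records and irg_sources
are hypothetical.

```python
# A few of the records quoted above, one per line.
SAMPLE = """\
U+4039 kIRG_GSource 3-5952
U+4039 kIRG_HSource 98E6
U+4039 kIRG_JSource 4-7222
U+4039 kIRG_KPSource KP1-5E34
U+4039 kIRG_TSource 4-3946
U+4039 kMandarin JIA2 SHE4
U+4039 kSemanticVariant U+776B
"""

def parse_records(text):
    """Map each code point to a dict of {field: value}."""
    records = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split into at most three parts so multi-word values stay intact.
        code_point, field, value = line.split(None, 2)
        records.setdefault(code_point, {})[field] = value
    return records

def irg_sources(fields):
    """Keep only the kIRG_*Source fields."""
    return {k: v for k, v in fields.items()
            if k.startswith("kIRG_") and k.endswith("Source")}

records = parse_records(SAMPLE)
for field, value in sorted(irg_sources(records["U+4039"]).items()):
    print(field, value)
```

Each source field listed this way would have to be assigned to
either U+4039 or the new character, and the remaining fields
(readings, definitions, dictionary indices) reviewed one by one.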