L2/07-020

Date: Mon, 15 Jan 2007
Source: Kenneth Whistler
Title: Feedback Re Proposal to disunify U+4039, L2/07-010

In L2/07-010, Andrew West and John Jenkins propose to disunify
one of the CJK unified ideographs (in Extension A).

I think the lexical evidence they have provided is convincing.
It appears clear that a unification error has been made in
this case, and that two distinct characters (with very
closely similar shapes) have been mistakenly unified.

I can confirm this from my own checking in Cihai, which agrees with
what John and Andrew cite from Ciyuan, Kangxi Zidian, and
Hanyu Dacidian.

However, I am concerned that while the conclusion that there
is an error in unification is correct, the proposed solution
in the document as stated is not actionable.

There are a couple of problems. The first is the canonical
equivalence they cite for U+FAD4 to U+4039. That equivalence
is, of course, immutable, but the *identity* of both of
the characters is now in question, and it isn't obvious
which way the interpretation should go. In fact, *either*
solution is possible, as sketched out below. The issue
is further complicated by the fact that there is a *second*
compatibility CJK character, U+2F949, which is also
canonically equivalent to U+4039. The proposers apparently
overlooked that one and didn't take it into account.
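
The two canonical equivalences can be checked against the Unicode
Character Database. The following is a minimal sketch, assuming a
Python 3 interpreter whose bundled unicodedata tables include both
compatibility ideographs; the helper name canonical_target is
illustrative only. Escapes are used so that no actual characters
appear in the text.

```python
import unicodedata

# The two compatibility ideographs discussed above, written as escapes.
COMPAT = ["\uFAD4", "\U0002F949"]

def canonical_target(ch):
    """Return the canonical decomposition (NFD) of ch as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

# Map each compatibility ideograph to what it normalizes to.
targets = {f"U+{ord(ch):04X}": canonical_target(ch) for ch in COMPAT}

for cp, target in targets.items():
    print(cp, "==>", " ".join(target))
```

Because normalization stability is guaranteed, these results cannot
change whichever disunification is chosen.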

To avoid trying to actually use any characters here, I'll
introduce some terms.

Etymon-1 = eyelashes [jie2, jia2, zha3]
Etymon-2 = blink; twinkle [shan3]

Glyph-1 = eye-radical + jia1 phonetic
Glyph-2 = eye-radical + shan3 phonetic

CURRENT

U+4039  = (Etymon-1 + Etymon-2) unified; { Glyph-1 }
U+FAD4  ==> U+4039; { Glyph-2 }
U+2F949 ==> U+4039; { Glyph-1 }

(In other words, the two lexical etyma were unified, but we
show Glyph-1 for U+4039, show Glyph-2 for U+FAD4, which
is canonically equivalent to U+4039, and show Glyph-1 for
U+2F949, which is also canonically equivalent to U+4039.)

WEST/JENKINS SOLUTION

U+4039  = (Etymon-1); { Glyph-1 }
U+FAD4  ==> U+4039;   { Glyph-2 }
U+XXXX  = (Etymon-2); { Glyph-2 }
U+2F949 ==> U+4039;   { Glyph-1 }

ALTERNATIVE SOLUTION

U+4039  = (Etymon-2); { Glyph-2 }
U+FAD4  ==> U+4039;   { Glyph-2 }
U+XXXX  = (Etymon-1); { Glyph-1 }
U+2F949 ==> U+4039;   { Glyph-1 }

The advantage of the West/Jenkins solution is that it leaves
U+4039 with the interpretation of the more common etymon and
with what would be the glyph more likely to show up in fonts.
The disadvantage is that it would leave U+FAD4 canonically
equivalent to U+4039 (as it must be), while still showing Glyph-2
(which was the reason for encoding U+FAD4 in the first place),
the same glyph as the *new* character, to which it would not be
equivalent.

The alternative solution simply moves the confusion to
a different place. The disadvantage is that it changes
the glyph for U+4039. The advantage is that U+FAD4 then
makes sense. But the confusion would then be for U+2F949,
which would have a glyph different from the character it
is canonically equivalent to, but the same as the new
character U+XXXX, which it wouldn't be canonically
equivalent to.
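
The tradeoff between the two solutions can be checked mechanically.
The following is a rough sketch, not part of the proposal: it encodes
the three tables above as plain dictionaries (the names CANONICAL,
GLYPHS and mismatches are made up for illustration, and U+XXXX stands
for the not-yet-assigned new code point) and reports each compatibility
character whose glyph differs from that of its canonical target.

```python
# Canonical equivalences are immutable, so they are shared by all scenarios.
CANONICAL = {"U+FAD4": "U+4039", "U+2F949": "U+4039"}

# Glyph shown for each code point under each scenario, as tabulated above.
GLYPHS = {
    "current": {
        "U+4039": "Glyph-1", "U+FAD4": "Glyph-2", "U+2F949": "Glyph-1",
    },
    "west_jenkins": {
        "U+4039": "Glyph-1", "U+XXXX": "Glyph-2",
        "U+FAD4": "Glyph-2", "U+2F949": "Glyph-1",
    },
    "alternative": {
        "U+4039": "Glyph-2", "U+XXXX": "Glyph-1",
        "U+FAD4": "Glyph-2", "U+2F949": "Glyph-1",
    },
}

def mismatches(glyphs):
    """List (compat, target, lookalikes) where compat's glyph != target's glyph.

    lookalikes are the unified code points that share the compat glyph
    but are NOT canonically equivalent to the compatibility character.
    """
    found = []
    for compat, target in CANONICAL.items():
        if glyphs[compat] != glyphs[target]:
            lookalikes = sorted(
                cp for cp, glyph in glyphs.items()
                if glyph == glyphs[compat] and cp not in CANONICAL
            )
            found.append((compat, target, lookalikes))
    return found

for name, glyphs in GLYPHS.items():
    print(name, mismatches(glyphs))
```

Under these assumptions each solution leaves exactly one inconsistent
compatibility character: U+FAD4 for West/Jenkins, U+2F949 for the
alternative.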

Given the tradeoff, I think West/Jenkins are probably
correct about this, but I think the pros and cons
of the solution need to be argued more completely in the
proposal document before the UTC could decide this. Deciding
to do a disunification like this, particularly with a
compatibility CJK ideograph in the mix as well, sets
important precedents. Furthermore, this whole thing would
then have to be argued through the IRG and WG2 as well,
for the proposal to fly.

The tragedy here, of course, is that the correct
answer is that the standard should have *2* encoded
characters. But with the mistaken unification and
the resulting mistaken addition of two more compatibility
characters, we are in a situation where if we then
disunify the *unified* ideographs, we are going to
end up with *4* encoded characters in the standard for
what is actually only 2 characters with 2 glyphs.
And furthermore, because of the immutable canonical
equivalences, there is no way the set of 4 will *ever*
be completely consistent. Sad, indeed.

In fact, this is, to my mind, a severe enough problem
that I think it might merit following up any disunification
proposal with a formal *deprecation* of U+FAD4 and
U+2F949. This is because their use would no longer be
necessary to make either the lexical or the glyph distinction
for the two disunified characters, and because any use
of them for mapping into other standards could only produce
bogus results somewhere, and eventually data corruption.

The *second* problem is somewhat related to the first.
The proposal is completely quiet about what would have to
be done to the database mappings. Since a disunification like
this involves, among other things, remapping normative
IRG source mappings, this is a Big Deal(tm).

U+4039 originally got into Extension-A as a unification of
the following two sources:

 U+4039  kIRG_GSource    3-5952 (shown with Glyph-1)
 U+4039  kIRG_TSource    4-3946 (shown with Glyph-2)
 
The unification was apparently done on the assumption that Glyph-1
and Glyph-2 were not distinct, and that this was simply
another instance of simplified versus traditional font
design differences. Apparently that was wrong.

What we are left with, however, is the following Unihan
legacy, after all these years:

U+4039  kCangjie        BUKOO
U+4039  kCantonese      gap6 sip3 zip3
U+4039  kCheungBauer    109/07;BUKOO;gap6,sip3
U+4039  kCheungBauerIndex       443.11 444.01
U+4039  kCihaiT 951.404
U+4039  kDefinition     (same as U+7728) to wink; (same as U+776B) eyelashes, 
having one eye smaller than the other, joke; witticism; pleasantry; jest; fun; 
(Cant.) to peep at; to blink, wink
U+4039  kFennIndex      57.01
U+4039  kHKSCS  98E6
U+4039  kHanYu  42490.020 42490.030
U+4039  kIRGHanyuDaZidian       42490.020
U+4039  kIRGKangXi      0809.030
U+4039  kIRG_GSource    3-5952
U+4039  kIRG_HSource    98E6
U+4039  kIRG_JSource    4-7222
U+4039  kIRG_KPSource   KP1-5E34
U+4039  kIRG_TSource    4-3946
U+4039  kJIS0213        2,82,02
U+4039  kKPS1   5E34
U+4039  kMandarin       JIA2 SHE4
U+4039  kMatthews       790
U+4039  kPhonetic       550
U+4039  kRSAdobe_Japan1_6       C+18191+109.5.7
U+4039  kRSUnicode      109.7
U+4039  kSBGY   538.34
U+4039  kSemanticVariant        U+776B<kMatthews
U+4039  kTotalStrokes   12
U+776B  kSemanticVariant        U+4039<kMatthews
U+FAD4  kCompatibilityVariant   U+4039
U+2F949 kCompatibilityVariant   U+4039

At this point, *NO* proposal to disunify a unified CJK ideograph
is well-formed, in my opinion, unless it *explicitly* deals
with all relevant fields in the Unihan database, and provides
a detailed specification of what the after picture will look
like. In other words, not only must the exact details be
specified for what each of the normative IRG source fields
will end up pointing to: U+4039 or U+XXXX, but all of the
rest of the fields must be appropriately updated, as well,
including any relevant splitting of definition and pronunciation
fields, and updating of the dictionary references to incorporate
the correct entry. Note for example:

U+4039  kHanYu  42490.020 42490.030

The kHanYu field is pointing to *two* entries in the Hanyu
Da Zidian dictionary. If those two entries are for the two
*different* characters in question, then that field needs to
be sorted out appropriately. And so on for *every* field.
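
As a starting point for such an audit, one could list which fields
of U+4039 need an explicit decision. The following is only a sketch
under stated assumptions: the sample records are copied from the
listing above, the helpers parse and audit are hypothetical, and
"more than one whitespace-separated value" is a crude heuristic for
"may need splitting" (it would not work for free-text fields such as
kDefinition, which need review by hand).

```python
from collections import defaultdict

# A few records copied from the listing above (code point, field, value).
RECORDS = """\
U+4039  kHanYu  42490.020 42490.030
U+4039  kIRG_GSource    3-5952
U+4039  kIRG_TSource    4-3946
U+4039  kMandarin       JIA2 SHE4
U+4039  kTotalStrokes   12
"""

def parse(text):
    """Parse Unihan-style lines into {code_point: {field: value}}."""
    fields = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split(None, 2)
        fields[code_point][field] = value.strip()
    return fields

def audit(entry):
    """Return (source fields to reassign, multi-valued fields to split)."""
    # Every normative IRG source must end up on exactly one code point.
    must_assign = sorted(
        f for f in entry if f.startswith("kIRG_") and f.endswith("Source")
    )
    # Heuristic: several values may belong to different characters.
    multi_valued = sorted(f for f, v in entry.items() if len(v.split()) > 1)
    return must_assign, multi_valued

must_assign, multi_valued = audit(parse(RECORDS)["U+4039"])
print("assign to U+4039 or U+XXXX:", must_assign)
print("check whether to split:", multi_valued)
```

Single-valued fields such as kTotalStrokes are not flagged by this
heuristic, but they would still need to be checked against whichever
glyph U+4039 ends up with.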

--Ken