First of all, I do not intend to argue about the unification rules
themselves. My concern is whether the Unicode book's *description* of Han
Unification is appropriate.
At 11:40 AM -0800 00.5.1, Kenneth Whistler wrote:
> For those following along, the six characters of Figure 10-4, page 264,
> are encoded, respectively, at:
> 5263 528D 528E 5292 5294 91F0
> The first five all have the "knife" radical (cf. U+2F11) and are
> pronounced jian4, meaning "sword, saber". The sixth one has the
> "gold/metal" radical (cf. U+2FA6), and according to the only source
> I have that lists it, is pronounced ri4, meaning "blunt". It is
> conceivable, however, that in the larger dictionaries used by the IRG
> U+91F0 is also listed as a (mistaken) variant of U+528D.
Generally speaking, it is natural for dictionaries to differ from one
another in how they describe ITAIJI (Kanji variants). I don't think it is
a mistake.
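For readers without the code charts at hand, the code points quoted above
can be inspected with a minimal Python sketch. It uses only the standard
unicodedata module; the names FIGURE_10_4, RADICALS, describe and
radical_to_ideograph are my own illustrative choices. Note that it shows
only character names and compatibility mappings. Radical and stroke-count
data (the Unihan kRSUnicode property) is not available in unicodedata and
is not covered here.

```python
import unicodedata

# Code points quoted above for Figure 10-4 (hex values from the message).
FIGURE_10_4 = [0x5263, 0x528D, 0x528E, 0x5292, 0x5294, 0x91F0]

# Kangxi radical code points cited above as "cf." references.
RADICALS = [0x2F11, 0x2FA6]


def describe(cp):
    """Return 'U+XXXX <char> <character name>' for one code point."""
    ch = chr(cp)
    return f"U+{cp:04X} {ch} {unicodedata.name(ch)}"


def radical_to_ideograph(cp):
    """Map a Kangxi radical to its unified ideograph via NFKC."""
    return ord(unicodedata.normalize("NFKC", chr(cp)))


if __name__ == "__main__":
    for cp in FIGURE_10_4 + RADICALS:
        print(describe(cp))
    for cp in RADICALS:
        print(f"U+{cp:04X} -> U+{radical_to_ideograph(cp):04X}")
```

The Kangxi radical characters carry compatibility mappings to ordinary
unified ideographs, which is why NFKC is enough for the second function.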
> Most importantly, the knife
> *radical* as a component of characters quite commonly shows variation
> between four forms (at least), as illustrated by the right-hand
> side (the radical portion) of 528D, 528E, 5292, and 5294.
No. Neither U+5203 nor U+5204 is a radical; each is a component which
consists of radical U+5200 plus one more stroke. This is similar to U+5929,
which is not a radical but a component which consists of radical U+5927
plus one more stroke. U+5203 is not an ITAIJI of U+5204, nor vice versa.
> So, while
> as independent ideographs, 5202 and 5204 *must* be distinguished, in
> most instances for use as radicals for other ideographs, the variation
> in form (and stroke count) is *not* taken as sufficient evidence of
> distinctness of characters. This is also easy to verify with
> dictionaries, even for the characters in question (528D, 528E,
> 5292, and 5294). Ci4hai3, for example, lists 5292 explicitly as
> a variant glyph for 528D, which is used for the head entry.
I would rather argue along the lines of the IRG's principles. In "R.3.
Source code separation examples" of Draft 2 for ISO/IEC 10646-1, I found
only U+5292 vs. U+5294 with respect to this issue.
And Extension B (IRG_N686 version) contains both 2-08F5 (which consists of
Morohashi No. 8700 and U+528D) and 2-08F6 (which consists of Morohashi No.
8700 and U+528E).
> I concur that 91F0 is problematical in this list, since even
> by the cognate rule and the radical identity rule it should be
> distinguished. Furthermore, it is arguably in a variant pair
> relationship with 91FC, which is not shown in this list.
> On the other hand, 5251 clearly *should* be added to this list,
> since it is the GB simplified form of the same jian4 "sword".
> So I would suggest emendation of the exemplary list in Figure 10-4 to:
> 5251 5263 528D 528E 5292 5294
And in "R.1.4.3 Different structure of a corresponding component" of
Draft 2 for ISO/IEC 10646-1, the left-hand side of U+5263 vs. the left-hand
side of U+528D is shown. So I think that Figure 10-4 should be:
> Where 5251 is the GB simplified form (most commonly seen in
> dictionaries in the PRC); 5263 is the traditional Japanese
> simplified form; 528D is the traditional Chinese form
> (most commonly seen in dictionaries in Taiwan and Hong Kong);
> and 528E, 5292, and 5294 are glyphic variants of 528D.
> The source separation rules required distinguishing all 6 of
> these, even though a principled unification which did not have
> to live with legacy encodings surely would have unified
> (528D 528E 5292 5294) into a single character.
No. The source separation rules may be irrelevant to these characters,
except for U+5292 vs. U+5294.
>> > But please note that if two characters A and B differ
>> > only in components C and D, and C and D are considered
>> > non-cognate or different in abstract shape, this doesn't
>> > automatically mean that A and B are considered to be
>> > different in abstract shape.
>> > There are quite some examples where a difference in a simple
>> > character is important, but if that appears as a component,
>> > the difference becomes less relevant. The most famous case
>> > (usually explained as non-cognate, not as a difference
>> > in abstract shape) is U+571F vs. U+58EB.
>> We might think that U+571F and U+58EB have the same abstract shape
>> (since they have quite similar shape), as you pointed out. On the
>> contrary, U+5202 and U+5204 are not only non-cognate but also quite
>> different in their shape. So, what you've mentioned is questionable for
>> me. I think if components C and D are considered non-cognate AND different
>> in abstract shape then Kanji A and B might be automatically considered to
>> be different in abstract shape.
> Martin is correct about this. As noted above, the difference
> between 5202 dao1 and 5204 ren4 is significant for the independent
> ideographs, but is neutralized when these forms appear as the
> radical of other characters.
> So I think the assessment should be that A and B under these circumstances
> would be considered for distinction, but definitely not be automatically
> separated in the encoding. That would depend on detailed determination
> of how the traditional dictionaries and other sources treat the
> variation in question.
I don't know of such a case. Is there an example that meets all of the
following conditions?
o The shape of component C is not at all similar to the shape of
  component D. (The shape of U+571F is quite similar to that of U+58EB.)
o AND components C and D are not cognate.
o But character A (containing component C) and character B (containing
  component D) are to be unified.
> Note also that all of these determinations have *already* been made
> and standardized for the URO (4E00..9FA5) and Vertical Extension A (3400..4DB5),
> and have also been completed by the IRG (and are undergoing the second round of
> ballotting) for Vertical Extension B for Plane 2. So while it is possible to
> argue that the IRG made a mistake here or there on individual characters,
> still as for all other Asian character encoding standards, including those
> published by JIS, we live with the resulting decisions about unification
> or disunification in particular instances and get on with the implementations.
I agree with you on this point.
>> Further, I think that the meaning of "the same abstract shape" is very
>> ambiguous and arbitrary. For example, I can't understand the reason why
>> U+6649 and U+664B are treated as the components that have the same
>> abstract shape, while U+5939 and U+593E are treated as the components that
>> are different in abstract shape in The Unicode Standard.
> Without citations of the character using these as components, it is
> difficult to provide an argument in detail for these 4. The difference
> in treatment may result either from differences in traditional
> lexical treatment of character variants in classical dictionaries,
> or it may be an artifact of source separation in the URO.
The standard IRG dictionaries will be invoked to determine whether
characters A and B are cognate or not. But my understanding is that the
unification rules (other than the non-cognate rule) may not depend on
dictionaries.
-- NAOI Yasushi Glamour Profession, Inc.