From: Uriah Eisenstein (firstname.lastname@example.org)
Date: Tue Aug 17 2010 - 13:03:07 CDT
Great :) I'm attaching, then, the results file which made me raise the
original question. I generated it with a Python script, actually, using the
Each line indicates one asymmetric relation: the character with a variant,
the variant type (Z for Z-variant or M for semantic variant), and the
variant which does not refer back to the original character.
The script also checked for asymmetric Simplified/Traditional pairs, but
didn't find any :)
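For readers who want to reproduce this kind of check, here is a minimal sketch of how the asymmetric relations described above could be found. It is not the script that produced the attached file. It assumes input lines in the tab-separated Unihan layout (code point, field name, value), with values that may carry a "<source" suffix; the function names and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Field name -> one-letter tag, matching the Z/M tags described above.
FIELDS = {"kZVariant": "Z", "kSemanticVariant": "M"}
CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")

def parse_variants(lines):
    """Map (tag, character) -> set of variant characters."""
    relations = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in FIELDS:
            continue
        source = chr(int(parts[0][2:], 16))
        for entry in parts[2].split():
            # Keep only the code point; drop any "<source" annotation.
            match = CODE_POINT.match(entry)
            if match:
                target = chr(int(match.group(1), 16))
                relations[FIELDS[parts[1]], source].add(target)
    return relations

def asymmetric(relations):
    """Yield (character, tag, variant) where variant does not point back."""
    for (tag, char), variants in sorted(relations.items()):
        for variant in sorted(variants):
            if char not in relations.get((tag, variant), ()):
                yield char, tag, variant

# Illustrative lines only, not copied from the Unihan files.
SAMPLE = [
    "U+4E80\tkZVariant\tU+9F9C",
    "U+758D\tkSemanticVariant\tU+86CB<kMatthews",
    "U+8AAA\tkZVariant\tU+8AAC",
    "U+8AAC\tkZVariant\tU+8AAA",
]

for char, tag, variant in asymmetric(parse_variants(SAMPLE)):
    print(char, tag, variant)
```

The same loop would cover Simplified/Traditional pairs if the two fields were checked against each other rather than against themselves; that part is left out here.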
On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins <email@example.com> wrote:
> Any help we can get in cleaning up the Unihan data is greatly appreciated.
> It would be very, very useful.
> On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
> Continuing this issue - I've played a bit with SQL access to Unihan data,
> and also found a few kDefinition fields which are only one or two characters
> long, e.g. "c" or "lr". I suppose other seemingly erroneous entries could be
> My question is, would it be useful if I gather and send such data (which
> I'd happily do), or do the Unihan maintainers have enough tools to find it
> and just need the time and resources to act on it?
> Uriah Eisenstein
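As a rough illustration of the kind of SQL query mentioned above, the following sketch looks for very short kDefinition values. The table layout (one row per code point, field and value) is an assumption, not the schema actually used, and the two short rows use private-use code points as placeholders because the thread does not say which characters were affected.

```python
import sqlite3

# Hypothetical schema: one row per (code point, field, value).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unihan (code_point TEXT, field TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO unihan VALUES (?, ?, ?)",
    [
        ("U+4E00", "kDefinition", "one; a, an; alone"),
        ("U+E000", "kDefinition", "c"),    # placeholder row
        ("U+E001", "kDefinition", "lr"),   # placeholder row
        ("U+E001", "kTotalStrokes", "9"),  # other fields are ignored
    ],
)

def short_definitions(conn, max_len=2):
    """Return (code_point, value) rows whose definition is suspiciously short."""
    query = (
        "SELECT code_point, value FROM unihan "
        "WHERE field = 'kDefinition' AND LENGTH(TRIM(value)) <= ? "
        "ORDER BY code_point"
    )
    return conn.execute(query, (max_len,)).fetchall()

for code_point, value in short_definitions(conn):
    print(code_point, repr(value))
```

A short value is only a hint that something may be wrong, so each hit would still need to be checked by hand.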
> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <
> firstname.lastname@example.org> wrote:
>> I see... Thanks for your answer. I suppose it should be easy enough to
>> find some of the inconsistencies, such as asymmetrical variant relations;
>> the real issue would be resolving them case-by-case.
>> A specific case where resolution, too, seems as though it should be easy
>> is when supposed Z-variants have quite a different total stroke count. This
>> can be checked with just the Unihan data; I could do that myself (after
>> overcoming the usual issues programming languages have with characters
>> outside the BMP).
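The stroke-count check suggested above could look roughly like the sketch below. It assumes the tab-separated Unihan line layout and a kTotalStrokes field whose first value is an integer; the `max_diff` threshold is an arbitrary assumption, since "quite a different" count is not defined in the thread. Code points are kept as integers throughout, which sidesteps the non-BMP problem, and `chr()` is used only for display.

```python
def parse_field(lines, field):
    """Map code point (int) -> list of raw value tokens for one field."""
    values = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and parts[1] == field and parts[0].startswith("U+"):
            values[int(parts[0][2:], 16)] = parts[2].split()
    return values

def stroke_mismatches(lines, max_diff=2):
    """Return (char, variant, strokes, variant_strokes) for suspicious pairs."""
    lines = list(lines)  # the input is scanned twice
    strokes = {
        cp: int(tokens[0])
        for cp, tokens in parse_field(lines, "kTotalStrokes").items()
    }
    found = []
    for cp, tokens in parse_field(lines, "kZVariant").items():
        for token in tokens:
            # Strip any "<source" annotation before parsing the code point.
            variant = int(token.split("<")[0][2:], 16)
            if cp in strokes and variant in strokes:
                if abs(strokes[cp] - strokes[variant]) > max_diff:
                    found.append((cp, variant, strokes[cp], strokes[variant]))
    return sorted(found)

# Made-up sample lines; the stroke counts are illustrative only.
SAMPLE = [
    "U+4E80\tkZVariant\tU+9F9C",
    "U+4E80\tkTotalStrokes\t11",
    "U+9F9C\tkTotalStrokes\t16",
    "U+2000B\tkZVariant\tU+2000C",
    "U+2000B\tkTotalStrokes\t5",
    "U+2000C\tkTotalStrokes\t5",
]

for cp, variant, a, b in stroke_mismatches(SAMPLE):
    print(f"U+{cp:04X} {chr(cp)} ({a}) vs U+{variant:04X} {chr(variant)} ({b})")
```

Pairs with missing stroke data are skipped rather than reported, which keeps the output limited to cases that can actually be compared.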
>> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <email@example.com> wrote:
>>> The kZVariant field has bad data in it that we haven't had time to clean
>>> up. It should, in theory, be symmetrical, and it should, in theory, contain
>>> only unifiable forms, but as you note, it doesn't. In addition to the use
>>> of the source separation rule, it should also cover characters which were
>>> added to the standard in error.
>>> In any event, I'm afraid that right now it's probably best not to rely on
>>> it for anything.
>>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
>>> To clarify my question with an example :) The character 亀 (U+4E80) is
>>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
>>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
>>> but not vice versa. From the definitions of these variant types in UAX#38,
>>> one would naturally expect them to be symmetrical, and both characters to
>>> show each other as variants. There are quite a few other such cases,
>>> although it does appear that in most cases the relation is symmetrical.
>>> My reason for asking, BTW, is that I'm thinking of grouping characters
>>> which are Z-variants of each other in some application, so I need to
>>> understand whether Z-variants are expected to have clear "cliques" in which
>>> each character is a Z-variant of all others.
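One way to test the "clique" expectation described above is to group characters by following the relation in either direction, then check whether every member of a group lists every other member. The sketch below does this with a small union-find; the function names are made up, and the sample relation uses placeholder letters rather than real Unihan data.

```python
from itertools import permutations

def components(relation):
    """Group items connected by the relation, ignoring direction."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for item, variants in relation.items():
        find(item)  # register items even if they list no variants
        for variant in variants:
            parent[find(item)] = find(variant)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

def is_clique(group, relation):
    """True if every member lists every other member as a variant."""
    return all(b in relation.get(a, ()) for a, b in permutations(group, 2))

# Placeholder data: A-B-C is connected but A and C do not list each other.
RELATION = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B"},
    "X": {"Y"},
    "Y": {"X"},
}

for group in components(RELATION):
    status = "clique" if is_clique(group, RELATION) else "not a clique"
    print(sorted(group), status)
```

Groups that are connected but fail the clique test are the ones where the data is either incomplete or where the relation is not really transitive.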
>>> I realize that the semantic variant relation, at least, is based on
>>> external sources and not determined by Unicode; regarding Z-variants I'm not
>>> clear. I'd like to know though whether the relation is expected to be
>>> symmetrical, and the above cases are to be considered errors; or there is
>>> some meaning to a one-directional relation; or something else.
>>> On a side note, some Z-variants I've looked at seem to have very
>>> different abstract shapes, in some cases looking more like
>>> simplified/traditional pairs. As I said, I don't know exactly how they are
>>> determined. Are they supposed to be exactly those pairs which would be
>>> unified if it were not for the Source Separation Rule?
>>> John H. Jenkins
> Hoani H. Tinikini
> John H. Jenkins
This archive was generated by hypermail 2.1.5 : Tue Aug 17 2010 - 13:05:59 CDT