Re: Are Unihan variant relations expected to be symmetrical?

From: Uriah Eisenstein
Date: Tue Aug 17 2010 - 13:03:07 CDT


    Great :) In that case I'm attaching the results file that made me raise the
    original question. I generated it with a Python script, using the
    third-party cjklib library.
    Each line indicates one asymmetric relation: the character with a variant,
    the variant type (Z for Z-variant or M for semantic variant), and the
    variant which does not refer back to the original character.
    The script also checked for asymmetric Simplified/Traditional pairs, but
    didn't find any :)
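For readers who want to reproduce a check like this without cjklib, here is a minimal sketch. It is not the original script. It assumes the input uses the tab-separated layout of Unihan_Variants.txt (code point, field name, value list, with optional "<source" suffixes on each value); the function names parse_variants and asymmetric_pairs are made up for illustration.

```python
import re
from collections import defaultdict

# Matches a code point such as U+4E80 or U+20000; a "<source" suffix is ignored.
CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")

def parse_variants(lines, fields=("kZVariant", "kSemanticVariant")):
    """Return {field: {char: set of variant chars}} from Unihan-style lines."""
    relations = {field: defaultdict(set) for field in fields}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in relations:
            continue  # not one of the fields we were asked for
        source = chr(int(parts[0][2:], 16))  # assumes the "U+XXXX" form
        for match in CODE_POINT.finditer(parts[2]):
            relations[parts[1]][source].add(chr(int(match.group(1), 16)))
    return relations

def asymmetric_pairs(relation):
    """Yield (char, variant) where variant does not list char back."""
    for char, variants in sorted(relation.items()):
        for variant in sorted(variants):
            if char not in relation.get(variant, ()):
                yield char, variant
```

Passing an open file object for Unihan_Variants.txt to parse_variants and then iterating asymmetric_pairs over the "kZVariant" entry should list the one-directional Z-variant pairs. Python 3 strings handle characters outside the BMP directly, so no surrogate handling is needed.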

    On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins wrote:

    > Any help we can get in cleaning up the Unihan data is greatly appreciated.
    > It would be very, very useful.
    > On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
    > Hi,
    > Continuing this issue - I've played a bit with SQL access to Unihan data,
    > and also found a few kDefinition fields which are only one or two characters
    > long, e.g. "c" or "lr". I suppose other seemingly erroneous entries could be
    > found.
    > My question is, would it be useful if I gather and send such data (which
    > I'd happily do), or do the Unihan maintainers have enough tools to find it
    > and just need the time and resources to act on it?
    > Regards,
    > Uriah Eisenstein
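A query of the kind described in the message above could look roughly like the following sketch. The table name unihan and its three columns are a hypothetical schema, not the one actually used in the thread; the sketch loads tab-separated Unihan lines into an in-memory SQLite database and selects suspiciously short kDefinition values.

```python
import sqlite3

def load_unihan(lines):
    """Load tab-separated Unihan lines into an in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE unihan (code_point TEXT, field TEXT, value TEXT)")
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            rows.append(tuple(parts))
    conn.executemany("INSERT INTO unihan VALUES (?, ?, ?)", rows)
    return conn

def short_definitions(conn, max_len=2):
    """Return (code_point, value) rows whose kDefinition has at most max_len characters."""
    query = (
        "SELECT code_point, value FROM unihan "
        "WHERE field = 'kDefinition' AND length(trim(value)) <= ? "
        "ORDER BY code_point"
    )
    return conn.execute(query, (max_len,)).fetchall()
```

The threshold of two characters is taken from the examples in the message; a short definition is only a hint that an entry deserves a manual look, not proof of an error.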
    > On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein wrote:
    >> I see... Thanks for your answer. I suppose it should be easy enough to
    >> find some of the inconsistencies, such as asymmetrical variant relations;
    >> the real issue would be resolving them case by case.
    >> A specific case where resolution, too, seems as though it should be easy
    >> is when supposed Z-variants have quite different total stroke counts. This
    >> can be checked with just the Unihan data, and I could do that myself (after
    >> overcoming the usual issues programming languages have with characters
    >> outside the BMP).
    >> Uriah
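The stroke-count check suggested above can be sketched as follows. This is an illustration under assumptions, not code from the thread: it expects kTotalStrokes lines in the usual tab-separated Unihan layout, takes the Z-variant relation as a plain dict mapping each character to a set of variants, and uses an arbitrary tolerance of two strokes. The function names are hypothetical.

```python
def parse_total_strokes(lines):
    """Return {char: stroke count} from tab-separated kTotalStrokes lines."""
    strokes = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) == 3 and parts[1] == "kTotalStrokes":
            # If a value lists several counts, take the first one.
            strokes[chr(int(parts[0][2:], 16))] = int(parts[2].split()[0])
    return strokes

def stroke_mismatches(z_variants, strokes, tolerance=2):
    """Yield (char, variant, difference) for pairs whose stroke counts differ
    by more than tolerance. Pairs with a missing stroke count are skipped."""
    for char, variants in sorted(z_variants.items()):
        for variant in sorted(variants):
            if char in strokes and variant in strokes:
                diff = abs(strokes[char] - strokes[variant])
                if diff > tolerance:
                    yield char, variant, diff
```

In Python 3, chr() returns a single character for code points outside the BMP as well, which sidesteps the surrogate-pair issue mentioned in the message.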
    >> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins wrote:
    >>> The kZVariant field has bad data in it that we haven't had time to clean
    >>> up. It should, in theory, be symmetrical, and it should, in theory, contain
    >>> only unifiable forms, but as you note, it doesn't. In addition to the use
    >>> of the source separation rule, it should also cover characters which were
    >>> added to the standard in error.
    >>> In any event, I'm afraid that right now it's probably best not to rely on
    >>> it for anything.
    >>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
    >>> Hi,
    >>> To clarify my question with an example :) The character 亀 (U+4E80) is
    >>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
    >>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
    >>> but not vice versa. From the definitions of these variant types in UAX#38,
    >>> one would naturally expect them to be symmetrical, and both characters to
    >>> show each other as variants. There are quite a few other such cases,
    >>> although it does appear that in most cases the relation is symmetrical.
    >>> My reason for asking, BTW, is that I'm thinking of grouping characters
    >>> which are Z-variants of each other in some application, so I need to
    >>> understand whether Z-variants are expected to have clear "cliques" in which
    >>> each character is a Z-variant of all others.
    >>> I realize that the semantic variant relation, at least, is based on
    >>> external sources and not determined by Unicode; regarding Z-variants I'm not
    >>> clear. I'd like to know though whether the relation is expected to be
    >>> symmetrical, and the above cases are to be considered errors; or there is
    >>> some meaning to a one-directional relation; or something else.
    >>> On a side note, some Z-variants I've looked at seem to have very
    >>> different abstract shapes, in some cases looking more like
    >>> simplified/traditional pairs. As I said I don't know clearly how they are
    >>> determined. Are they supposed to be exactly those pairs which would be
    >>> unified if it were not for the Source Separation Rule?
    >>> TIA,
    >>> Uriah
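One way to test the "clique" question raised above is sketched here, under the assumption that the variant relation is available as a dict mapping each character to the set of characters it lists as variants. The helper names are hypothetical. Characters are first grouped by treating the relation as undirected, and each group is then checked against the directed data to see whether every member lists every other member.

```python
from collections import defaultdict

def variant_groups(relation):
    """Return connected components of the relation, treated as undirected."""
    neighbours = defaultdict(set)
    for char, variants in relation.items():
        for variant in variants:
            neighbours[char].add(variant)
            neighbours[variant].add(char)
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk over the undirected links
            char = stack.pop()
            if char not in group:
                group.add(char)
                stack.extend(neighbours[char] - group)
        seen |= group
        groups.append(group)
    return groups

def is_clique(group, relation):
    """True if every member lists every other member as a variant."""
    return all(
        other in relation.get(char, ())
        for char in group
        for other in group
        if other != char
    )
```

Groups for which is_clique returns False are the ones where the data is asymmetric or not transitive, so they would need case-by-case review before being merged in an application.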
    >>> =====
    >>> John H. Jenkins
    > =====
    > Hoani H. Tinikini
    > John H. Jenkins
