Re: Are Unihan variant relations expected to be symmetrical?

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Fri Aug 20 2010 - 12:50:27 CDT

    Hi again,
    I now have a skeletal Java GUI application for Unihan SQL access, so I can
    copy and paste the results along with the Chinese characters (this doesn't
    seem possible with my Windows console...). Attached is the set of characters
    whose kDefinition fields are only 1 or 2 letters long; they don't seem to
    make much sense. I've also checked the 3-letter definitions, but these all
    seem valid.
    I hope this will be useful, especially as I understand that Unicode 6.0 is
    still in the making so maybe a few fixes could be "slipped in".
    Regards,
    Uriah
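
For readers who would rather not set up SQL access, the short-definition check described above can be approximated directly on the Unihan text files, whose data lines have the form `U+XXXX<TAB>field<TAB>value` with `#` marking comments. This is only a minimal sketch, not the query actually used in this message: the function name `short_definitions`, the `max_len` parameter and the sample lines are assumptions for illustration, and the one- and two-letter sample values are made up rather than taken from Unihan.

```python
def short_definitions(lines, max_len=2):
    """Yield (code point, character, definition) for very short kDefinition values."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue
        code, field, value = parts
        if field == "kDefinition" and len(value.strip()) <= max_len:
            # "U+4E00" -> the character itself; works beyond the BMP in Python 3.
            yield code, chr(int(code[2:], 16)), value


# Hypothetical sample input; only the line format mirrors the Unihan files.
sample = [
    "# made-up sample lines for illustration",
    "U+4E00\tkDefinition\tone; a, an; alone",
    "U+3400\tkDefinition\tc",
    "U+3401\tkDefinition\tlr",
    "U+3401\tkTotalStrokes\t6",
]
hits = list(short_definitions(sample))
```

With a real file, the same function could be fed an open file object, e.g. the file that carries kDefinition in the release being checked (assumed here to be Unihan_Readings.txt).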

    2010/8/17 Uriah Eisenstein <uriaheisenstein@gmail.com>

    > Great :) In that case, I'm attaching the results file which made me raise
    > the original question. I generated it with a Python script, actually, using
    > the third-party cjklib.
    > Each line indicates one asymmetric relation: the character with a variant,
    > the variant type (Z for Z-variant or M for semantic variant), and the
    > variant which does not refer back to the original character.
    > The script also checked for asymmetric Simplified/Traditional pairs, but
    > didn't find any :)
    > HTH,
    > Uriah
    >
    >
    > On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >
    >> Any help we can get in cleaning up the Unihan data is greatly appreciated.
    >> It would be very, very useful.
    >>
    >> On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
    >>
    >> Hi,
    >> Continuing this issue - I've played a bit with SQL access to Unihan data,
    >> and found also a few kDefinition fields which are only one or two characters
    >> long, e.g. "c" or "lr". I suppose other seemingly erroneous entries could be
    >> found.
    >> My question is, would it be useful if I gather and send such data (which
    >> I'd happily do), or do the Unihan maintainers have enough tools to find it
    >> and just need the time and resources to act on it?
    >>
    >> Regards,
    >> Uriah Eisenstein
    >>
    >> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <
    >> uriaheisenstein@gmail.com> wrote:
    >>
    >>> I see... Thanks for your answer. I suppose it should be easy enough to
    >>> find some of the inconsistencies, such as asymmetrical variant relations;
    >>> the real issue would be resolving them case by case.
    >>> A specific case where resolution, too, seems as though it should be easy
    >>> is when supposed Z-variants have quite different total stroke counts. This
    >>> can be checked with just the Unihan data, and I could do that myself (after
    >>> overcoming the usual issues programming languages have with characters
    >>> outside the BMP).
    >>>
    >>> Uriah
    >>>
    >>>
    >>> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >>>
    >>>> The kZVariant field has bad data in it that we haven't had time to clean
    >>>> up. It should, in theory, be symmetrical, and it should, in theory, contain
    >>>> only unifiable forms, but as you note, it doesn't. In addition to the use
    >>>> of the source separation rule, it should also cover characters which were
    >>>> added to the standard in error.
    >>>>
    >>>> In any event, I'm afraid that right now it's probably best not to rely
    >>>> on it for anything.
    >>>>
    >>>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
    >>>>
    >>>> Hi,
    >>>> To clarify my question with an example :) The character 亀 (U+4E80) is
    >>>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
    >>>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
    >>>> but not vice versa. From the definitions of these variant types in UAX#38,
    >>>> one would naturally expect them to be symmetrical, and both characters to
    >>>> show each other as variants. There are quite a few other such cases,
    >>>> although it does appear that in most cases the relation is symmetrical.
    >>>> My reason for asking, BTW, is that I'm thinking of grouping characters
    >>>> which are Z-variants of each other in some application, so I need to
    >>>> understand whether Z-variants are expected to have clear "cliques" in which
    >>>> each character is a Z-variant of all others.
    >>>> I realize that the semantic variant relation, at least, is based on
    >>>> external sources and not determined by Unicode; regarding Z-variants I'm not
    >>>> clear. I'd like to know though whether the relation is expected to be
    >>>> symmetrical, and the above cases are to be considered errors; or there is
    >>>> some meaning to a one-directional relation; or something else.
    >>>> On a side note, some Z-variants I've looked at seem to have very
    >>>> different abstract shapes, in some cases looking more like
    >>>> simplified/traditional pairs. As I said I don't know clearly how they are
    >>>> determined. Are they supposed to be exactly those pairs which would be
    >>>> unified if it were not for the Source Separation Rule?
    >>>>
    >>>> TIA,
    >>>> Uriah
    >>>>
    >>>>
    >>>> =====
    >>>> John H. Jenkins
    >>>> jenkins@apple.com
    >>>>
    >>>>
    >>>>
    >>>
    >>
    >> =====
    >> Hoani H. Tinikini
    >>
    >> John H. Jenkins
    >> jenkins@apple.com
    >>
    >>
    >>
    >
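
The asymmetry check described in the 17 Aug message (a character lists a variant that does not refer back to it) can be sketched without cjklib by reading the variant fields straight from the Unihan text format. This is an illustrative sketch, not the script that produced the attached results: the names `load_variants` and `asymmetric` are hypothetical, the sample lines are modelled on the examples in this thread rather than copied from Unihan, and the handling of a `<source` suffix on variant values is an assumption about the value syntax.

```python
from collections import defaultdict


def load_variants(lines, fields=("kZVariant", "kSemanticVariant")):
    """Map (field, source code point) to the set of variant code points it lists."""
    relations = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3 or parts[1] not in fields:
            continue
        source, field, value = parts
        for token in value.split():
            # Drop any "<source" annotation after the code point (assumed syntax).
            relations[(field, source)].add(token.split("<", 1)[0])
    return relations


def asymmetric(relations):
    """Return sorted (source, field, target) triples where target does not list source."""
    missing = []
    for (field, source), targets in relations.items():
        for target in targets:
            # .get avoids adding keys to the defaultdict while iterating over it.
            if source not in relations.get((field, target), set()):
                missing.append((source, field, target))
    return sorted(missing)


# Illustrative sample lines, not real Unihan records.
sample = [
    "U+4E80\tkZVariant\tU+9F9C",
    "U+5000\tkZVariant\tU+5001",
    "U+5001\tkZVariant\tU+5000",
    "U+758D\tkSemanticVariant\tU+86CB<kMatthews",
]
report = asymmetric(load_variants(sample))
```

Each triple in `report` corresponds to one line of the kind described above: the character with a variant, the variant type, and the variant that does not point back.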

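The original question about whether Z-variants form clear "cliques" can also be explored mechanically. One possible approach, sketched here under the assumption that a one-directional listing should still link two characters, is to group characters into connected components with a small union-find and then test whether every member of a group lists every other member. The function names and the placeholder labels "A" to "E" are hypothetical; real input would be (source, target) code point pairs.

```python
def variant_groups(pairs):
    """Group items into connected components, treating each pair as undirected."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


def is_clique(group, pairs):
    """True if every member of the group lists every other member as a variant."""
    listed = set(pairs)
    return all((a, b) in listed for a in group for b in group if a != b)


# Placeholder data: A and B list each other, B also lists C, D and E list each other.
pairs = [("A", "B"), ("B", "A"), ("B", "C"), ("D", "E"), ("E", "D")]
groups = variant_groups(pairs)
```

Groups for which `is_clique` returns False are the ones that would need case-by-case review before being relied on.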