Re: Are Unihan variant relations expected to be symmetrical?

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Sat Aug 21 2010 - 05:31:36 CDT

    This is getting fun :) I've now found some duplicates in some of the reading
    fields (I haven't processed them all yet). kJapaneseKun for 橫 (U+6A6B) simply
    has all of its readings twice. kCantonese has several duplications too, but I
    can't tell whether these were meant to be different entries that became
    identical through typos, or are just redundant. The results file is attached.
    Also, should kHangul and kKorean be related? The two fields have rather
    different numbers of entries.
    Uriah
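
A minimal sketch of the duplicate check described above, not the script that produced the attached file. It assumes the tab-separated Unihan text layout (code point, field name, value, with "#" comment lines) and space-separated readings inside the value. The function name and the sample readings are placeholders made up for illustration, not real Unihan data.

```python
from collections import Counter

def find_duplicate_readings(lines, fields):
    """Yield (code_point, field, duplicated_readings) for repeated readings."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code_point, field, value = line.split("\t", 2)
        if field not in fields:
            continue
        counts = Counter(value.split())
        dupes = sorted(r for r, n in counts.items() if n > 1)
        if dupes:
            yield code_point, field, dupes

# Placeholder rows: the readings below are invented, not taken from Unihan.
SAMPLE = [
    "# comment line",
    "U+6A6B\tkJapaneseKun\tAAA BBB AAA BBB",
    "U+4E00\tkCantonese\tfoo1",
    "U+4E01\tkCantonese\tbar1 baz2 bar1",
]

for cp, field, dupes in find_duplicate_readings(SAMPLE, {"kJapaneseKun", "kCantonese"}):
    print(cp, field, " ".join(dupes))
```

A check like this only reports exact repeats; it cannot tell a redundant reading from one that was meant to differ and was mistyped.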

    P.S. Please inform me of course if there's anywhere else I should send this
    info.

    On Fri, Aug 20, 2010 at 9:19 PM, John H. Jenkins <jenkins@apple.com> wrote:

    > I'm fleshing them out. Even when they are (technically) correct, the
    > current definitions don't really help anybody know what they're supposed to
    > mean. Most are either units of measure or chemical elements; if I didn't
    > happen to recognize some of the latter, I'd be totally at sea myself.
    >
    > On Aug 20, 2010, at 12:15 PM, Uriah Eisenstein wrote:
    >
    > Interesting indeed. I did suspect that "km" might stand for "kilometre",
    > but most others look to me like gibberish, and anyway if they are
    > abbreviations they could be ambiguous. A full definition would probably be
    > more useful, if one could be found.
    >
    > On Fri, Aug 20, 2010 at 8:04 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >
    >> Thanks. The interesting thing is that most of these are correct, just
    >> unobvious abbreviations. We'll see how many manage to slip in.
    >>
    >> On Aug 20, 2010, at 11:50 AM, Uriah Eisenstein wrote:
    >>
    >> Hi again,
    >> Now I have a skeletal Java GUI application for Unihan SQL access, so I can
    >> copy-paste the results along with the Chinese characters (doesn't seem
    >> possible with my Windows console...). Attached is the set of characters with
    >> kDefinition fields of 1 or 2 letters; they don't seem to make much sense.
    >> I've also checked the 3-letter definitions, but these all seem valid.
    >> I hope this will be useful, especially as I understand that Unicode 6.0 is
    >> still in the making so maybe a few fixes could be "slipped in".
    >> Regards,
    >> Uriah
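
A rough sketch of the kind of SQL query described above, using Python's built-in sqlite3 module in place of the Java application. The table layout `unihan(code_point, field, value)` is a hypothetical schema, not the one used in the thread. The code points are Private Use placeholders, since the thread does not say which characters carry the definitions "c", "lr" and "km".

```python
import sqlite3

# Hypothetical rows; only the short definition strings come from the thread.
SAMPLE_ROWS = [
    ("U+E000", "kDefinition", "c"),
    ("U+E001", "kDefinition", "lr"),
    ("U+E002", "kDefinition", "km"),
    ("U+E003", "kDefinition", "a longer, ordinary definition"),
    ("U+E004", "kCantonese", "a1"),  # short, but not a definition
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unihan (code_point TEXT, field TEXT, value TEXT)")
conn.executemany("INSERT INTO unihan VALUES (?, ?, ?)", SAMPLE_ROWS)

def short_definitions(conn, max_len=2):
    """Return (code_point, value) for kDefinition values of at most max_len characters."""
    cur = conn.execute(
        "SELECT code_point, value FROM unihan "
        "WHERE field = 'kDefinition' AND length(value) <= ? "
        "ORDER BY code_point",
        (max_len,),
    )
    return cur.fetchall()

for cp, value in short_definitions(conn):
    print(cp, value)
```

Raising `max_len` to 3 would cover the 3-letter definitions as well.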
    >>
    >> 2010/8/17 Uriah Eisenstein <uriaheisenstein@gmail.com>
    >>
    >>> Great :) I'm attaching, then, the results file that made me raise the
    >>> original question. I generated it with a Python script, actually, using the
    >>> 3rd-party cjklib.
    >>> Each line indicates one asymmetric relation: the character with a
    >>> variant, the variant type (Z for Z-variant or M for semantic variant), and
    >>> the variant which does not refer back to the original character.
    >>> The script also checked for asymmetric Simplified/Traditional pairs, but
    >>> didn't find any :)
    >>> HTH,
    >>> Uriah
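
The asymmetry check described above can be sketched without cjklib, reading the tab-separated Unihan text format directly. This is an illustration under stated assumptions, not the original script: the function names are made up, variant values are assumed to be "U+XXXX" code points optionally followed by a "<source" suffix, and the sample rows are illustrative. In particular, which of the two entries carries each one-way link is an assumption, and the Private Use code points are placeholders.

```python
import re

VARIANT_FIELDS = {"kZVariant": "Z", "kSemanticVariant": "M"}
CODE_POINT = re.compile(r"U\+[0-9A-F]{4,6}")

def parse_variants(lines):
    """Return {(code_point, type_letter): set of variant code points}."""
    relations = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code_point, field, value = line.split("\t", 2)
        letter = VARIANT_FIELDS.get(field)
        if letter is None:
            continue
        # Keep only the code points; any "<source" suffix is ignored.
        relations.setdefault((code_point, letter), set()).update(
            CODE_POINT.findall(value)
        )
    return relations

def asymmetric(relations):
    """List (character, type, variant) where the variant does not refer back."""
    found = []
    for (cp, letter), variants in sorted(relations.items()):
        for variant in sorted(variants):
            if cp not in relations.get((variant, letter), set()):
                found.append((cp, letter, variant))
    return found

# Illustrative rows only; the source tag and link directions are assumptions.
SAMPLE = [
    "U+9F9C\tkZVariant\tU+4E80",
    "U+86CB\tkSemanticVariant\tU+758D<kMatthews",
    "U+E000\tkZVariant\tU+E001",
    "U+E001\tkZVariant\tU+E000",
]

for cp, letter, variant in asymmetric(parse_variants(SAMPLE)):
    print(cp, letter, variant)
```

Keeping characters as "U+XXXX" strings rather than decoded text means code points outside the BMP need no special handling.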
    >>>
    >>>
    >>> On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >>>
    >>>> Any help we can get in cleaning up the Unihan data is greatly
    >>>> appreciated. It would be very, very useful.
    >>>>
    >>>> On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
    >>>>
    >>>> Hi,
    >>>> Continuing this issue - I've played a bit with SQL access to Unihan
    >>>> data, and also found a few kDefinition fields which are only one or two
    >>>> characters long, e.g. "c" or "lr". I suppose other seemingly erroneous
    >>>> entries could be found.
    >>>> My question is, would it be useful if I gather and send such data (which
    >>>> I'd happily do), or do the Unihan maintainers have enough tools to find it
    >>>> and just need the time and resources to act on it?
    >>>>
    >>>> Regards,
    >>>> Uriah Eisenstein
    >>>>
    >>>> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <
    >>>> uriaheisenstein@gmail.com> wrote:
    >>>>
    >>>>> I see... Thanks for your answer. I suppose it should be easy enough to
    >>>>> find some of the inconsistencies, such as asymmetrical variant relations;
    >>>>> the real issue would be resolving them case by case.
    >>>>> A specific case where resolution, too, seems as though it should be
    >>>>> easy is when supposed Z-variants have quite different total stroke counts.
    >>>>> This can be checked with just the Unihan data; I could do that myself (after
    >>>>> overcoming the usual issues programming languages have with characters
    >>>>> outside the BMP).
    >>>>>
    >>>>> Uriah
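
One possible shape for the stroke-count check mentioned above, as a sketch only. It assumes the kZVariant and kTotalStrokes data have already been loaded into plain dictionaries. The function name, the `max_diff` threshold and the sample values are all hypothetical; the thread does not say how large a difference should count as suspicious.

```python
def stroke_mismatches(z_variants, total_strokes, max_diff=2):
    """Return (a, b, strokes_a, strokes_b) for Z-variant pairs whose
    total stroke counts differ by more than max_diff."""
    seen = set()
    flagged = []
    for cp, variants in z_variants.items():
        for variant in variants:
            pair = tuple(sorted((cp, variant)))
            if pair in seen:
                continue  # report each unordered pair once
            seen.add(pair)
            a = total_strokes.get(pair[0])
            b = total_strokes.get(pair[1])
            if a is None or b is None:
                continue  # no stroke data for one side; nothing to compare
            if abs(a - b) > max_diff:
                flagged.append((pair[0], pair[1], a, b))
    return sorted(flagged)

# Placeholder data using Private Use code points; not real Unihan values.
Z_VARIANTS = {
    "U+E000": {"U+E001"},
    "U+E001": {"U+E000"},
    "U+E002": {"U+E003"},
}
TOTAL_STROKES = {"U+E000": 11, "U+E001": 16, "U+E002": 8, "U+E003": 9}

for row in stroke_mismatches(Z_VARIANTS, TOTAL_STROKES):
    print(*row)
```

Because code points stay as "U+XXXX" strings throughout, characters outside the BMP are treated the same as any others.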
    >>>>>
    >>>>>
    >>>>> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <jenkins@apple.com> wrote:
    >>>>>
    >>>>>> The kZVariant field has bad data in it that we haven't had time to
    >>>>>> clean up. It should, in theory, be symmetrical, and it should, in theory,
    >>>>>> contain only unifiable forms, but as you note, it doesn't. In addition to
    >>>>>> the use of the source separation rule, it should also cover characters which
    >>>>>> were added to the standard in error.
    >>>>>>
    >>>>>> In any event, I'm afraid that right now it's probably best not to rely
    >>>>>> on it for anything.
    >>>>>>
    >>>>>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
    >>>>>>
    >>>>>> Hi,
    >>>>>> To clarify my question with an example :) The character 亀 (U+4E80) is
    >>>>>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
    >>>>>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
    >>>>>> but not vice versa. From the definitions of these variant types in UAX#38,
    >>>>>> one would naturally expect them to be symmetrical, and both characters to
    >>>>>> show each other as variants. There are quite a few other such cases,
    >>>>>> although it does appear that in most cases the relation is symmetrical.
    >>>>>> My reason for asking, BTW, is that I'm thinking of grouping characters
    >>>>>> which are Z-variants of each other in some application, so I need to
    >>>>>> understand whether Z-variants are expected to have clear "cliques" in which
    >>>>>> each character is a Z-variant of all others.
    >>>>>> I realize that the semantic variant relation, at least, is based on
    >>>>>> external sources and not determined by Unicode; regarding Z-variants I'm not
    >>>>>> clear. I'd like to know though whether the relation is expected to be
    >>>>>> symmetrical, and the above cases are to be considered errors; or there is
    >>>>>> some meaning to a one-directional relation; or something else.
    >>>>>> On a side note, some Z-variants I've looked at seem to have very
    >>>>>> different abstract shapes, in some cases looking more like
    >>>>>> simplified/traditional pairs. As I said I don't know clearly how they are
    >>>>>> determined. Are they supposed to be exactly those pairs which would be
    >>>>>> unified if it were not for the Source Separation Rule?
    >>>>>>
    >>>>>> TIA,
    >>>>>> Uriah
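
The "clique" question raised above can be made concrete with a short sketch: group characters connected by any Z-variant link, ignoring direction, then test whether every member of a group lists all the others. The function names and the sample data (Private Use placeholders) are assumptions for illustration, not part of any Unihan tooling.

```python
def components(relations):
    """Group code points connected by any variant link, ignoring direction."""
    adjacency = {}
    for cp, variants in relations.items():
        adjacency.setdefault(cp, set())
        for variant in variants:
            adjacency[cp].add(variant)
            adjacency.setdefault(variant, set()).add(cp)
    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk of one connected group
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups

def is_clique(group, relations):
    """True if every member lists every other member as a variant."""
    return all(group - {cp} <= relations.get(cp, set()) for cp in group)

# Placeholder data: one fully symmetric group and one one-way link.
RELATIONS = {
    "U+E000": {"U+E001", "U+E002"},
    "U+E001": {"U+E000", "U+E002"},
    "U+E002": {"U+E000", "U+E001"},
    "U+E010": {"U+E011"},
}

for group in components(RELATIONS):
    print(sorted(group), is_clique(group, RELATIONS))
```

Groups that fail the clique test are the ones an application would need to resolve case by case before relying on them.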
    >>>>>>
    >>>>>>
    >>>>>> =====
    >>>>>> John H. Jenkins
    >>>>>> jenkins@apple.com
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>
    >>>>
    >>>> =====
    >>>> Hoani H. Tinikini
    >>>>
    >>>> John H. Jenkins
    >>>> jenkins@apple.com
    >>>>
    >>>>
    >>>>
    >>>
    >> <short_definitions.txt>
    >>
    >>
    >> =====
    >> John H. Jenkins
    >> jenkins@apple.com
    >>
    >>
    >>
    >
    > =====
    > Hoani H. Tinikini
    > John H. Jenkins
    > jenkins@apple.com
    >
    >
    >




