Re: Unihan

From: John H. Jenkins (jenkins@apple.com)
Date: Tue Apr 12 2005 - 11:53:49 CST

    On Apr 12, 2005, at 10:43 AM, Benjamin C. Kite wrote:

    > I run across a few errors or omissions per week. If there is
    > interest in my input, I'd be happy to offer it to the appropriate
    > parties.
    >

    There is always interest. The correct way to report them is via
    <http://www.unicode.org/reporting.html>.

    > Aside from this I have a few questions:
    >
    > I am curious if Unihan is making private modifications to the
    > definitions, separate from CEDICT, or whether Unihan relies solely
    > on input from CEDICT for its definitions database.
    >

    Unihan does not use CEDICT directly in any way. The Web version of
    the database provides access to CEDICT and EDICT data, but that's as
    far as it goes.

    > Secondly, I notice that the definitions assigned to traditional
    > characters aren't always appended to the definitions of the
    > simplified characters, most especially when the simplified version
    > has its own meaning in the traditional set. It seems trivial to
    > append that information with one more database query. However, I'm
    > curious if there was an extended discussion about whether semantic
    > variants should hold the same definitions as their standard
    > counterparts. There are certainly numerous cases when a semantic
    > variant has no definition data where its standard counterpart
    > does. Should duplicate definitions be propagated here?
    >

    Ideally, yes. It's mostly a matter of finding someone who has the
    time to do the work. The other problem is the fact that the variant
    fields are in a state of constant flux at the moment, and so
    coordinating derivative changes to other fields is an additional chunk
    of work nobody as yet has the time to do. This is particularly true
    of the kDefinition field, which cannot reasonably be updated except
    by hand, since the existing contents have to be parsed to avoid
    duplication.
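
    To illustrate why this resists automation, here is a minimal sketch
    of a duplicate-aware merge. It is not part of any Unihan tooling.
    The function names are hypothetical, the sample strings are for
    illustration only, and it assumes that senses in a definition string
    are separated by semicolons.

```python
def split_senses(definition: str) -> list[str]:
    """Split a definition string into senses (assumes ';' separates senses)."""
    return [part.strip() for part in definition.split(";") if part.strip()]


def merge_definitions(target: str, source: str) -> str:
    """Append senses from `source` that `target` does not already contain.

    Comparison is exact after trimming and case-folding, so near-duplicates
    such as "cloud" vs. "clouds" are NOT detected.
    """
    merged = split_senses(target)
    seen = {sense.casefold() for sense in merged}
    for sense in split_senses(source):
        key = sense.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(sense)
    return "; ".join(merged)


if __name__ == "__main__":
    # Illustrative strings only; not guaranteed to match the database.
    print(merge_definitions("say, speak; clouds", "clouds; Yunnan province"))
```

    An exact-match comparison like this one only catches identical
    senses. Reworded or partially overlapping senses still need a human
    eye, which is why the field is updated by hand.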

    > I also notice that there are notations in the definition fields
    > that refer to other characters in three different ways: U+FFFF,
    > VEAFFFF, and also by including the character itself. Does this
    > fall into the demesne of the Unihan group, or is this also CEDICT?
    >

    This has nothing to do with CEDICT. Our goal is to have all
    references (at least in the kDefinition field) include both the
    character and the U+[2]xxxx form, but as yet nobody's had the time to
    do this in a systematic fashion. There should be no VEAxxxx
    references left at this point; if there are, there is an error.
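
    A systematic check along these lines could look roughly like the
    sketch below. The function name and the result labels are
    hypothetical, and the patterns assume that references are written
    with uppercase hexadecimal digits as U+xxxx or U+2xxxx.

```python
import re

# U+xxxx or U+2xxxx, not followed by a further hex digit.
U_REF = re.compile(r"U\+(2?[0-9A-F]{4})(?![0-9A-F])")
# Old-style reference that should no longer appear.
VEA_REF = re.compile(r"VEA[0-9A-F]{4}")


def audit_references(definition: str) -> list[tuple[str, str]]:
    """Report reference problems found in one definition string.

    Returns (label, reference) pairs:
      "stale-VEA"          a VEAxxxx reference is still present
      "missing-character"  a U+ reference whose character does not
                           appear anywhere in the same string
    """
    problems = []
    for match in VEA_REF.finditer(definition):
        problems.append(("stale-VEA", match.group(0)))
    for match in U_REF.finditer(definition):
        character = chr(int(match.group(1), 16))
        if character not in definition:
            problems.append(("missing-character", match.group(0)))
    return problems


if __name__ == "__main__":
    sample = "same as U+96F2; see VEA1234"
    for label, ref in audit_references(sample):
        print(label, ref)
```

    This only covers one direction (a U+ form without its character).
    Finding characters that lack a U+ form would need a second pass over
    the ideographs in the text.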

    > Lastly— for the moment— I'm curious whether there is any future
    > plan to include Wubi Hua or ITABC stroke input data to this
    > database. It would seem to be a fairly simple set of data to
    > include, and would make the database more useful, even if only a
    > limited number of characters were included.
    >

    Wubi Hua or ITABC stroke input data would be welcome if it were
    properly vetted and volunteered.

    The fundamental problem of the Unihan database is that it's entirely
    maintained by volunteer effort, and the volunteers all have day jobs
    which generally require more attention than the database does. The best way to
    get some data included is to provide it yourself.

    ========
    John H. Jenkins
    jenkins@apple.com
    jhjenkins@mac.com
    http://homepage.mac.com/jhjenkins/