Re: Unihan

From: John H. Jenkins (jenkins@apple.com)
Date: Tue Apr 12 2005 - 11:53:49 CST

Next message: Edward H. Trager: "Re: String name and Character Name"

Previous message: Benjamin C.Kite: "Re: Unihan"
In reply to: Benjamin C.Kite: "Re: Unihan"

On Apr 12, 2005, at 10:43 AM, Benjamin C.Kite wrote:

> I run across a few errors or omissions per week. If there is
> interest in my input, I'd be happy to offer it to the appropriate
> parties.
>

There is always interest. The correct way to report them is via
<http://www.unicode.org/reporting.html>.

> Aside from this I have a few questions:
>
> I am curious if Unihan is making private modifications to the
> definitions, separate from CEDICT, or whether Unihan relies solely
> on input from CEDICT for its definitions database.
>

Unihan does not use CEDICT directly in any way. The Web version of
the database provides access to CEDICT and EDICT data, but that's as
far as it goes.

> Secondly, I notice that the definitions assigned to traditional
> characters aren't always appended to the definitions of the
> simplified characters, most especially when the simplified version
> has its own meaning in the traditional set. It seems trivial to
> append that information with one more database query. However, I'm
> curious if there was an extended discussion about whether semantic
> variants should hold the same definitions as their standard
> counterparts. There are certainly numerous cases when a semantic
> variant has no definition data where its standard counterpart
> does. Should duplicate definitions be propagated here?
>

Ideally, yes. It's mostly a matter of finding someone who has the
time to do the work. The other problem is the fact that the variant
fields are in a state of constant flux at the moment, and so
coordinating derivative changes to other fields is an additional chunk
of work nobody as yet has the time to do. This is particularly true
of the kDefinition field, which cannot reasonably be updated except
by hand, since the existing contents have to be parsed to avoid
duplication.
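
To illustrate why the existing contents have to be parsed, here is a
minimal sketch in Python of propagating a definition from one character
to its variant without repeating senses already present. It is not part
of any Unihan tooling. The function name merge_definitions is
hypothetical, and the code assumes that senses within a kDefinition
value are separated by semicolons. It only catches exact repeats
(ignoring case), so near-duplicates would still need review by hand.

```python
def merge_definitions(existing: str, incoming: str) -> str:
    """Append senses from `incoming` that `existing` does not already list.

    Assumption: senses are separated by semicolons. Comparison is
    case-insensitive but otherwise exact, so reworded or overlapping
    senses are NOT detected and still need a human editor.
    """
    def senses(text: str) -> list[str]:
        # Split on semicolons and drop empty fragments.
        return [s.strip() for s in text.split(";") if s.strip()]

    merged = senses(existing)
    seen = {s.casefold() for s in merged}
    for sense in senses(incoming):
        key = sense.casefold()
        if key not in seen:
            merged.append(sense)
            seen.add(key)
    return "; ".join(merged)
```

A call such as merge_definitions("cloud; say", "say; clouds") keeps
"cloud" and "say" once and appends "clouds", which shows both the
de-duplication and its limits.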

> I also notice that there are notations in the definition fields
> that refer to other characters in three different ways: U+FFFF,
> VEAFFFF, and also by including the character itself. Does this
> fall into the demesne of the Unihan group, or is this also CEDICT?
>

This has nothing to do with CEDICT. Our goal is to have all
references (at least in the kDefinition field) include both the
character and the U+[2]xxxx form, but as yet nobody's had the time to
do this in a systematic fashion. There should be no VEAxxxx
references left at this point; if there are, there is an error.
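
A check along these lines could be sketched in Python as follows. This
is only an illustration, not an existing Unihan script. The function
name check_references is hypothetical, and the pattern for the old
VEAxxxx form assumes "VEA" followed by four uppercase hex digits, which
may not match every such reference. The U+ pattern follows the
U+[2]xxxx shape described above: four hex digits, optionally preceded
by a 2.

```python
import re

# U+xxxx or U+2xxxx, not followed by a further hex digit.
UPLUS = re.compile(r"U\+(2?[0-9A-F]{4})(?![0-9A-F])")
# Assumed shape of the obsolete references: VEA plus four hex digits.
VEA = re.compile(r"VEA[0-9A-F]{4}(?![0-9A-F])")

def check_references(definition: str) -> dict:
    """Report U+ references lacking the character itself, and any VEA references."""
    missing_char = []
    for match in UPLUS.finditer(definition):
        char = chr(int(match.group(1), 16))
        # The goal is for each reference to carry both forms, so the
        # character should appear somewhere in the same definition.
        if char not in definition:
            missing_char.append(match.group(0))
    return {
        "missing_char": missing_char,
        "stale_vea": VEA.findall(definition),
    }
```

Running it over every kDefinition value would give a list of entries to
fix by hand; it does not attempt to rewrite anything itself.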

> Lastly— for the moment— I'm curious whether there is any future
> plan to include Wubi Hua or ITABC stroke input data to this
> database. It would seem to be a fairly simple set of data to
> include, and would make the database more useful, even if only a
> limited number of characters were included.
>

Wubi hua or ITABC stroke input data would be welcome if it were
properly vetted and volunteered.

The fundamental problem of the Unihan database is that it's entirely
maintained by volunteer effort, and the volunteers all have day jobs
which generally require more attention than it does. The best way to
get some data included is to provide it yourself.

========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://homepage.mac.com/jhjenkins/


This archive was generated by hypermail 2.1.5 : Tue Apr 12 2005 - 11:55:42 CST