From: John H. Jenkins (firstname.lastname@example.org)
Date: Tue Aug 17 2010 - 12:05:36 CDT
Any help we can get in cleaning up the Unihan data is greatly appreciated. It would be very, very useful.
On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
> Continuing this issue - I've played a bit with SQL access to the Unihan data, and also found a few kDefinition fields that are only one or two characters long, e.g. "c" or "lr". I suppose other seemingly erroneous entries could be found.
> My question is, would it be useful if I gather and send such data (which I'd happily do), or do the Unihan maintainers have enough tools to find it and just need the time and resources to act on it?
> Uriah Eisenstein
> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <email@example.com> wrote:
> I see... Thanks for your answer. I suppose it should be easy enough to find some of the inconsistencies, such as asymmetrical variant relations; the real issue would be resolving them case by case.
> A specific case where resolution, too, seems as though it should be easy is when supposed Z-variants have quite different total stroke counts. This can be checked with just the Unihan data; I could do that myself (after overcoming the usual issues programming languages have with characters outside the BMP).
> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <firstname.lastname@example.org> wrote:
> The kZVariant field has bad data in it that we haven't had time to clean up. It should, in theory, be symmetrical, and it should, in theory, contain only unifiable forms, but as you note, it doesn't. In addition to the use of the source separation rule, it should also cover characters which were added to the standard in error.
> In any event, I'm afraid that right now it's probably best not to rely on it for anything.
> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
>> To clarify my question with an example :) The character 亀 (U+4E80) is listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB), but not vice versa. From the definitions of these variant types in UAX#38, one would naturally expect them to be symmetrical, and both characters to show each other as variants. There are quite a few other such cases, although it does appear that in most cases the relation is symmetrical.
>> My reason for asking, BTW, is that I'm thinking of grouping characters which are Z-variants of each other in some application, so I need to understand whether Z-variants are expected to have clear "cliques" in which each character is a Z-variant of all others.
>> I realize that the semantic variant relation, at least, is based on external sources and not determined by Unicode; regarding Z-variants I'm not clear. I'd like to know though whether the relation is expected to be symmetrical, and the above cases are to be considered errors; or there is some meaning to a one-directional relation; or something else.
>> On a side note, some Z-variants I've looked at seem to have very different abstract shapes, in some cases looking more like simplified/traditional pairs. As I said I don't know clearly how they are determined. Are they supposed to be exactly those pairs which would be unified if it were not for the Source Separation Rule?
> John H. Jenkins
Hoani H. Tinikini
John H. Jenkins
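
The symmetry check discussed in this thread could be sketched roughly as follows. This is only an illustration, not a tool used by the Unihan maintainers. It assumes the Unihan text layout of one tab-separated record per line (code point, field name, value), with "#" comment lines and space-separated values that may carry a "<source" suffix. The function names and the sample lines are hypothetical: the sample is modeled on the two examples given above and is not copied from the actual data files.

```python
import re
from collections import defaultdict

# Matches a code point such as U+4E80 or U+20000 inside a field value.
CODE_POINT = re.compile(r"U\+([0-9A-F]{4,6})")

def parse_variants(lines, field):
    """Map each code point to the set of code points listed under `field`."""
    variants = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] != field:
            continue
        source = int(parts[0][2:], 16)  # drop the leading "U+"
        for match in CODE_POINT.finditer(parts[2]):
            variants[source].add(int(match.group(1), 16))
    return variants

def asymmetric_pairs(variants):
    """Return sorted (a, b) pairs where b is listed for a but a is not listed for b."""
    return sorted(
        (a, b)
        for a, targets in variants.items()
        for b in targets
        if a not in variants.get(b, set())
    )

# Hypothetical sample records, for illustration only.
sample = [
    "# illustrative data, not real Unihan contents",
    "U+4E80\tkZVariant\tU+9F9C",
    "U+758D\tkSemanticVariant\tU+86CB<kMatthews",
    "U+86CB\tkDefinition\teggs",
]

for a, b in asymmetric_pairs(parse_variants(sample, "kZVariant")):
    # Working with integers avoids surrogate-pair issues for characters
    # outside the BMP; chr() handles any valid code point in Python 3.
    print(f"U+{a:04X} {chr(a)} lists U+{b:04X} {chr(b)}, but not vice versa")
```

To run this against a real data file, the `sample` list would be replaced by the lines of the file that holds the variant fields. A report like this only finds the one-directional entries; deciding which direction is correct still has to be done case by case.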