Re: Are Unihan variant relations expected to be symmetrical?

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Fri Aug 20 2010 - 12:50:27 CDT

Next message: Peter Constable: "looking for Korean fonts that include U+302E, U+302F"

Previous message: William_J_G Overington: "Re: Accessing alternate glyphs from plain text"
In reply to: Uriah Eisenstein: "Re: Are Unihan variant relations expected to be symmetrical?"
Next in thread: Uriah Eisenstein: "Re: Are Unihan variant relations expected to be symmetrical?"

Hi again,
Now I have a skeletal Java GUI application for Unihan SQL access, so I can
copy-paste the results along with the Chinese characters (this doesn't seem
possible with my Windows console...). Attached is the set of characters whose
kDefinition fields are only 1 or 2 letters long; these values don't seem to
make much sense. I've also checked the 3-letter definitions, but these all
seem valid.
I hope this will be useful, especially as I understand that Unicode 6.0 is
still in the making, so maybe a few fixes could be "slipped in".
Regards,
Uriah
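
For anyone who wants to reproduce the short-definition check without a SQL
database, here is a minimal Python 3 sketch. It assumes the usual Unihan text
layout of tab-separated "U+XXXX<TAB>field<TAB>value" lines with "#" comment
lines; the function name short_definitions and the max_len parameter are
made up for illustration and are not part of any Unihan tool. It is not the
query actually used to build the attachment.

```python
def short_definitions(lines, max_len=2):
    """Yield (code point, character, definition) for suspiciously short kDefinition values.

    `lines` is any iterable of text lines in the assumed Unihan layout:
    U+XXXX <TAB> field name <TAB> value
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # not a data line in the assumed format
        code, field, value = parts
        if field == "kDefinition" and len(value.strip()) <= max_len:
            # chr() also works for code points outside the BMP in Python 3.
            yield code, chr(int(code[2:], 16)), value


if __name__ == "__main__":
    # Made-up sample lines for illustration only, not real Unihan entries.
    sample = [
        "# sample data",
        "U+4E00\tkDefinition\tone; a, an; alone",
        "U+20000\tkDefinition\tlr",
    ]
    for code, char, definition in short_definitions(sample):
        print(code, char, repr(definition))
```

To run it on real data, pass an open file object (opened with
encoding="utf-8") for whichever Unihan file holds kDefinition in your release.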

2010/8/17 Uriah Eisenstein <uriaheisenstein@gmail.com>

> Great :) In that case, I'm attaching the results file that made me raise the
> original question. I actually generated it with a Python script, using the
> third-party cjklib library.
> Each line indicates one asymmetric relation: the character with a variant,
> the variant type (Z for Z-variant or M for semantic variant), and the
> variant which does not refer back to the original character.
> The script also checked for asymmetric Simplified/Traditional pairs, but
> didn't find any :)
> HTH,
> Uriah
>
>
> On Tue, Aug 17, 2010 at 7:05 PM, John H. Jenkins <jenkins@apple.com> wrote:
>
>> Any help we can get in cleaning up the Unihan data is greatly appreciated.
>> It would be very, very useful.
>>
>> On Aug 17, 2010, at 3:10 AM, Uriah Eisenstein wrote:
>>
>> Hi,
>> Continuing this issue - I've played a bit with SQL access to Unihan data,
>> and also found a few kDefinition fields which are only one or two characters
>> long, e.g. "c" or "lr". I suppose other seemingly erroneous entries could be
>> found.
>> My question is, would it be useful if I gather and send such data (which
>> I'd happily do), or do the Unihan maintainers have enough tools to find it
>> and just need the time and resources to act on it?
>>
>> Regards,
>> Uriah Eisenstein
>>
>> On Wed, Jun 30, 2010 at 11:55 AM, Uriah Eisenstein <
>> uriaheisenstein@gmail.com> wrote:
>>
>>> I see... Thanks for your answer. I suppose it should be easy enough to
>>> find some of the inconsistencies, such as asymmetrical variant relations;
>>> the real issue would be resolving them case by case.
>>> A specific case where resolution, too, seems as though it should be easy
>>> is when supposed Z-variants have quite different total stroke counts. This
>>> can be checked with just the Unihan data; I could do that myself (after
>>> overcoming the usual issues programming languages have with characters
>>> outside the BMP).
>>>
>>> Uriah
>>>
>>>
>>> On Tue, Jun 29, 2010 at 9:36 PM, John H. Jenkins <jenkins@apple.com> wrote:
>>>
>>>> The kZVariant field has bad data in it that we haven't had time to clean
>>>> up. It should, in theory, be symmetrical, and it should, in theory, contain
>>>> only unifiable forms, but as you note, it doesn't. In addition to the use
>>>> of the source separation rule, it should also cover characters which were
>>>> added to the standard in error.
>>>>
>>>> In any event, I'm afraid that right now it's probably best not to rely
>>>> on it for anything.
>>>>
>>>> On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:
>>>>
>>>> Hi,
>>>> To clarify my question with an example :) The character 亀 (U+4E80) is
>>>> listed in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not
>>>> true. Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB),
>>>> but not vice versa. From the definitions of these variant types in UAX#38,
>>>> one would naturally expect them to be symmetrical, and both characters to
>>>> show each other as variants. There are quite a few other such cases,
>>>> although it does appear that in most cases the relation is symmetrical.
>>>> My reason for asking, BTW, is that I'm thinking of grouping characters
>>>> which are Z-variants of each other in some application, so I need to
>>>> understand whether Z-variants are expected to have clear "cliques" in which
>>>> each character is a Z-variant of all others.
>>>> I realize that the semantic variant relation, at least, is based on
>>>> external sources and not determined by Unicode; regarding Z-variants I'm
>>>> not sure. I'd like to know, though, whether the relation is expected to be
>>>> symmetrical, and the above cases are to be considered errors; or there is
>>>> some meaning to a one-directional relation; or something else.
>>>> On a side note, some Z-variants I've looked at seem to have very
>>>> different abstract shapes, in some cases looking more like
>>>> simplified/traditional pairs. As I said, I'm not clear on how they are
>>>> determined. Are they supposed to be exactly those pairs which would be
>>>> unified if it were not for the Source Separation Rule?
>>>>
>>>> TIA,
>>>> Uriah
>>>>
>>>>
>>>> =====
>>>> John H. Jenkins
>>>> jenkins@apple.com
>>>>
>>>>
>>>>
>>>
>>
>> =====
>> Hoani H. Tinikini
>>
>> John H. Jenkins
>> jenkins@apple.com
>>
>>
>>
>

text/plain attachment: short_definitions.txt
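
The asymmetry check discussed in the quoted messages can be sketched in a few
lines of Python 3. This is not the original cjklib-based script; it is a
stand-alone approximation that assumes variant data in the tab-separated form
"U+XXXX<TAB>kZVariant<TAB>U+YYYY ...", where a value may list several
space-separated targets and a target may carry a "<source" suffix. The
function names parse_variants and asymmetric are hypothetical.

```python
from collections import defaultdict


def parse_variants(lines, fields=("kZVariant", "kSemanticVariant")):
    """Map (field, source code point) to the set of variant code points it lists."""
    relations = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3 or parts[1] not in fields:
            continue
        source, field, value = parts
        for item in value.split():
            # Drop an optional "<source" suffix such as "<kMatthews".
            target = item.split("<", 1)[0]
            relations[(field, source)].add(target)
    return relations


def asymmetric(relations):
    """Return sorted (source, field, target) triples where target does not list source back."""
    missing = []
    for (field, source), targets in relations.items():
        for target in targets:
            # .get() avoids adding keys to the defaultdict while iterating over it.
            if source not in relations.get((field, target), set()):
                missing.append((source, field, target))
    return sorted(missing)


if __name__ == "__main__":
    # Illustrative sample modelled on the example in the thread;
    # it is not a claim about what any Unihan release actually contains.
    sample = [
        "U+4E80\tkZVariant\tU+9F9C",
        "U+8AAA\tkZVariant\tU+8AAC",
        "U+8AAC\tkZVariant\tU+8AAA",
    ]
    for source, field, target in asymmetric(parse_variants(sample)):
        print(source, field, target)
```

Each reported triple corresponds to one line of the results file described
above: a character, the variant type, and a variant that does not refer back
to it. Whether a given triple is a data error still has to be decided case by
case.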

This archive was generated by hypermail 2.1.5 : Fri Aug 20 2010 - 12:57:46 CDT