Re: Are Unihan variant relations expected to be symmetrical?

From: Wolfgang Schmidle (
Date: Tue Aug 17 2010 - 08:58:35 CDT


      On 29.06.10 21:36, John H. Jenkins wrote:

    > The kZVariant field has bad data in it that we haven't had time to
    > clean up. It should, in theory, be symmetrical, and it should, in
    > theory, contain only unifiable forms, but as you note, it doesn't. In
    > addition to the use of the source separation rule, it should also
    > cover characters which were added to the standard in error.
    > In any event, I'm afraid that right now it's probably best not to rely
    > on it for anything.

    In the examples I have looked at, the Z-variants are many-to-one
    relations, with all arrows pointing towards the standard character in
    the respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However, you say
    that Z-variants are supposed to be symmetrical, and everything else is
    bad data. How, then, does one find the standard character? Do the
    "kIICore" characters play a special role here?
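
    To make the symmetry question concrete, here is a minimal sketch. It
    assumes the Z-variant arrows have already been loaded into a dict of
    code points; the two arrows are toy data following the description
    above, not read from the Unihan database, and the function names are
    made up for illustration.

```python
# Toy data (assumption): directed Z-variant arrows, source -> set of targets.
z_variant = {
    0x6B74: {0x6B77},  # 歴 -> 歷
    0x66A6: {0x66C6},  # 暦 -> 曆
}

def asymmetric_pairs(relation):
    """Return (source, target) arrows whose reverse arrow is missing."""
    return sorted(
        (src, dst)
        for src, targets in relation.items()
        for dst in targets
        if src not in relation.get(dst, set())
    )

def standard_candidates(relation):
    """Return code points that receive arrows but emit none (sinks)."""
    targets = set()
    for dsts in relation.values():
        targets |= dsts
    return sorted(t for t in targets if not relation.get(t))

for src, dst in asymmetric_pairs(z_variant):
    print(f"U+{src:04X} -> U+{dst:04X} has no reverse arrow")
print([f"U+{cp:04X}" for cp in standard_candidates(z_variant)])
```

    If the data were made fully symmetrical, every arrow would gain a
    reverse and the sink test would no longer single out a standard
    character, which is exactly the problem raised above.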

    In general, how can searching in Chinese text be formalised? It seems
    that the Chinese characters cannot easily be divided into equivalence
    classes where one character in the class should find any other character
    in this class. If I search for 歴 6B74, I also want to find the semantic
    variant 歷 6B77 (i.e. the standard character) as well as the simplified
    character 历 5386. However, if I search for 历 5386, I may want to find
    the semantic variant 厲 53B2 (which is based on Fenn, but not Lau,
    Matthews or Meyer-Wempe), but definitely not the simplified character 厉
    5389. The difference is that there are additional Z-variant connections
    in the first case.

    Does it make sense to create equivalence classes from the Z-variants? As
    an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77 and 历
    5386 (not counting the compatibility character 歷 F98C), and the 曆
    66C6-class would comprise 暦 66A6 and 曆 66C6 (not counting 曆 F98B). In
    particular, 曆 66C6 would not find 歷 6B77. However, both characters
    have the same simplified character equivalent. Should these classes be
    unified for searching? Or should it make a difference if I search for a
    traditional or a simplified character, i.e. searching for 历 5386 finds
    the 曆 66C6-class as well as the 歷 6B77-class?
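
    The class-building idea can be sketched with a small union-find. This
    is only an illustration under stated assumptions: the pairs below are
    toy data mirroring the example in this message (not extracted from
    Unihan), and build_classes is a hypothetical helper name.

```python
def build_classes(pairs):
    """Group code points into equivalence classes from undirected pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = {}
    for cp in list(parent):
        classes.setdefault(find(cp), set()).add(cp)
    return sorted(sorted(members) for members in classes.values())

# Toy data (assumption): the two classes as described in the text.
z_pairs = [(0x6B74, 0x6B77), (0x5386, 0x6B77), (0x66A6, 0x66C6)]
# Toy data (assumption): 曆 66C6 also simplifies to 历 5386.
simplified_link = [(0x66C6, 0x5386)]

print(build_classes(z_pairs))                    # two separate classes
print(build_classes(z_pairs + simplified_link))  # one merged class
```

    Adding the single simplified link merges the two classes, so a plain
    equivalence-class model cannot express "历 5386 finds both classes,
    but 曆 66C6 does not find 歷 6B77"; that would need a directed lookup
    instead.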

    Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a semantic
    variant of 曆 66C6? Is it simply because no dictionary has declared them
    to be equivalent, even though the respective relationships are obviously
    the same? And how can two characters such as 歴 6B74 and 歷 6B77 be
    Z-variants if they do not have the same number of strokes? All
    unification rules seem to leave the number of strokes unchanged, as long
    as the component is not on the Annex S list of unifiable characters
    (such as 吕 5415 and 呂 5442).

    According to UAX #38, the "kSemanticVariant" relation means "two
    characters have identical meanings". Thus, technically, it should be
    transitive (as opposed to "kSpecializedSemanticVariant"). However,
    厤 53A4 (kDefinition "to calculate; the calendar") is connected with
    歷 6B77 ("take place, past, history") only via 曆 66C6 ("calendar,
    era"); there is no direct connection. Why?
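
    A check for such gaps could look like the following sketch. It treats
    the relation as undirected and uses only the two links mentioned in
    this message as toy data (an assumption, not a dump of the actual
    kSemanticVariant field); missing_transitive_links is a hypothetical
    name.

```python
from itertools import combinations

def missing_transitive_links(edges):
    """Return pairs (a, c) that share a neighbour but are not linked directly."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    missing = set()
    for linked in neighbours.values():
        # Any two characters linked to the same character should be linked too.
        for a, c in combinations(sorted(linked), 2):
            if c not in neighbours[a]:
                missing.add((a, c))
    return sorted(missing)

# Toy data (assumption): 厤 53A4 ~ 曆 66C6 and 曆 66C6 ~ 歷 6B77.
semantic_edges = [(0x53A4, 0x66C6), (0x66C6, 0x6B77)]

for a, c in missing_transitive_links(semantic_edges):
    print(f"U+{a:04X} and U+{c:04X} are linked only indirectly")
```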

    And why does Apple's character palette regard 历 5386 as related to 厯
    53AF, when in fact no arrow leads from or to 厯 53AF? Or rather, where
    does this knowledge (see e.g. come from?

