Re: Are Unihan variant relations expected to be symmetrical?

From: Mark Davis ☕ (
Date: Tue Aug 17 2010 - 13:44:32 CDT


    > One would be to use kIICore, since that theoretically flags the most
    > important characters.

    I would not recommend using kIICore for a measure of importance. I tried
    recently comparing those characters to the highest frequency Han characters
    on the web; it does not match well at all.
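One way such a comparison could be sketched (not necessarily how it was done here) is to measure what fraction of the top-k characters in a frequency ranking carry the kIICore flag. The function name and the toy data below are hypothetical; the set is not real kIICore membership and the list is not a real frequency ranking.

```python
def overlap_at_k(flagged, ranked, k):
    """Fraction of the top-k ranked characters that are in the flagged set."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for ch in top if ch in flagged) / len(top)


# Hypothetical toy data, for illustration only.
iicore = {"回", "歷", "曆"}               # stand-in for characters with a kIICore value
by_frequency = ["回", "历", "歷", "厉"]   # stand-in for a web frequency ranking

print(overlap_at_k(iicore, by_frequency, 4))
```

A low value at small k would indicate that the flag is a poor proxy for frequency.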


    *— Il meglio è l’inimico del bene —*

    On Tue, Aug 17, 2010 at 10:26, John H. Jenkins <> wrote:

    > On Aug 17, 2010, at 7:58 AM, Wolfgang Schmidle wrote:
    > > On 29.06.10 21:36, John H. Jenkins wrote:
    > >
    > >> The kZVariant field has bad data in it that we haven't had time to clean
    > up. It should, in theory, be symmetrical, and it should, in theory, contain
    > only unifiable forms, but as you note, it doesn't. In addition to the use
    > of the source separation rule, it should also cover characters which were
    > added to the standard in error.
    > >>
    > >> In any event, I'm afraid that right now it's probably best not to rely
    > on it for anything.
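A minimal sketch of how the asymmetry described above could be detected, assuming the tab-separated layout of Unihan_Variants.txt (code point, field name, space-separated values, each optionally followed by a "<source" suffix). The function names are hypothetical, and the sample lines are illustrative only; they are not copied from the Unihan database.

```python
from collections import defaultdict


def parse_variants(lines, field):
    """Map each code point to the set of targets listed for `field`."""
    table = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, name, value = line.split("\t", 2)
        if name != field:
            continue
        for item in value.split():
            # Drop any "<kSource" suffix, keeping only the code point.
            table[source].add(item.split("<", 1)[0])
    return table


def asymmetric_pairs(table):
    """Return sorted (a, b) pairs where a lists b but b does not list a."""
    return sorted(
        (a, b)
        for a, targets in list(table.items())
        for b in targets
        if a not in table.get(b, ())
    )


# Illustrative sample: one one-way link and one two-way link.
sample = [
    "U+6B74\tkZVariant\tU+6B77",
    "U+5415\tkZVariant\tU+5442",
    "U+5442\tkZVariant\tU+5415",
]
print(asymmetric_pairs(parse_variants(sample, "kZVariant")))
```

Every pair reported is a link that would need a reverse entry for the field to be symmetrical.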
    > >
    > >
    > > In the examples I have looked at, the Z-variants are many-to-one
    > relations, with all arrows pointing towards the standard character in the
    > respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However, you say that
    > Z-variants are supposed to be symmetrical, and everything else is bad data.
    > How, then, does one find the standard character? Do the "kIICore" characters
    > play a special role here?
    > >
    > Assuming the z-variant data were sufficiently reliable to be useful, then
    > there are a couple of approaches you could use. One would be to use
    > kIICore, since that theoretically flags the most important characters.
    > Otherwise, if you have some z-variants and one is in the Big Five and the
    > others aren't, then the one in the Big Five could be taken as standard for
    > traditional Chinese. You could also use GB0 as the standard for simplified
    > Chinese, or look at the z-variant on the lowest plane in CNS 11643, or
    > something like that.
    > In the end, however, which one is standard may end up being purely
    > arbitrary.
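The approaches listed above amount to trying a prioritised list of sources and taking the one variant that a source singles out. A rough sketch follows, assuming the membership sets have already been extracted; `pick_standard` is a hypothetical helper, and the membership data is made up for illustration rather than taken from Unihan.

```python
def pick_standard(candidates, membership, priority):
    """Return the one candidate singled out by the first decisive source.

    A source is decisive when exactly one candidate belongs to it.
    If no source decides, fall back to the lowest code point, which is
    an arbitrary but deterministic choice.
    """
    for source in priority:
        members = membership.get(source, set())
        hits = [c for c in candidates if c in members]
        if len(hits) == 1:
            return hits[0]
    return min(candidates)


# Made-up membership data, for illustration only.
membership = {
    "kIICore": {"U+6B77", "U+6B74"},  # both flagged, so not decisive
    "kBigFive": {"U+6B77"},           # only one present, so decisive
}
print(pick_standard(["U+6B74", "U+6B77"], membership, ["kIICore", "kBigFive"]))
```

Changing the order of `priority` (for example putting kGB0 first for simplified Chinese) changes the answer, which is the sense in which the choice may be arbitrary.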
    > > In general, how can searching in Chinese text be formalised? It seems
    > that the Chinese characters cannot easily be divided into equivalence
    > classes where one character in the class should find any other character in
    > this class. If I search for 歴 6B74, I also want to find the semantic variant
    > 歷 6B77 (i.e. the standard character) as well as the simplified character 历
    > 5386. However, if I search for 历 5386, I may want to find the semantic
    > variant 厲 53B2 (which is based on Fenn, but not Lau, Matthews or
    > Meyer-Wempe), but definitely not the simplified character 厉 5389. The
    > difference is that there are additional Z-variant connections in the first
    > case.
    > >
    > > Does it make sense to create equivalence classes from the Z-variants?
    > Not with the data as it stands.
    > > As an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77 and 历 5386
    > (not counting the compatibility character 歷 F98C), and the 曆 66C6-class
    would comprise 暦 66A6 and 曆 66C6 (not counting 曆 F98B). In particular, 曆 66C6
    > would not find 歷 6B77. However, both characters have the same simplified
    > character equivalent. Should these classes be unified for searching? Or
    > should it make a difference if I search for a traditional or a simplified
    > character, i.e. searching for 历 5386 finds the 曆 66C6-class as well as the 歷
    > 6B77-class?
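The effect described in this question can be illustrated with a small union-find sketch. It treats every link as symmetric and transitive, which is exactly the assumption under discussion, so it shows what happens if classes are formed, not whether they should be. The pairs are taken from the example in the question, not from the Unihan files, and `classes` is a hypothetical helper.

```python
def classes(pairs):
    """Group items into classes, treating each pair as an equivalence."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


# The two classes as described in the question.
links = [("6B74", "6B77"), ("5386", "6B77"), ("66A6", "66C6")]
print(classes(links))

# Also linking 5386 to 66C6 (its other traditional form) merges them.
print(classes(links + [("5386", "66C6")]))
```

With the extra link, a search for any of the five characters would find all of the others, including 66C6 when searching for 6B77.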
    > >
    > > Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a semantic
    > variant of 曆 66C6? Is it simply because no dictionary has declared them to
    > be equivalent, even though the respective relationships are obviously the
    > same?
    > Yes. One thing that makes this whole process even more complicated than it
    > would otherwise be is that different sources make different judgments as to
    > when two characters are variants of each other. At the moment, this is
    > restricted to data from some of the smaller dictionaries. If and when we
    > can get the variant data from one of the larger dictionaries in place (such
    > as the Hanyu Da Zidian or the Kangxi), then an implementer can simply say
    > that they are normalizing to HYDZD or KX and ignore the remaining variant
    > data.
    > > And how can two characters such as 歴 6B74 and 歷 6B77 be Z-variants if
    > they do not have the same number of strokes? All unification rules seem to
    > leave the number of strokes unchanged, as far as the component is not on the
    > Annex S list of unifiable characters (such as 吕 5415 and 呂 5442).
    > >
    > This is an example of bad data in the kZVariant field.
    > > According to UAX#38, the "kSemanticVariant" relation means "two
    > characters have identical meanings". Thus, technically it should be
    > transitive (as opposed to "kSpecializedSemanticVariant"), but for example 厤
    > 53A4 (kDefinition "to calculate; the calendar") is connected via 曆 66C6
    > ("calendar, era") with 歷 6B77 ("take place, past, history"), but there is no
    > direct connection. Why?
    > >
    > Our goal at this point is to define the two semantic variant
    > fields strictly in terms of source dictionaries. In this particular case,
    > Lau and Mathews define U+53A4 and U+66C6 as equivalent, whereas Meyer-Wempe
    > defines U+66C6 and U+6B77 as equivalent. You, the implementer, have the
    > option of deciding which authority you want to base your implementation on.
    > And, unfortunately, the dictionary-makers aren't always going to be careful
    > to provide transitivity (or even reflexivity) in their variant data.
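Choosing an authority can be sketched as filtering each kSemanticVariant value by its source tags. This assumes the value syntax of a code point followed by "<" and a comma-separated list of sources; the handling of a ":" suffix on a source is a tolerance added here as an assumption. `targets_for` is a hypothetical helper, and the sample values are reconstructed from the statements in this thread, not copied from the Unihan files.

```python
def targets_for(value, authority):
    """Code points in a kSemanticVariant value that cite `authority`."""
    out = []
    for item in value.split():
        code, _, sources = item.partition("<")
        # Ignore anything after ":" in a source tag (assumed optional suffix).
        names = [s.split(":", 1)[0] for s in sources.split(",")]
        if authority in names:
            out.append(code)
    return out


# Values reconstructed from the thread, for illustration only.
semantic = {
    "U+66C6": "U+53A4<kLau,kMatthews",
    "U+5386": "U+53B2<kFenn",
}

for authority in ("kLau", "kFenn"):
    kept = {cp: targets_for(v, authority) for cp, v in semantic.items()}
    print(authority, kept)
```

An implementation that follows only one dictionary would keep the links that dictionary vouches for and drop the rest.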
    > > And why does Apple's character palette regard 历 5386 as related to 厯
    > 53AF, when in fact no arrow leads from or to 厯 53AF? Or rather, where does
    > this knowledge (see e.g.
    > come from?
    > Information on Apple's source is proprietary. This is true in general of
    > actual implementations. Unihan is rather unusual in at least trying to
    > state the authority based upon which the data is derived.
    > Defining equivalence or normalization for Han is, in general, a very
    > difficult task, not only because of competing authorities but also because
    > of competing languages; normalizing text for Japanese would result in
    > something different from the same text normalized for Chinese. Given the
    > huge number of characters involved, the different competing needs and
    > competing authorities, there isn't a good general solution in place. The
    > goal in Unihan is to provide solid data for implementers to use, but
    > unfortunately we're not quite there yet.
    > =====
    > Hoani H. Tinikini
    > John H. Jenkins
