Re: Are Unihan variant relations expected to be symmetrical?

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Aug 17 2010 - 19:28:29 CDT


    On 8/17/2010 11:44 AM, Mark Davis ☕ wrote:
    > > One would be to use kIICore, since that theoretically flags the
    > > most important characters.
    >
    > I would not recommend using kIICore for a measure of importance. I
    > tried recently comparing those characters to the highest frequency Han
    > characters on the web; it does not match well at all.
    Presumably, the lowest frequency characters in IICore are some that were
    included in order to give closure over some pre-existing sets.
    Therefore, one would expect that the highest frequency non-IICore
    characters are more common than the least frequent IICore characters.
    The question, then, is by how much.

    What is the percentile of IICore characters that are less frequent than
    the most frequent non-IICore character?
    What's the number of non-IICore characters that are more frequent than
    the least frequent, say 20% of IICore characters? That's an arbitrary
    cut-off, motivated by the usual 80/20 rule.
    Are there any non-IICore characters that are more frequent than 80% of
    the IICore characters?
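
A minimal sketch of how the three questions above could be computed. The inputs `freq` (character to count) and `iicore` (set of flagged characters) are hypothetical, as are the function name and the result keys; the 20% and 80% cut-offs are the arbitrary ones mentioned above.

```python
from bisect import bisect_left, bisect_right


def iicore_overlap_stats(freq, iicore, tail=0.20, head=0.80):
    """Compare frequency counts inside and outside a flagged set.

    freq   -- dict mapping character -> observed count (hypothetical input)
    iicore -- set of characters flagged kIICore (hypothetical input)
    """
    core = sorted(freq.get(ch, 0) for ch in iicore)
    other = sorted(n for ch, n in freq.items() if ch not in iicore)
    if not core or not other:
        raise ValueError("need at least one character in each group")

    def core_count_at(fraction):
        # Highest count found within the least frequent `fraction` of IICore.
        k = max(1, int(len(core) * fraction))
        return core[k - 1]

    def others_above(threshold):
        # Number of non-IICore characters with a count strictly above threshold.
        return len(other) - bisect_right(other, threshold)

    return {
        # Question 1: share of IICore below the top non-IICore character.
        "pct_core_below_top_other": 100.0 * bisect_left(core, other[-1]) / len(core),
        # Question 2: non-IICore characters above the bottom 20% of IICore.
        "others_above_core_tail": others_above(core_count_at(tail)),
        # Question 3: non-IICore characters above 80% of IICore.
        "others_above_core_head": others_above(core_count_at(head)),
    }
```

Ties are counted conservatively here: a character only counts as "more frequent" when its count is strictly higher.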

    How does the data change if you make all counts with overlapping
    standard-deviations "equal" for comparison purposes? I'd use a Poisson
    statistic, so the standard deviation is the square root (or next higher
    integer).
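
As a rough sketch of that comparison: take the standard deviation of a count n as the square root of n rounded up, and call two counts "equal" when their one-sigma intervals overlap. The function names are made up for illustration.

```python
import math


def poisson_sigma(count):
    """Square root of the count, rounded up to the next integer."""
    if count < 0:
        raise ValueError("counts cannot be negative")
    return 0 if count == 0 else math.isqrt(count - 1) + 1


def indistinguishable(a, b):
    """True when [a - sigma_a, a + sigma_a] and [b - sigma_b, b + sigma_b] overlap."""
    return abs(a - b) <= poisson_sigma(a) + poisson_sigma(b)
```

Note that this kind of "equality" is not transitive: a chain of pairwise-overlapping counts can connect two counts that are clearly different, so it is better suited to pairwise comparison than to bucketing.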

    Is the frequency distribution for non-IICore characters such that it
    starts with robust frequencies, i.e. far enough away from the noisy tail
    that you could expect the numbers not to fluctuate too much (hundreds or
    thousands of hits in the sample)?

    Specifics would be, oh, so interesting.

    A./
    >
    > Mark
    >
    > /— Il meglio è l’inimico del bene —/
    >
    >
    > On Tue, Aug 17, 2010 at 10:26, John H. Jenkins <jenkins@apple.com> wrote:
    >
    >
    > On Aug 17, 2010, at 7:58 AM, Wolfgang Schmidle wrote:
    >
    > > Am 29.06.10 21:36, schrieb John H. Jenkins:
    > >
    > >> The kZVariant field has bad data in it that we haven't had time
    > to clean up. It should, in theory, be symmetrical, and it should,
    > in theory, contain only unifiable forms, but as you note, it
    > doesn't. In addition to the use of the source separation rule, it
    > should also cover characters which were added to the standard in
    > error.
    > >>
    > >> In any event, I'm afraid that right now it's probably best not
    > to rely on it for anything.
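
A small sketch of how the asymmetry described above could be detected, assuming the variant field has already been loaded into a dict mapping each character to the set of its listed variants. That dict and the function name are hypothetical, not part of Unihan.

```python
def asymmetric_links(variants):
    """Return (a, b) pairs where a lists b as a variant but b does not list a.

    variants -- dict mapping a character to the set of its listed variants
                (hypothetical, pre-parsed form of the field)
    """
    return sorted(
        (a, b)
        for a, targets in variants.items()
        for b in targets
        if a not in variants.get(b, ())
    )
```

An empty result would mean the relation is symmetrical; it says nothing about whether the listed forms are actually unifiable.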
    > >
    > >
    > > In the examples I have looked at, the Z-variants are many-to-one
    > relations, with all arrows pointing towards the standard character
    > in the respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However,
    > you say that Z-variants are supposed to be symmetrical, and
    > everything else is bad data. How, then, does one find the standard
    > character? Do the "kIICore" characters play a special role here?
    > >
    >
    > Assuming the z-variant data were sufficiently reliable to be
    > useful, then there are a couple of approaches you could use. One
    > would be to use kIICore, since that theoretically flags the most
    > important characters. Otherwise, if you have some z-variants and
    > one is in the Big Five and the others aren't, then the one in the
    > Big Five could be taken as standard for traditional Chinese. You
    > could also use GB0 as the standard for simplified Chinese, or look
    > at the z-variant on the lowest plane in CNS 11643, or something
    > like that.
    >
    > In the end, however, which one is standard may end up being purely
    > arbitrary.
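
One way to sketch the selection described above: pass in membership sets in order of preference (for example kIICore first, then Big Five or GB0 depending on the target), narrow the candidates set by set, and fall back to an arbitrary choice. The function name, the shape of the inputs, and the final lowest-value tie-break are all assumptions made for illustration.

```python
def pick_standard(variant_class, preferred_sets):
    """Pick one representative from a class of variants.

    variant_class  -- iterable of characters considered variants of each other
    preferred_sets -- list of sets, most preferred first (hypothetical inputs,
                      e.g. kIICore members, then Big Five or GB0 members)
    """
    candidates = sorted(variant_class)
    if not candidates:
        raise ValueError("empty variant class")
    for members in preferred_sets:
        hits = [ch for ch in candidates if ch in members]
        if len(hits) == 1:
            return hits[0]
        if hits:
            # Several candidates qualify: keep only those and try the next set.
            candidates = hits
    # Nothing singled out one character; this last step is purely arbitrary.
    return candidates[0]
```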
    >
    > > In general, how can searching in Chinese text be formalised? It
    > seems that the Chinese characters cannot easily be divided into
    > equivalence classes where one character in the class should find
    > any other character in this class. If I search for 歴 6B74, I also
    > want to find the semantic variant 歷 6B77 (i.e. the standard
    > character) as well as the simplified character 历 5386. However,
    > if I search for 历 5386, I may want to find the semantic variant
    > 厲 53B2 (which is based on Fenn, but not Lau, Matthews or
    > Meyer-Wempe), but definitely not the simplified character 厉 5389.
    > The difference is that there are additional Z-variant connections
    > in the first case.
    > >
    > > Does it make sense to create equivalence classes from the
    > Z-variants?
    >
    > Not with the data as it stands.
    >
    > > As an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77
    > and 历 5386 (not counting the compatibility character 歷 F98C),
    > and the 曆 66C6-class would comprise 暦 66A6 and 曆 66C6 (not
    > counting 曆 F98B). In particular, 曆 66C6 would not find 歷 6B77.
    > However, both characters have the same simplified character
    > equivalent. Should these classes be unified for searching? Or
    > should it make a difference if I search for a traditional or a
    > simplified character, i.e. searching for 历 5386 finds the 曆
    > 66C6-class as well as the 歷 6B77-class?
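
The effect of unifying the classes can be sketched with a union-find over pairwise links. The code points in the usage note are the ones named in the example above; which links to feed in is the implementer's decision, and the function name is made up for illustration.

```python
def build_classes(pairs):
    """Group items into classes, treating every pair as a symmetric link."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return list(classes.values())
```

With the links (0x6B74, 0x6B77), (0x5386, 0x6B77) and (0x66A6, 0x66C6) this gives two classes. Adding (0x5386, 0x66C6), because U+5386 is also the simplified form of U+66C6, merges them into a single class of five, which is exactly the unification being asked about.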
    > >
    > > Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a
    > semantic variant of 曆 66C6? Is it simply because no dictionary
    > has declared them to be equivalent, even though the respective
    > relationships are obviously the same?
    >
    > Yes. One thing that makes this whole process even more
    > complicated than it would otherwise be is that different sources
    > make different judgments as to when two characters are variants of
    > each other. At the moment, this is restricted to data from some
    > of the smaller dictionaries. If and when we can get the variant
    > data from one of the larger dictionaries in place (such as the
    > Hanyu Da Zidian or the Kangxi), then an implementer can simply say
    > that they are normalizing to HYDZD or KX and ignore the remaining
    > variant data.
    >
    > > And how can two characters such as 歴 6B74 and 歷 6B77 be
    > Z-variants if they do not have the same number of strokes? All
    > unification rules seem to leave the number of strokes unchanged,
    > as far as the component is not on the Annex S list of unifiable
    > characters (such as 吕 5415 and 呂 5442).
    > >
    >
    > This is an example of bad data in the kZVariant field.
    >
    > > According to UAX#38, the "kSemanticVariant" relation means "two
    > characters have identical meanings". Thus, technically it should
    > be transitive (as opposed to "kSpecializedSemanticVariant"), but
    > for example 厤 53A4 (kDefinition "to calculate; the calendar") is
    > connected via 曆 66C6 ("calendar, era") with 歷 6B77 ("take place,
    > past, history"), but there is no direct connection. Why?
    > >
    >
    > Our goal at this point is to define the two semantic variant fields
    > strictly in terms of source dictionaries. In this
    > particular case, Lau and Mathews define U+53A4 and U+66C6 as
    > equivalent, whereas Meyer-Wempe defines U+66C6 and U+6B74 as
    > equivalent. You, the implementer, have the option of deciding
    > which authority you want to base your implementation on.
    >
    > And, unfortunately, the dictionary-makers aren't always going to
    > be careful to provide transitivity (or even reflexivity) in their
    > variant data.
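
A sketch of basing an implementation on one authority, assuming a field value written as space-separated entries of the form `U+XXXX<kSource1,kSource2`, optionally with a `:` suffix after a source tag. That syntax, the tag names, and the function name are assumptions here and should be checked against UAX #38 before use.

```python
import re

# One entry: a code point, optionally followed by "<" and comma-separated tags.
ENTRY = re.compile(r"U\+([0-9A-F]{4,6})(?:<(\S+))?")


def variants_by_authority(field_value, authority):
    """Return the code points in field_value that cite the given source tag."""
    result = []
    for code, sources in ENTRY.findall(field_value):
        # Drop any ":" suffix so "kFenn:T" still matches "kFenn".
        tags = {s.split(":")[0] for s in sources.split(",")} if sources else set()
        if authority in tags:
            result.append(int(code, 16))
    return result
```

For instance, a value constructed from the case above, `"U+53A4<kLau,kMatthews U+6B74<kMeyerWempe"`, yields only U+53A4 when filtered by `kLau` and only U+6B74 when filtered by `kMeyerWempe`.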
    >
    > > And why does Apple's character palette regard 历 5386 as related
    > to 厯 53AF, when in fact no arrow leads from or to 厯 53AF? Or
    > rather, where does this knowledge (see e.g.
    > http://dict.variants.moe.edu.tw/yitia/fra/fra02074.htm) come from?
    >
    >
    > Information on Apple's source is proprietary. This is true in
    > general of actual implementations. Unihan is rather unusual in at
    > least trying to state the authority based upon which the data is
    > derived.
    >
    > Defining equivalence or normalization for Han is, in general, a
    > very difficult task, not only because of competing authorities but
    > also because of competing languages; normalizing text for Japanese
    > would result in something different from the same text normalized
    > for Chinese. Given the huge number of characters involved, the
    > different competing needs and competing authorities, there isn't a
    > good general solution in place. The goal in Unihan is to provide
    > solid data for implementers to use, but unfortunately we're not
    > quite there yet.
    >
    > =====
    > Hoani H. Tinikini
    > John H. Jenkins
    > jenkins@apple.com
    >


