From: John H. Jenkins (email@example.com)
Date: Tue Aug 17 2010 - 12:26:16 CDT
On Aug 17, 2010, at 7:58 AM, Wolfgang Schmidle wrote:
> On 29.06.10 21:36, John H. Jenkins wrote:
>> The kZVariant field has bad data in it that we haven't had time to clean up. It should, in theory, be symmetrical, and it should, in theory, contain only unifiable forms, but as you note, it doesn't. In addition to the use of the source separation rule, it should also cover characters which were added to the standard in error.
>> In any event, I'm afraid that right now it's probably best not to rely on it for anything.
> In the examples I have looked at, the Z-variants are many-to-one relations, with all arrows pointing towards the standard character in the respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However, you say that Z-variants are supposed to be symmetrical, and everything else is bad data. How, then, does one find the standard character? Do the "kIICore" characters play a special role here?
Assuming the z-variant data were sufficiently reliable to be useful, then there are a couple of approaches you could use. One would be to use kIICore, since that theoretically flags the most important characters. Otherwise, if you have some z-variants and one is in the Big Five and the others aren't, then the one in the Big Five could be taken as standard for traditional Chinese. You could also use GB0 as the standard for simplified Chinese, or look at the z-variant on the lowest plane in CNS 11643, or something like that.
In the end, however, which one is standard may end up being purely arbitrary.
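As a rough illustration of the tie-breaking approaches above, here is a minimal Python sketch. The function name pick_standard, the table SAMPLE_PROPS, and the priority list are all hypothetical; kIICore, kBigFive and kGB0 are Unihan field names, but the assignments in the sample table are illustrative values that have not been checked against the database.

```python
from typing import Dict, Iterable, Optional, Sequence, Set

# Hypothetical table: code point -> names of the Unihan fields it carries.
# Illustrative values only; real data would be loaded from the Unihan files.
SAMPLE_PROPS: Dict[int, Set[str]] = {
    0x6B77: {"kBigFive", "kIICore"},
    0x5386: {"kGB0", "kIICore"},
    0x6B74: set(),
}


def pick_standard(variants: Iterable[int],
                  props: Dict[int, Set[str]],
                  priority: Sequence[str]) -> Optional[int]:
    """Return the variant singled out by the first field in `priority`
    that is carried by exactly one candidate, or None if no field
    settles the question."""
    candidates = sorted(set(variants))
    for field in priority:
        hits = [cp for cp in candidates if field in props.get(cp, set())]
        if len(hits) == 1:
            return hits[0]
    return None


group = [0x6B74, 0x6B77, 0x5386]
# Traditional Chinese preference: Big Five membership decides.
traditional = pick_standard(group, SAMPLE_PROPS, ["kBigFive"])
# Simplified Chinese preference: GB0 membership decides.
simplified = pick_standard(group, SAMPLE_PROPS, ["kGB0"])
```

Note that a field carried by more than one candidate (kIICore in the sample table) does not decide anything on its own, which is one way the choice can end up arbitrary.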
> In general, how can searching in Chinese text be formalised? It seems that the Chinese characters cannot easily be divided into equivalence classes where one character in the class should find any other character in this class. If I search for 歴 6B74, I also want to find the semantic variant 歷 6B77 (i.e. the standard character) as well as the simplified character 历 5386. However, if I search for 历 5386, I may want to find the semantic variant 厲 53B2 (which is based on Fenn, but not Lau, Matthews or Meyer-Wempe), but definitely not the simplified character 厉 5389. The difference is that there are additional Z-variant connections in the first case.
> Does it make sense to create equivalence classes from the Z-variants?
Not with the data as it stands.
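To make concrete what "as it stands" would have to change, here is a small Python sketch that first reports one-way arrows in a variant relation and then groups characters into classes by treating every arrow as undirected. The names Relation, asymmetric_pairs and equivalence_classes are hypothetical, and SAMPLE_REL only mimics the kind of one-way arrows described in this thread; it is not an extract from the kZVariant field.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# code point -> code points it lists as variants
Relation = Dict[int, Set[int]]

# Illustrative one-way arrows, not real kZVariant data.
SAMPLE_REL: Relation = {
    0x6B74: {0x6B77},
    0x66A6: {0x66C6},
}


def asymmetric_pairs(rel: Relation) -> List[Tuple[int, int]]:
    """Pairs (a, b) where a lists b but b does not list a."""
    return sorted((a, b)
                  for a, targets in rel.items()
                  for b in targets
                  if a not in rel.get(b, set()))


def equivalence_classes(rel: Relation) -> List[List[int]]:
    """Connected components of the relation, ignoring arrow direction
    (a simple union-find with path compression)."""
    parent: Dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, targets in rel.items():
        find(a)
        for b in targets:
            parent[find(a)] = find(b)

    groups: Dict[int, List[int]] = defaultdict(list)
    for cp in parent:
        groups[find(cp)].append(cp)
    return sorted(sorted(g) for g in groups.values())


one_way = asymmetric_pairs(SAMPLE_REL)
classes = equivalence_classes(SAMPLE_REL)
```

Grouping this way is only meaningful if the input contains nothing but genuinely unifiable forms; a single bad arrow merges two classes that should stay apart.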
> As an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77 and 历 5386 (not counting the compatibility character 歷 F98C), and the 曆 66C6-class would comprise 暦 66A6 and 曆 66C6 (not counting 曆 F98B). In particular, 曆 66C6 would not find 歷 6B77. However, both characters have the same simplified character equivalent. Should these classes be unified for searching? Or should it make a difference if I search for a traditional or a simplified character, i.e. searching for 历 5386 finds the 曆 66C6-class as well as the 歷 6B77-class?
> Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a semantic variant of 曆 66C6? Is it simply because no dictionary has declared them to be equivalent, even though the respective relationships are obviously the same?
Yes. One thing that makes this whole process even more complicated than it would otherwise be is that different sources make different judgments as to when two characters are variants of each other. At the moment, this is restricted to data from some of the smaller dictionaries. If and when we can get the variant data from one of the larger dictionaries in place (such as the Hanyu Da Zidian or the Kangxi), then an implementer can simply say that they are normalizing to HYDZD or KX and ignore the remaining variant data.
> And how can two characters such as 歴 6B74 and 歷 6B77 be Z-variants if they do not have the same number of strokes? All unification rules seem to leave the number of strokes unchanged, as far as the component is not on the Annex S list of unifiable characters (such as 吕 5415 and 呂 5442).
This is an example of bad data in the kZVariant field.
> According to UAX#38, the "kSemanticVariant" relation means "two characters have identical meanings". Thus, technically it should be transitive (as opposed to "kSpecializedSemanticVariant"), but for example 厤 53A4 (kDefinition "to calculate; the calendar") is connected via 曆 66C6 ("calendar, era") with 歷 6B77 ("take place, past, history"), but there is no direct connection. Why?
Our goal at this point is to define the two semantic variant fields strictly in terms of source dictionaries. In this particular case, Lau and Mathews define U+53A4 and U+66C6 as equivalent, whereas Meyer-Wempe defines U+66C6 and U+6B74 as equivalent. You, the implementer, have the option of deciding which authority you want to base your implementation on.
And, unfortunately, the dictionary-makers aren't always going to be careful to provide transitivity (or even reflexivity) in their variant data.
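For what it's worth, choosing an authority can be sketched in a few lines of Python. This assumes field values of the form "U+66C6<kLau,kMatthews", with the source tags after the "<"; the helper names (parse_value, edges_for, missing_transitive) are hypothetical, and SAMPLE simply restates the case above in that form. The tag spellings and the direction of the arrows are assumptions, not a copy of the actual Unihan entries.

```python
import re
from typing import Dict, Set, Tuple

# One entry: a code point, optionally followed by "<" and source tags.
ENTRY = re.compile(r"U\+([0-9A-F]{4,5})(?:<(\S+))?")

# Illustrative restatement of the case above, not real Unihan data.
SAMPLE: Dict[int, str] = {
    0x53A4: "U+66C6<kLau,kMatthews",
    0x66C6: "U+6B74<kMeyerWempe",
}


def parse_value(value: str) -> Dict[int, Set[str]]:
    """Map each target code point in a field value to its source tags."""
    out: Dict[int, Set[str]] = {}
    for token in value.split():
        m = ENTRY.fullmatch(token)
        if m is None:
            continue  # skip anything that does not look like an entry
        tags = m.group(2)
        # Drop any ":..." suffix that may follow a tag.
        sources = {t.split(":")[0] for t in tags.split(",")} if tags else set()
        out[int(m.group(1), 16)] = sources
    return out


def edges_for(data: Dict[int, str], authority: str) -> Set[Tuple[int, int]]:
    """Directed variant pairs attested by one source dictionary."""
    return {(cp, target)
            for cp, value in data.items()
            for target, sources in parse_value(value).items()
            if authority in sources}


def missing_transitive(edges: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Pairs (a, c) implied by a->b and b->c but absent from the data."""
    return {(a, c)
            for a, b in edges
            for b2, c in edges
            if b == b2 and a != c and (a, c) not in edges}


lau_only = edges_for(SAMPLE, "kLau")
everything = lau_only | edges_for(SAMPLE, "kMeyerWempe")
gaps = missing_transitive(everything)
```

Restricting to one authority gives a smaller but self-consistent relation; mixing authorities is where gaps like the one reported in `gaps` show up.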
> And why does Apple's character palette regard 历 5386 as related to 厯 53AF, when in fact no arrow leads from or to 厯 53AF? Or rather, where does this knowledge (see e.g. http://dict.variants.moe.edu.tw/yitia/fra/fra02074.htm) come from?
Information on Apple's source is proprietary. This is true in general of actual implementations. Unihan is rather unusual in at least trying to state the authority on which the data is based.
Defining equivalence or normalization for Han is, in general, a very difficult task, not only because of competing authorities but also because of competing languages; normalizing text for Japanese would result in something different from the same text normalized for Chinese. Given the huge number of characters involved, the different competing needs and competing authorities, there isn't a good general solution in place. The goal in Unihan is to provide solid data for implementers to use, but unfortunately we're not quite there yet.
Hoani H. Tinikini
John H. Jenkins