From: Wolfgang Schmidle (wschmidle@mpiwg-berlin.mpg.de)
Date: Tue Aug 17 2010 - 08:58:35 CDT
On 29.06.10 21:36, John H. Jenkins wrote:
> The kZVariant field has bad data in it that we haven't had time to 
> clean up.  It should, in theory, be symmetrical, and it should, in 
> theory, contain only unifiable forms, but as you note, it doesn't.  In 
> addition to the use of the source separation rule, it should also 
> cover characters which were added to the standard in error.
>
> In any event, I'm afraid that right now it's probably best not to rely 
> on it for anything.
In the examples I have looked at, the Z-variants are many-to-one 
relations, with all arrows pointing towards the standard character in 
the respective class, e.g. 曆 66C6, 歷 6B77, 回 56DE. However, you say 
that Z-variants are supposed to be symmetrical, and everything else is 
bad data. How, then, does one find the standard character? Do the 
"kIICore" characters play a special role here?
In general, how can searching in Chinese text be formalised? It seems 
that Chinese characters cannot easily be divided into equivalence 
classes in which a search for one character finds every other character 
in the class. If I search for 歴 6B74, I also want to find the semantic 
variant 歷 6B77 (i.e. the standard character) as well as the simplified 
character 历 5386. However, if I search for 历 5386, I may want to find 
the semantic variant 厲 53B2 (which is based on Fenn, but not Lau, 
Matthews or Meyer-Wempe), but definitely not the simplified character 厉 
5389. The difference is that there are additional Z-variant connections 
in the first case.
Does it make sense to create equivalence classes from the Z-variants? As 
an example, the 歷 6B77-class would comprise 歴 6B74, 歷 6B77 and 历 
5386 (not counting the compatibility character 歷 F98C), and the 曆 
66C6-class would comprise 暦 66A6 and 曆 66C6 (not counting 曆 F98B). In 
particular, 曆 66C6 would not find 歷 6B77. However, both characters 
have the same simplified character equivalent. Should these classes be 
unified for searching? Or should it make a difference if I search for a 
traditional or a simplified character, i.e. searching for 历 5386 finds 
the 曆 66C6-class as well as the 歷 6B77-class?
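To make the question concrete, here is a minimal sketch of how such classes could be built with a union-find structure. The link lists are only the examples from this message typed in by hand, not data read from the Unihan database, and the names build_classes, Z_LINKS and SIMPLIFIED_LINKS are made up for illustration. Treating every link as symmetric is an assumption as well.

```python
from collections import defaultdict


def build_classes(pairs):
    """Group code points into classes, treating each pair as a symmetric link."""
    parent = {}

    def find(x):
        # Register unseen code points as their own root, then walk up
        # to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for code_point in list(parent):
        groups[find(code_point)].add(code_point)

    # Map every member to the (frozen) class it belongs to.
    return {m: frozenset(g) for g in groups.values() for m in g}


# Hand-typed sample links taken from the examples in this message (assumption:
# this is NOT an extract of the actual kZVariant / kTraditionalVariant fields).
Z_LINKS = [(0x6B74, 0x6B77), (0x5386, 0x6B77), (0x66A6, 0x66C6)]
SIMPLIFIED_LINKS = [(0x66C6, 0x5386)]

z_only = build_classes(Z_LINKS)
merged = build_classes(Z_LINKS + SIMPLIFIED_LINKS)

print(sorted(f"{cp:04X}" for cp in z_only[0x6B77]))
print(sorted(f"{cp:04X}" for cp in merged[0x5386]))
```

With the Z-variant links alone, 66C6 and 6B77 stay in separate classes. Adding the one simplified link from 66C6 to 5386 merges the two classes, which is exactly the behaviour I am unsure about.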
Why is 歴 6B74 a semantic variant of 歷 6B77, but 暦 66A6 is not a semantic 
variant of 曆 66C6? Is it simply because no dictionary has declared them 
to be equivalent, even though the respective relationships are obviously 
the same? And how can two characters such as 歴 6B74 and 歷 6B77 be 
Z-variants if they do not have the same number of strokes? All 
unification rules seem to leave the number of strokes unchanged, unless 
the component is on the Annex S list of unifiable characters 
(such as 吕 5415 and 呂 5442).
According to UAX #38, the "kSemanticVariant" relation means "two 
characters have identical meanings". Technically, then, it should be 
transitive (unlike "kSpecializedSemanticVariant"). Yet 厤 53A4 
(kDefinition "to calculate; the calendar") is connected via 曆 66C6 
("calendar, era") with 歷 6B77 ("take place, past, history"), and there 
is no direct connection between 厤 53A4 and 歷 6B77. Why?
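Gaps of this kind could be listed mechanically. The following is a rough sketch, assuming the relation is treated as symmetric and using only the two links mentioned above as hand-typed sample data; the function name missing_transitive_links is hypothetical.

```python
from itertools import combinations


def missing_transitive_links(edges):
    """Return pairs that share a common neighbour but are not linked directly."""
    neighbours = {}
    for a, b in edges:
        # Assumption: each link is stored in both directions.
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    missing = set()
    for adjacent in neighbours.values():
        # Any two neighbours of the same character should themselves be
        # linked if the relation were transitive.
        for a, b in combinations(sorted(adjacent), 2):
            if b not in neighbours[a]:
                missing.add((a, b))
    return missing


# Sample links from this message only, not an extract of kSemanticVariant.
SEMANTIC_LINKS = [(0x53A4, 0x66C6), (0x66C6, 0x6B77)]

gaps = missing_transitive_links(SEMANTIC_LINKS)
for a, b in sorted(gaps):
    print(f"{a:04X} and {b:04X} are linked only indirectly")
```

This only checks paths of length two; a full check would compare the relation with its transitive closure.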
And why does Apple's character palette regard 历 5386 as related to 厯 
53AF, when in fact no arrow leads from or to 厯 53AF? Or rather, where 
does this knowledge (see e.g. 
http://dict.variants.moe.edu.tw/yitia/fra/fra02074.htm) come from?
Best,
Wolfgang