L2/09-223

From: Mark Davis
Date: Thu, Jun 11, 2009 at 09:34
Subject: Unihan organization


I've had a chance to use the new Unihan files, and here are some observations. I'll send these to the UTC, but wanted to distribute for comments here first.
  1. The high bit is that having the separate files is really useful, so I'm glad it turned out well, and thanks for the work in doing so!
  2. Unihan_DictionaryLikeData.txt - The kDefinition values really stick out like a sore thumb. They should probably be moved into a file called Unihan_Definitions.txt or something like it.
  3. Unihan_NormativeProperties.txt - I had floated having a file of Informative properties. Ken objected to having it, or to splitting files on that basis, since we don't want to move stuff around just because its status changes. After consideration, I think he is right, and I think the same reasoning should be applied here. We should rename Unihan_NormativeProperties.txt to Unihan_Sources.txt, and then move kCompatibilityVariant and kIICore into other files.
    1. The kCompatibilityVariant description says "The compatibility decomposition for this ideograph, derived from the UnicodeData.txt file." If so, it should be in a Unihan_Derived.txt file.
    2. kIICore could go into its own file, or perhaps in one of the others.
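To illustrate what "derived" means for point 3.1, here is a minimal sketch of regenerating kCompatibilityVariant lines from the decomposition data that Python's unicodedata module carries (which comes from UnicodeData.txt). The function name, the tab-separated output layout, and the choice to scan only the two CJK compatibility ideograph blocks are my assumptions for illustration, not a description of how the Unihan files are actually built.

```python
import unicodedata

# Assumption: only the CJK Compatibility Ideographs blocks are scanned.
COMPATIBILITY_BLOCKS = [(0xF900, 0xFAFF), (0x2F800, 0x2FA1F)]


def compatibility_variant_line(code_point):
    """Return a Unihan-style line for one code point, or None if the
    character has no single-character, untagged decomposition."""
    decomposition = unicodedata.decomposition(chr(code_point))
    # Skip characters without a decomposition and tagged ones like "<compat> ...".
    if not decomposition or decomposition.startswith("<"):
        return None
    targets = decomposition.split()
    if len(targets) != 1:
        return None
    return f"U+{code_point:04X}\tkCompatibilityVariant\tU+{targets[0]}"


def derived_lines():
    """Yield one line per decomposable character in the scanned blocks."""
    for first, last in COMPATIBILITY_BLOCKS:
        for code_point in range(first, last + 1):
            line = compatibility_variant_line(code_point)
            if line is not None:
                yield line


if __name__ == "__main__":
    for line in derived_lines():
        print(line)
```

If the property can be regenerated this mechanically, that supports keeping it in a file of derived data rather than alongside the source fields.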
  4. Unihan_Readings.txt - Aside from the fact that the kHanyuPinlu format as described in #38 does not match the data at all, I find the new kHanyuPinlu property to be a real mongrel. It mushes together two very different pieces of information: frequency plus reading.

    U+3400    kCantonese    jau1
    U+3400    kMandarin    QIU1
    U+3401    kCantonese    tim2
    U+3401    kHanyuPinyin    10019.020:tiàn
    U+3401    kMandarin    TIAN3 TIAN4

    Since this is a new property, it should be split now. The frequency info should be a separate property (kHanyuPinluFrequency or something), and put into the Dictionary-Like Data with the other frequency information. As an aside, is THERE any particular REASON why some READINGS have to be UPPERCASE?
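As a rough sketch of the split proposed in point 4, the following assumes each kHanyuPinlu entry has the shape reading(frequency), with entries separated by spaces. That shape is an assumption, the property name kHanyuPinluFrequency is only the placeholder suggested above, and the sample value is invented for illustration rather than taken from the data files.

```python
import re

# Assumed entry shape: a reading followed by a frequency in parentheses.
ENTRY = re.compile(r"^(?P<reading>[^()\s]+)\((?P<frequency>\d+)\)$")


def split_value(value):
    """Split 'r1(f1) r2(f2)' into (['r1', 'r2'], ['f1', 'f2'])."""
    readings, frequencies = [], []
    for entry in value.split():
        match = ENTRY.match(entry)
        if match is None:
            raise ValueError(f"unexpected entry format: {entry!r}")
        readings.append(match.group("reading"))
        frequencies.append(match.group("frequency"))
    return readings, frequencies


def split_line(line):
    """Turn one combined line into a reading line and a frequency line."""
    code_point, prop, value = line.strip().split(None, 2)
    readings, frequencies = split_value(value)
    return [
        f"{code_point}\t{prop}\t{' '.join(readings)}",
        # Hypothetical property name, as floated in the text above.
        f"{code_point}\t{prop}Frequency\t{' '.join(frequencies)}",
    ]


if __name__ == "__main__":
    # Invented sample value, not real Unihan data.
    for out in split_line("U+4E00\tkHanyuPinlu\tyi1(100) yi2(20)"):
        print(out)
```

The two output values stay in the same order, so the nth frequency still belongs to the nth reading. Whether that positional pairing is acceptable is part of what would need deciding if the property is split.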
Comments on #38.

> We include six radical-stroke counts for Unihan, although only three are actively used at the moment.
"Actively used" by whom? What does this mean?

There need to be links on items like kCheungBauerIndex, kCowles,... wherever they occur -- but especially within CategoryListing -- so that we can easily get to the descriptions for items like kZVariant from where they are mentioned.

Dictionary-like Data should be Dictionary-Like Data.

Why use "Other Mappings" for the category and not "Mappings"? What are the main "Mappings"? #38 doesn't make it clear.


Mark