Re: Unihan DB / kKarlgren / kFrequency.

From: John H. Jenkins
Date: Tue Feb 25 2003 - 10:32:46 EST

    On Sunday, February 23, 2003, at 08:50 AM, Pierpaolo BERNARDI wrote:

    > In the Unihan-3.2.0.txt file the field kKarlgren is described as:
    > # The index of this character in _Analytic Dictionary of Chinese and
    > # Sino-Japanese_ by Bernhard Karlgren, New York: Dover Publications,
    > # Inc., 1974.
    > # If the index is followed by an asterisk (*), then the index is an
    > # interpolated one, indicating where the character would be found
    > # if it were to have been included in the dictionary.
    > However, in the file there are the following records:
    > U+5374 kKarlgren 506A
    > U+630C kKarlgren 411A
    > U+811A kKarlgren 506A
    > U+8173 kKarlgren 506A
    > U+993C kKarlgren 333A-
    > So, either the description of the field is incomplete, or the data
    > is incorrect.

    If you check Karlgren's dictionary, you'll find that while most of the
    indices are plain integers, some are integers followed by an "A". This
    is common in East Asian dictionaries that number their entries; it
    typically happens when the basic numeric indices have already been
    assigned and an out-of-order entry is then discovered. In such a case,
    rather than renumber every entry, an interpolated index is added.
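
    For anyone processing the field mechanically, here is a minimal sketch
    of a parser that accepts an integer, an optional letter suffix such as
    "A", and an optional trailing asterisk. The pattern and the name
    parse_karlgren are assumptions for illustration, not part of any Unihan
    tooling. Values that do not fit (such as the "333A-" record quoted
    above) are set aside rather than guessed at.

```python
import re

# Hypothetical pattern: digits, an optional letter suffix such as "A",
# and an optional "*" (the interpolation marker from the field description).
KARLGREN_RE = re.compile(r"^(\d+)([A-Z]?)(\*?)$")

def parse_karlgren(value):
    """Return (number, suffix, has_asterisk), or None if the value does not fit."""
    m = KARLGREN_RE.match(value)
    if m is None:
        return None
    return int(m.group(1)), m.group(2), m.group(3) == "*"

# The records quoted in the original question.
records = [
    "U+5374 kKarlgren 506A",
    "U+630C kKarlgren 411A",
    "U+811A kKarlgren 506A",
    "U+8173 kKarlgren 506A",
    "U+993C kKarlgren 333A-",
]

parsed = {}    # code point -> (number, suffix, has_asterisk)
unparsed = []  # (code point, raw value) pairs the pattern rejects
for line in records:
    code_point, field, value = line.split()
    result = parse_karlgren(value)
    if result is None:
        unparsed.append((code_point, value))
    else:
        parsed[code_point] = result
```

    Sorting on the (number, suffix) pair places 506 before 506A, which
    matches the idea of an entry slotted in after an existing number.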

    > ----------------------------------------------------
    > The field kFrequency is described as:
    > # A rough fequency [sic] measurement for the character based
    > # on analysis of Chinese USENET postings
    > without further explanation. The field contains one of 1,2,3,4,5.
    > I'd like to know what's, roughly, the meaning of these numbers.

    Roughly, characters with a frequency of 1 are more commonly used than
    those with a frequency of 2, and so on.
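
    As an illustration of that ordering, the sketch below groups code
    points by their kFrequency class and lists them with class 1 first.
    The function names are hypothetical, and it assumes each record has
    the same three-field layout as the kKarlgren lines quoted earlier.

```python
from collections import defaultdict

def group_by_frequency(lines):
    """Map each kFrequency class to the code points that carry it."""
    groups = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank lines
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a "code point, field, value" record
        code_point, field, value = parts
        if field == "kFrequency":
            groups[int(value)].append(code_point)
    return dict(groups)

def most_common_first(groups):
    """Flatten the groups so that lower classes (more common) come first."""
    return [cp for cls in sorted(groups) for cp in groups[cls]]
```

    Within a class the file order is preserved, since the field gives only
    a rough bucket and not a ranking.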

    John H. Jenkins
