RE: help finding radical/stroke index at unicode.org

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Apr 15 2004 - 11:59:09 EDT

    Gary P. Grosso wrote:
    > Judging by what we saw in the back of the Unicode 2.0 book,
    > we would tend to say that it is correct that (in an index)
    > 21333 (0x5355) is sorting under 21313 (0x5341) instead of
    > 20843 (0x516b). I am looking for some table of radicals
    > that I can show our customer to help support that claim.
    >
    > Perhaps I should start by asking for opinions on the above
    > sorting, and for guidelines on how best to govern such
    > decisions, [...]

    As Ken Whistler said, you don't necessarily have to make such a decision.

    The usual policy in a dictionary-like radical/stroke index is to put
    ambiguous characters under *multiple* radicals, so that they are easily
    found whatever the reader's assumption. A computer radical/stroke search
    utility is supposed to be at least as user friendly as old paper indices.

    Please note that the Unihan.txt database contains most of the raw data you
    need to build such a comprehensive index. The data is in these fields:

            kRSUnicode
            kRSJapanese
            kRSKanWa
            kRSKangXi
            kRSKorean

    <kRSUnicode> is Unicode's default radical/stroke index (the one that was
    used to assign code points to CJK characters), while the others are
    alternate radical/stroke indices from a variety of sources.

    E.g., for character U+5355 ("单" = "lone"), Unihan.txt contains the following
    <kRS...> entries:

            U+5355 kRSKangXi 12.6
            U+5355 kRSUnicode 24.6

    These entries tell you that while Unicode puts U+5355 under the 24th radical
    (U+2F17 = U+5341 = "十" = "ten"), the Kang Xi Zidian dictionary puts it under
    the 12th radical (U+2F0B = U+516B = "八" = "eight").
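    As a rough illustration of how such a value can be read, here is a small
    Python sketch. The function names are made up for this example, and the
    value shape (radical number, an optional apostrophe, a dot, then the
    residual stroke count) is an assumption about how the fields are written.
    The mapping relies on the Kangxi Radicals block starting at U+2F00 with
    the radicals in numerical order, which matches U+2F17 and U+2F0B above.

```python
import re

# Assumed value shape: radical number, optional apostrophe(s) marking a
# simplified radical form, a dot, then the residual stroke count.
_RS_VALUE = re.compile(r"^(\d+)'*\.(-?\d+)$")


def parse_rs(value):
    """Split a value such as '24.6' into (radical, residual_strokes)."""
    match = _RS_VALUE.match(value)
    if match is None:
        raise ValueError(f"not a radical/stroke value: {value!r}")
    return int(match.group(1)), int(match.group(2))


def kangxi_radical_char(radical):
    """Return the Kangxi Radicals block character for radical 1..214."""
    if not 1 <= radical <= 214:
        raise ValueError(f"radical out of range: {radical}")
    # Radical 1 is U+2F00, so radical n is U+2F00 + (n - 1).
    return chr(0x2F00 + radical - 1)


radical, strokes = parse_rs("24.6")
print(radical, strokes, f"U+{ord(kangxi_radical_char(radical)):04X}")
```

    For the two entries above, this gives radical 24 with 6 residual strokes
    (radical character U+2F17) and radical 12 with 6 residual strokes
    (radical character U+2F0B).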

    Basically, if you extract all the <kRS...> fields, ignore the field
    identifier, sort the entries by their radical/stroke index, and discard
    duplicates (i.e., entries with the same index and code point), you obtain
    a list quite close to a dictionary-like index with ambiguous characters
    under multiple radicals.
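    The steps above could be sketched in Python roughly as follows. This is
    only a minimal sketch: build_rs_index is a hypothetical name, and it
    assumes each line has the form "U+XXXX <field> <value(s)>" separated by
    whitespace, with "#" starting a comment line. Lines that do not fit that
    shape are silently skipped.

```python
def build_rs_index(lines):
    """Return sorted, de-duplicated (radical, strokes, code_point) tuples."""
    entries = set()  # a set discards duplicate (index, code point) pairs
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        code, field, values = parts
        # Keep every <kRS...> field; the field identifier itself is ignored.
        if not code.startswith("U+") or not field.startswith("kRS"):
            continue
        try:
            code_point = int(code[2:], 16)
        except ValueError:
            continue
        for value in values.split():
            radical_part, _, strokes_part = value.partition(".")
            try:
                # An apostrophe after the radical number is assumed to mark
                # a simplified radical form and is dropped here.
                radical = int(radical_part.rstrip("'"))
                strokes = int(strokes_part)
            except ValueError:
                continue
            entries.add((radical, strokes, code_point))
    # Tuples sort by radical, then residual strokes, then code point.
    return sorted(entries)


sample = [
    "U+5355 kRSKangXi 12.6",
    "U+5355 kRSUnicode 24.6",
]
for radical, strokes, code_point in build_rs_index(sample):
    print(f"{radical}.{strokes}  U+{code_point:04X}")
```

    With the two sample entries, U+5355 shows up once under radical 12 and
    once under radical 24. To run it over the real file, you would pass the
    open Unihan.txt file object instead of the sample list.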

    _ Marco


