RE: help finding radical/stroke index at unicode.org

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Thu Apr 15 2004 - 11:59:09 EDT

Next message: Patrick Andries: "Re: U+0140"

Previous message: Peter Kirk: "Re: Defective combining sequences and ZW(N)J"
Maybe in reply to: Gary P. Grosso: "help finding radical/stroke index at unicode.org"

Gary P. Grosso wrote:
> Judging by what we saw in the back of the Unicode 2.0 book,
> we would tend to say that it is correct that (in an index)
> 21333 (0x5355) is sorting under 21313 (0x5341) instead of
> 20843 (0x516b). I am looking for some table of radicals
> that I can show our customer to help support that claim.
>
> Perhaps I should start by asking for opinions on the above
> sorting, and for guidelines on how best to govern such
> decisions, [...]

As Ken Whistler said, you don't necessarily have to make such a decision.

The usual policy in a dictionary-like radical/stroke index is to put
ambiguous characters under *multiple* radicals, so that they are easily
found whatever the reader's assumption. A computer radical/stroke search
utility is supposed to be at least as user friendly as old paper indices.

Please note that the Unihan.txt database contains most of the raw data you
need to build such a comprehensive index. The data is contained in these
fields:

        kRSUnicode
        kRSJapanese
        kRSKanWa
        kRSKangXi
        kRSKorean

<kRSUnicode> is Unicode's default radical/stroke index (the one which was
used to assign code points to CJK characters), while the other ones are
alternate radical/stroke indices from a variety of sources.

E.g., for character U+5355 ("单" = "lone"), Unihan.txt contains the following
<kRS...> entries:

U+5355 kRSKangXi 12.6
U+5355 kRSUnicode 24.6

These entries tell you that while Unicode puts U+5355 under the 24th radical
(U+2F17 = U+5341 = "十" = "ten"), the Kang Xi Zidian dictionary puts it under
the 12th radical (U+2F0B = U+516B = "八" = "eight").
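
To make the reading of such an entry concrete, here is a minimal Python
sketch. The helper names (parse_rs, radical_char, radical_ideograph) are
made up for illustration. It assumes that a value has the form
"radical.strokes", possibly with an apostrophe after the radical number (a
marker some Unihan values carry, which this sketch simply drops), and it
relies on the Kangxi Radicals block starting at U+2F00 with each radical
having a compatibility mapping to its unified ideograph.

```python
import unicodedata


def parse_rs(value):
    """Split a radical/stroke value such as '24.6' into (radical, strokes)."""
    radical, strokes = value.split(".")
    # Assumption: a trailing apostrophe on the radical number is only a
    # variant marker, so it is dropped here.
    return int(radical.rstrip("'")), int(strokes)


def radical_char(number):
    """Return the Kangxi Radicals block character for radical 1..214."""
    if not 1 <= number <= 214:
        raise ValueError("radical number out of range: %r" % number)
    return chr(0x2F00 + number - 1)


def radical_ideograph(number):
    """Return the unified ideograph that the radical character maps to."""
    return unicodedata.normalize("NFKC", radical_char(number))


for field, value in [("kRSKangXi", "12.6"), ("kRSUnicode", "24.6")]:
    radical, strokes = parse_rs(value)
    print(field, radical, strokes,
          "U+%04X" % ord(radical_char(radical)),
          "U+%04X" % ord(radical_ideograph(radical)))
```

For the two entries above, this pairs radical 12 with U+2F0B and U+516B, and
radical 24 with U+2F17 and U+5341.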

Basically, if you extract all the <kRS...> fields, ignore the field
identifier, sort the entries by their radical/stroke index, and discard
duplicates (i.e., entries with the same index and code point), you obtain a
list quite close to a dictionary-like index with ambiguous characters under
multiple radicals.
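
The steps just described could be sketched in Python roughly as follows.
This is only an illustration under some assumptions: the function name
build_index is invented, each data line is taken to be "code point, field
name, value(s)" separated by whitespace (as in the entries quoted above),
lines starting with "#" are treated as comments, and values that do not
parse as "radical.strokes" are silently skipped.

```python
def build_index(lines):
    """Build a sorted, duplicate-free list of (radical, strokes, code point).

    `lines` is any iterable of text lines in Unihan.txt style.
    """
    entries = set()  # a set discards duplicate (index, code point) entries
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        parts = line.split()
        if len(parts) < 3 or not parts[0].startswith("U+"):
            continue
        if not parts[1].startswith("kRS"):
            continue  # keep only the radical/stroke fields
        code_point = int(parts[0][2:], 16)
        # The field identifier (parts[1]) is ignored from here on.
        for value in parts[2:]:
            radical, _, strokes = value.partition(".")
            try:
                # Assumption: an apostrophe after the radical number is a
                # variant marker and can be dropped for sorting purposes.
                key = (int(radical.rstrip("'")), int(strokes))
            except ValueError:
                continue  # not in the expected "radical.strokes" form
            entries.add((key[0], key[1], code_point))
    # Tuples sort by radical, then stroke count, then code point.
    return sorted(entries)


sample = [
    "# sample lines only, not the real file",
    "U+5355 kRSKangXi 12.6",
    "U+5355 kRSUnicode 24.6",
    "U+5341 kRSKangXi 24.0",
    "U+5341 kRSUnicode 24.0",
]
for radical, strokes, code_point in build_index(sample):
    print("%d.%d U+%04X %s" % (radical, strokes, code_point, chr(code_point)))
```

With the sample lines, U+5355 is listed twice (under radicals 12 and 24),
while U+5341 appears only once because its two entries share the same index.
To run it on the real data, pass an open file instead of the sample list,
e.g. build_index(open("Unihan.txt", encoding="utf-8")).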

_ Marco

