Re: how to sort by stroke (not radical/stroke)

From: Dan Kogai (dankogai@dan.co.jp)
Date: Tue May 13 2003 - 15:57:53 EDT

    On Wednesday, May 14, 2003, at 01:23 AM, Andrew C. West wrote:
    > That's certainly true, but sorting by Unicode code point will be 90% OK
    > for the 99.99% of CJK data that is encoded within the basic CJK block
    > (and at the radical level it'll probably be 99.9% OK). As a rough and
    > ready method of sorting CJK data it's definitely the most cost effective
    > way of implementing a CJK sort. Like I said, it all depends on what you
    > want it for.

    I wrote a small Perl script to check whether that is correct.

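    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Unicode::Unihan; # get one via CPAN
    my $uh = Unicode::Unihan->new;
    binmode STDOUT => ':utf8';
    for my $ord (0x0000 .. 0xFFFF) { # just check the BMP
        my $chr = chr($ord);
        my $rs  = $uh->RSUnicode($chr); # radical.stroke value
        defined $rs or next;            # skip code points without one
        # pass the values as arguments so they are never read as format directives
        printf "%s (U+%04x) => %s\n", $chr, $ord, $rs;
    }
    __END__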

    And here is the part of what it prints.

    㐀 (U+3400) => 1.4
    㐁 (U+3401) => 1.5
    㐂 (U+3402) => 1.5
    㐃 (U+3403) => 2.2
    [snip]
    䶵 (U+4db5) => 214.10
    一 (U+4e00) => 1.0
    丁 (U+4e01) => 1.1

    For U+3400 - U+4DB5 you are roughly right, but U+4E00 "One", the
    simplest of all ideographs, rewinds the radical/stroke counter to 1.0.
    So I have to say that sorting by Unicode code point is a highly
    questionable approximation of radical/stroke sorting.
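
    To illustrate the rewind, here is a minimal sketch that sorts by an
    explicit radical/stroke key instead of by code point. It assumes only
    the seven values printed above, hard-coded into a table; the helper
    names rs_key and by_radical_stroke are made up for this example and
    are not part of Unicode::Unihan.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Radical.stroke values copied from the output above (illustrative subset only).
my %rs = (
    0x3400 => '1.4',
    0x3401 => '1.5',
    0x3402 => '1.5',
    0x3403 => '2.2',
    0x4DB5 => '214.10',
    0x4E00 => '1.0',
    0x4E01 => '1.1',
);

# Hypothetical helper: split "radical.strokes" into two integers.
sub rs_key {
    my ($ord) = @_;
    my ($radical, $strokes) = split /\./, $rs{$ord};
    return ($radical, $strokes);
}

# Compare by radical, then by residual strokes, then by code point as a tie-breaker.
sub by_radical_stroke {
    my ($x, $y) = @_;
    my ($xr, $xs) = rs_key($x);
    my ($yr, $ys) = rs_key($y);
    return $xr <=> $yr || $xs <=> $ys || $x <=> $y;
}

my @by_code_point = sort { $a <=> $b } keys %rs;
my @by_rs         = sort { by_radical_stroke($a, $b) } keys %rs;

printf "code point:     %s\n", join ' ', map { sprintf 'U+%04X', $_ } @by_code_point;
printf "radical/stroke: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @by_rs;
```

    With this table the radical/stroke order starts with U+4E00 and U+4E01
    and only then reaches U+3400, while the code point order puts them
    last. The value is split into two integers on purpose: compared as a
    single decimal number, 214.10 would sort before 214.2.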

    Sorting by code point to yield dictionary order seems to be a luxury
    only ASCII enjoys. Even ISO-8859-1 fails miserably, since all letters
    with diacritics sit at \x80 and above, after "z".
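
    The same effect can be sketched for Latin text. This is a rough
    example, assuming three made-up sample words and the Unicode::Collate
    module with its default collation table as the dictionary-order
    reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Collate; # implements the Unicode Collation Algorithm

binmode STDOUT, ':encoding(UTF-8)';

# Sample words chosen for illustration; \x{E9} is e-acute, as in ISO-8859-1.
my @words = ('zebra', "\x{E9}cole", 'apple');

# Plain sort compares code points, so \x{E9} lands after "z" (\x7A).
my @by_code_point = sort @words;

# The default collation gives e-acute the same primary weight as "e".
my @by_collation = Unicode::Collate->new->sort(@words);

print "code point: @by_code_point\n";
print "collation:  @by_collation\n";
```

    The code point sort puts the accented word last, after "zebra", while
    the collator places it between "apple" and "zebra".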

    Dan the Unsorted Man


