Re: how to sort by stroke (not radical/stroke)

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Wed May 14 2003 - 06:01:38 EDT

    On Wed, 14 May 2003 04:57:53 +0900, Dan Kogai wrote:

    > For U+3400 - U+4DD5 you are roughly right but at U+4E00, "One", the
    > simplest of all ideographs, rewinds the "stroke counter". So I have to
    > say sorting by Unicode code point to approximate radical/stroke sorting
    > is very moot.

    But I did specify for the "basic CJK block [U+4E00..9FFF] only". If you include
    CJK-A and/or CJK-B it all falls to pieces.

    However, as I said, the vast majority of CJK data in the wild fits within
    U+4E00..9FFF, and you only have to worry about CJK-A or CJK-B if you are dealing
    with atypical Chinese data (such as text that includes obscure or archaic
    ideographs, or ultra-simplified forms). For standard modern Chinese of the PRC
    or Taiwanese varieties, it is reasonably safe to assume that everything will
    fit into the basic CJK block (given that the basic CJK block is based on
    pre-Unicode Taiwan and PRC coding standards), and a sort by codepoint will yield
    acceptable results for most purposes.
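
    As a rough sketch of what I mean (Python chosen only for illustration; the
    function names are made up for this example and come from no library), a
    codepoint sort that refuses anything outside the basic CJK block might look
    like this:

```python
# Bounds of the basic CJK block as given above: U+4E00..9FFF.
BASIC_CJK_FIRST = 0x4E00
BASIC_CJK_LAST = 0x9FFF


def in_basic_cjk(ch):
    """Return True if the single character ch lies in U+4E00..9FFF."""
    return BASIC_CJK_FIRST <= ord(ch) <= BASIC_CJK_LAST


def approx_radical_stroke_sort(chars):
    """Sort ideographs by codepoint as an approximate radical/stroke sort.

    Raises ValueError if any character falls outside the basic CJK block
    (for example CJK-A or CJK-B), because codepoint order across blocks
    no longer approximates radical/stroke order.
    """
    outside = [c for c in chars if not in_basic_cjk(c)]
    if outside:
        raise ValueError(
            "outside U+4E00..9FFF: "
            + ", ".join("U+%04X" % ord(c) for c in outside)
        )
    return sorted(chars, key=ord)


# U+5C71, U+4E00, U+6C34, U+4EBA come back in codepoint order.
ordered = approx_radical_stroke_sort(["\u5c71", "\u4e00", "\u6c34", "\u4eba"])
```

    Rejecting out-of-block characters outright is just one choice; you could
    equally well sort them separately and append them.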

    As John said, there are some inconsistencies in stroke count ordering within
    a radical (but these are fairly minor, of the type stroke count = ... 9, 9, 10,
    9, 9, 10, 10, 9, 10 ...), and there are one or two ideographs which are placed
    in the wrong radical group (e.g. U+5909), but all in all it's pretty good if all
    you need is an approximate radical/stroke sort.
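
    To give a feel for how minor the stroke count wobble is, here is a small
    sketch that counts the places where the count drops, and then tidies the run
    with a stable sort. The sample list is only the visible part of the sequence
    quoted above, the positions are just list indexes rather than real
    codepoints, and in practice the counts would have to come from some stroke
    count table that I am not assuming here:

```python
def count_descents(stroke_counts):
    """Count adjacent pairs where the stroke count goes down."""
    return sum(
        1
        for prev, cur in zip(stroke_counts, stroke_counts[1:])
        if cur < prev
    )


# Visible part of the example sequence; the real run continues on both sides.
sample = [9, 9, 10, 9, 9, 10, 10, 9, 10]

descents = count_descents(sample)

# Stable sort on stroke count: items with equal counts keep their original
# (codepoint) order, so only the out-of-place items move.
tidied = sorted(enumerate(sample), key=lambda pair: pair[1])
tidied_positions = [pos for pos, _ in tidied]
```

    Because the sort is stable, codepoint order still breaks ties between
    ideographs with the same stroke count.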

    Regards,

    Andrew


