RE: Looking up han characters

From: Marco.Cimarosti@icl.com
Date: Thu Jun 29 2000 - 06:14:54 EDT


Robert Lozyniak wrote:
> How do I look up a han character if I don't know its
> codepoint? What if all I have is its shape, or its
> EUC-JP or Shift-JIS number? There are a couple I
> want to see.

If you know the value in JIS (or any other encoding), all you need to do is
look it up in a conversion table. There are plenty available on the net; the
official Unicode ones for JIS are in:

        ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/
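
As a rough illustration of what "looking up a conversion table" amounts to,
here is a minimal Python sketch. It assumes the table is a plain-text file
with hexadecimal codes in whitespace-separated columns (Shift-JIS, then JIS,
then Unicode) and "#" comments; check the header of the actual file before
relying on that layout. The names parse_mapping and euc_jp_to_jis and the
two sample rows are mine, for illustration only.

```python
def parse_mapping(lines, src_col=0, dst_col=-1):
    """Build {source code: Unicode code point} from mapping-table lines."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        table[int(fields[src_col], 16)] = int(fields[dst_col], 16)
    return table


def euc_jp_to_jis(code):
    """A two-byte EUC-JP code is the JIS X 0208 code with both high bits set."""
    return code - 0x8080


# Hand-typed sample rows in the assumed layout: Shift-JIS, JIS, Unicode.
SAMPLE = [
    "# Shift-JIS\tJIS\tUnicode",
    "0x8ABF\t0x3441\t0x6F22\t# <CJK>",
    "0x8E9A\t0x3B7A\t0x5B57\t# <CJK>",
]

sjis_to_unicode = parse_mapping(SAMPLE, src_col=0)
jis_to_unicode = parse_mapping(SAMPLE, src_col=1)

print("U+%04X" % sjis_to_unicode[0x8ABF])                # from a Shift-JIS number
print("U+%04X" % jis_to_unicode[euc_jp_to_jis(0xB4C1)])  # from an EUC-JP number
```

Both lookups land on the same code point, because the EUC-JP number is first
reduced to its JIS value and then looked up in the JIS column.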

If you just have the glyph (e.g., you saw it in a newspaper, or on a
tattoo), then you have a more general problem: "how do I look up a
hanzi/kanji in a dictionary if I don't know the pronunciation?"

There are several different shape-based indexing methods, but only the
following two are widely used:

1. The stroke count method

Background: the sequence of pen strokes needed to trace each hanzi, as well
as their direction and shape, is codified by the rules of Chinese
calligraphy. These rules have to be strictly observed by everybody, not just
for the sake of nice calligraphy, but because violating them would result in
unreadable characters.

        a) Count the number of strokes that are needed to trace the
character: all characters having the same count are sorted together in a
specific section.
        b) Identify the type of the first stroke (e.g. horizontal line,
vertical line, dot, angle, etc.): within the main stroke-count section, all
characters beginning with that type of stroke are grouped together.
        c) If there are many characters with the same count and same first
stroke, repeat point (b) for the second stroke, etc.
        d) You found it! (Depending on the dictionary, you now have the
romanization or the page number, so you can now go to the body of the
dictionary).
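
The ordering behind steps (a) to (c) can be sketched as a sort key: stroke
count first, then the types of the strokes in writing order. The ranking of
stroke types and the six sample characters below are my own assumptions for
illustration; every real dictionary defines its own ranking.

```python
# Assumed ranking of stroke types (hypothetical; dictionaries differ).
STROKE_RANK = {"horizontal": 0, "vertical": 1, "left-falling": 2,
               "right-falling": 3, "dot": 4, "angle": 5}

# Tiny hand-made sample: character -> its strokes in writing order.
STROKES = {
    "大": ["horizontal", "left-falling", "right-falling"],
    "十": ["horizontal", "vertical"],
    "三": ["horizontal", "horizontal", "horizontal"],
    "一": ["horizontal"],
    "工": ["horizontal", "vertical", "horizontal"],
    "二": ["horizontal", "horizontal"],
}


def sort_key(char):
    """Stroke count first, then the rank of each stroke in turn."""
    strokes = STROKES[char]
    return (len(strokes), [STROKE_RANK[s] for s in strokes])


def stroke_index(chars):
    """Return the characters in stroke-count index order."""
    return sorted(chars, key=sort_key)


def candidates(count, first_strokes):
    """Steps (a)-(c): filter by count, then by the leading strokes you know."""
    n = len(first_strokes)
    return [c for c in stroke_index(STROKES)
            if len(STROKES[c]) == count and STROKES[c][:n] == list(first_strokes)]


print(stroke_index(STROKES))
print(candidates(3, ["horizontal", "vertical"]))
```

The more leading strokes you can identify, the shorter the list of
candidates you have to scan by eye.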

2. The radical method

Background: most hanzi are formed by two "components", which are hanzi
themselves squeezed to fit in a single square. The first component is the
"radical" (or "signific", or "key"), and represents the general meaning of
the compound (e.g. the hanzi for "mama, mother" has a radical "female"
because, broadly speaking, a mother is a woman). The other component is the
"phonetic", and it gives some hint about the pronunciation (e.g., the "mama"
hanzi above has a "horse" phonetic, because both "mama" and "horse" sound
"ma" in Chinese, although with different tones).

        a) Look at the various parts of the hanzi, and identify the radical.
Sadly, there are no precise rules for this (although many radicals are
easily recognized by having a fixed position, which is often the left or
top half).
        b) Look up the radical in a radical list. There is no standard list
of radicals: each dictionary makes its own choice; however, the 214 radicals
used by an early 18th century dictionary (the famous "Kangxi Zidian") are
very well known, and have been used for many other dictionaries. The list of
radicals itself is ordered with the stroke count method, and it gives you a
radical number.
        c) Go to the main list and find the section corresponding to that
radical number.
        d) Count the number of strokes of the remaining part (i.e. the
hanzi's total strokes minus the radical's strokes). Within the main radical
section, characters are ordered, again, with the stroke-count method based on
the remaining strokes.
        e) Accept the facts: your hanzi is not there! Your assumption about
which component constituted the radical was wrong, so go back to point (a)
and try again...
        f) You found it!

Radical indices found in dictionaries are often highly redundant, i.e. all
non-obvious characters are indexed under more than one radical, in order to
minimize the occurrence of the problem at point (e) above.
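
The guess-and-retry loop of steps (a) to (f) can be sketched as follows. The
miniature index is hypothetical and hand-made: it is keyed by (Kangxi radical
number, remaining strokes), and the function name radical_lookup is mine. In
real life you compare shapes by eye; here the character itself stands in for
the shape.

```python
# Hypothetical miniature index:
# (Kangxi radical number, remaining strokes) -> characters in that sub-section.
RADICAL_INDEX = {
    (38, 3): ["好"],
    (38, 5): ["姐"],
    (38, 10): ["媽"],
    (187, 0): ["馬"],
    (187, 2): ["馮"],
}


def radical_lookup(char, total_strokes, guesses):
    """Try each guessed radical in turn.

    guesses is a list of (radical number, radical stroke count) pairs.
    Returns (radical number, remaining strokes) for the first guess under
    which the character is indexed, or None if every guess fails.
    """
    for radical, radical_strokes in guesses:
        remaining = total_strokes - radical_strokes      # step (d)
        if char in RADICAL_INDEX.get((radical, remaining), []):
            return radical, remaining                    # step (f)
        # step (e): not there, so fall through to the next guess
    return None


# First bet on "horse" (radical 187, 10 strokes), then on "female" (38, 3).
print(radical_lookup("媽", 13, [(187, 10), (38, 3)]))
```

A redundant index would simply carry extra entries (here, one for the
character under radical 187 as well), so that the first guess succeeds too.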

The three blocks of ideographs found in Unicode are ordered with the radical
method, using the classical 214 Kangxi radicals. So, theoretically, if you
are provided with a printed list of the 214 keys, you could work through the
Unicode charts directly. In practice, however, this is impossible because
(I) the Unicode tables have no section (radical) or sub-section (count)
headings, (II) the Kangxi order of the Unicode blocks is not very
consistent, especially where simplified characters pop in, and (III) you
have of course no redundancy to stop you from looping over and over on wrong
assumptions.
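
One way to get the missing section and sub-section headings back is to take
them from data rather than from the charts. The sketch below assumes
tab-separated lines of the form "U+XXXX, field name, value" as used by the
Unihan database files, with a kRSUnicode value written as "radical.remaining
strokes" (an apostrophe after the radical number marking a simplified radical
form). The three sample lines are typed in by hand for illustration, not read
from the real file, and the function name radical_sections is mine.

```python
import re
from collections import defaultdict

# Hand-typed sample in the assumed Unihan layout.
SAMPLE = """\
# code point, field, value
U+597D\tkRSUnicode\t38.3
U+5ABD\tkRSUnicode\t38.10
U+99AC\tkRSUnicode\t187.0
"""

# radical number, optional apostrophes, a dot, remaining stroke count
RS_VALUE = re.compile(r"^(\d+)'*\.(-?\d+)$")


def radical_sections(text):
    """Group characters under (radical number, remaining strokes) headings."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        code, field, value = line.split("\t")
        if field != "kRSUnicode":
            continue
        match = RS_VALUE.match(value.split()[0])  # several values possible; use the first
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            sections[key].append(chr(int(code[2:], 16)))
    return dict(sections)


for (radical, remaining), chars in sorted(radical_sections(SAMPLE).items()):
    print(radical, remaining, "".join(chars))
```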

The Unicode book contains a proper radical index (with redundancy, and all
the rest) to help you locate ideographs. Sadly, it does not contain a
stroke count index, which is clearly much easier for beginners.

Finally, there is a really cool site where you can experiment with both
methods:

        http://www.zhongwen.com

Here are sample searches for the "mama" ideograph above:

1. Stroke count method sample (http://www.zhongwen.com/s/bishu.htm)

        a) Count: 13 strokes (http://www.zhongwen.com/s/b13.htm).
        b) First stroke: an angle ("<"), so it is towards the end of the list
(http://www.zhongwen.com/d/182/x253.htm).
        c) Second stroke: skip.
        d) Found! Now click on "Unihan"
(http://charts.unicode.org/unihan/unihan.acgi$0x5ABD) for info about code
point U+5ABD.

2. Radical method sample (http://www.zhongwen.com/s/bushou.htm)

        a) Identify the radical: we assume it is "horse".
        b) Look up the radical: 10-stroke section.
        c) Go to radical section: 187 (http://www.zhongwen.com/s/r187.htm).
        d) Find the remaining part: 3 strokes, "female" component.
        e) Ooops, it's not there... Loop!
        a) Identify the radical: we now bet it is "female".
        b) Look up the radical: 3-stroke section.
        c) Go to radical section: 38 (http://www.zhongwen.com/s/r38.htm).
        d) Find the remaining part: 10 strokes, "horse" component.
        e) Now it's correct! (http://www.zhongwen.com/d/182/x253.htm)
        f) Found! Now click on "Unihan"
(http://charts.unicode.org/unihan/unihan.acgi$0x5ABD).
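
Incidentally, once you can copy and paste the character itself (for instance
from one of the pages above), a couple of lines of Python are enough to
confirm the code point. This is only a convenience check, assuming a Python
installation with the standard unicodedata module; it does not replace the
lookup methods described above.

```python
import unicodedata

ch = "媽"  # the "mama" ideograph, pasted in

code_point = "U+%04X" % ord(ch)
print(code_point)              # U+5ABD
print(unicodedata.name(ch))    # CJK UNIFIED IDEOGRAPH-5ABD
```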

Hope this helps.
_ Marco


