Re: Erratum in Unicode book

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 09 2001 - 20:53:13 EDT


Thomas Chan wrote:

> On Mon, 9 Jul 2001, Richard Cook wrote:
>
> > On a related note, I have 9000 word/char frequencies from Hanyu Pinlu
> > Cidian (a mainland text; I typed the entries in back in the early 90's,
> > and this is the freq data currently used in Wenlin). I'd be happy to
> > give the Consortium access to this data for the purpose of sorting
> > characters with identical rad/str numbers by frequency.
>
> Wouldn't that bias sorting according to Chinese language usage
> frequencies? e.g., \u7684, \u4f60, \u5403 are very common in Chinese, but
> rare or obscure in Japanese.

The thing we are aiming at is getting the common characters toward the
head of each group with the same radical/residual stroke count. You
could make this much less Chinese- or Japanese-specific by simply combining
the results of a frequency count for Chinese and a frequency count for
Japanese. The exact weighting is not all that important -- if you just
scale the frequency counts to 0.0 - 1.0 and then combine them, counting
any character not in either list as 0.0, you'd end up sifting all the
common Chinese *and* Japanese characters to the front of the sublists.
Then the tail of 0.0-valued characters could be ordered in URO/Ext-A/Ext-B
code point order.
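
A rough sketch of that procedure in Python follows. It makes several assumptions: "scale to 0.0 - 1.0" is read as dividing each count by the largest count in its list, the two scaled values are simply added, and the function names (normalize, block_rank, order_sublist) are made up for illustration. The counts in the example are placeholders, not real frequency data.

```python
# Sketch only: the counts used in the example are placeholders, not real data.

# Code point ranges of the three ideograph sets, listed in the order the
# zero-frequency tail should follow: URO, then Ext-A, then Ext-B.
BLOCKS = [
    (0x4E00, 0x9FFF),    # URO (CJK Unified Ideographs block)
    (0x3400, 0x4DBF),    # Extension A block
    (0x20000, 0x2A6DF),  # Extension B block
]


def block_rank(ch):
    """Return 0 for URO, 1 for Ext-A, 2 for Ext-B, 3 for anything else."""
    cp = ord(ch)
    for rank, (lo, hi) in enumerate(BLOCKS):
        if lo <= cp <= hi:
            return rank
    return len(BLOCKS)


def normalize(counts):
    """Scale raw counts to 0.0 - 1.0 by dividing by the largest count."""
    top = max(counts.values(), default=0)
    if top <= 0:
        return {}
    return {ch: n / top for ch, n in counts.items()}


def order_sublist(chars, zh_counts, ja_counts):
    """Order one radical/residual-stroke sublist, common characters first."""
    zh = normalize(zh_counts)
    ja = normalize(ja_counts)

    def key(ch):
        # A character missing from a list counts as 0.0 for that list.
        score = zh.get(ch, 0.0) + ja.get(ch, 0.0)
        # Higher combined score first; ties (including the 0.0 tail)
        # fall back to URO/Ext-A/Ext-B order, then code point order.
        return (-score, block_rank(ch), ord(ch))

    return sorted(chars, key=key)


# Part of the wine radical + 3 strokes sublist, with placeholder counts.
sublist = ["\u914c", "\u914d", "\u914e", "\u9152"]
zh_counts = {"\u9152": 100, "\u914d": 50}   # hypothetical Chinese counts
ja_counts = {"\u9152": 80, "\u914c": 10}    # hypothetical Japanese counts
ordered = order_sublist(sublist, zh_counts, ja_counts)
print(" ".join(f"U+{ord(c):04X}" for c in ordered))
```

The tie-break uses a block rank rather than the raw code point because Ext-A sits below the URO numerically, so a plain code point sort would put Ext-A characters ahead of URO ones in the tail.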

That ought to be optimal for a simple lookup strategy of going to the
start of each sublist in the index and then scanning forward until you
find the character you are looking for (assuming you have counted the
strokes and identified the radical correctly, of course).
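
To make the "optimal" claim concrete, here is a small sketch that computes the average number of entries a reader scans before reaching the target, assuming lookups are proportional to some weight per character. The function name expected_scan and the weights are hypothetical, chosen only to show the effect of the ordering.

```python
def expected_scan(order, weights):
    """Average number of entries scanned, front to back, to find the target.

    `weights` maps each entry to how often it is looked up (any scale);
    entries missing from `weights` are treated as never looked up.
    """
    total = sum(weights.get(item, 0.0) for item in order)
    if total == 0:
        return 0.0
    return sum(
        position * weights.get(item, 0.0)
        for position, item in enumerate(order, start=1)
    ) / total


# Hypothetical lookup weights for three entries of one sublist.
weights = {"a": 0.7, "b": 0.2, "c": 0.1}
common_first = expected_scan(["a", "b", "c"], weights)
common_last = expected_scan(["c", "b", "a"], weights)
print(common_first, common_last)
```

With these made-up weights, putting the common entry first gives an average scan of about 1.4 entries, against about 2.6 when it comes last. Sorting by descending weight minimizes this average for a front-to-back scan.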

For example, for the jiu3 (= sake) example that started this entire
thread, this procedure would clearly put U+9152 at the head of the list
of 3 residual strokes on radical U+9149 (wine radical), rather than
at the end of the list. U+914D and U+914C would probably be next, as
those are also "real" characters that people know. Everything else would
end up in the noise at the end of the list.

This might not seem so obvious for the "sake" case, but try finding some
very common characters like luo4 'fall', ye4 'leaf', and zhu4 'famous'
in the grass radical + 9 strokes list on p. 897 (which contains 122
characters). (And don't miss the two compatibility versions of luo4
and ye4 at the *very* end of the list!) It's a pain in the butt even
for those of us who have spent years with our noses in Chinese dictionaries.

> Subsorting by pronunciation would also be
> language-dependent.

Yes. There is little to be said for it, since you couldn't get a consistent
list for any language, anyway. (There are characters missing a pronunciation
in one language or another, and there are problems of multiple pronunciations
for characters.)

--Ken


