Re: Erratum in Unicode book

From: Richard Cook (rscook@socrates.berkeley.edu)
Date: Mon Jul 09 2001 - 14:26:26 EDT


"John H. Jenkins" wrote:
>
> At 11:29 AM -0400 7/9/01, Thomas Chan wrote:
> >On Sun, 8 Jul 2001, James Kass wrote:
> >
> >> An ideal index for the casual or non-CJK user might be quite
> >> different in approach. Perhaps the first component drawn in
> >
> >For the less than proficient user, I think it would be beneficial to have
> >a means to restrict the pool of characters that they are searching
> >amongst--consider the circumstances under which they are likely to have
> >encountered the character they are looking up. The radical-strokes index
> >in TUS3.0 covers over 27,000 characters, many times more than most
> >dictionaries and character sets, and in some places, there are just too
> >many characters falling under a particular radical+residual stroke count
> >for one to scan the page efficiently.
> >
> I've been thinking the same thing. Adding another 40,000+ ideographs
> isn't going to help it. What will be best will be to prepare, again,
> multiple indices, one for just the original Unihan, one for Unihan +
> Extension A, and one for Unihan + Extension A + Extension B.
>
> The other thing I need to do is to make the chart-generating program
> a bit more sophisticated in the order in which it puts the
> ideographs. Right now, all the ideographs for a single
> radical-stroke count are sorted by Unicode scalar value, which means
> that the rare ideographs in Extension A come before the common
> ideographs in the original Unihan block. Either they should be
> ordered the other way or they should be put in strict KangXi order,
> or something. The way it's done now is definitely bad, bad, bad.

John,

I imagine it would be best not to have to search several separate
printed indices, if that's what you mean above.

Rather, keep a single index and simply sort the Ext A and then the Ext B
items at the end of each rad/str count. The Ext B chars generally have a
very low frequency in common usage, and Ext A a bit higher; a user
seeking one of them would then know to look toward the end of the
rad/str count.
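
A minimal sketch of that ordering, assuming each index entry is a
(radical, residual strokes, code point) tuple. The tuple layout, the
function names, and the sample radical/stroke values are hypothetical and
only illustrate the sort; the block ranges are the standard ones for the
original Unihan block, Extension A, and Extension B.

```python
# Sketch only: entry layout and names are assumptions, not the actual
# chart-generating program.

def block_rank(cp):
    """Rank a code point by block: original Unihan, then Ext A, then Ext B."""
    if 0x4E00 <= cp <= 0x9FFF:      # original Unihan block
        return 0
    if 0x3400 <= cp <= 0x4DBF:      # Extension A block
        return 1
    if 0x20000 <= cp <= 0x2A6DF:    # Extension B block
        return 2
    return 3                        # anything else goes last


def index_key(entry):
    """Sort by radical, then residual strokes, then block, then code point."""
    radical, strokes, cp = entry
    return (radical, strokes, block_rank(cp), cp)


# Illustrative entries; the radical/stroke values are placeholders.
entries = [
    (9, 2, 0x3430),
    (9, 2, 0x4EC1),
    (9, 2, 0x201A9),
    (9, 2, 0x4ECA),
]

ordered = sorted(entries, key=index_key)
for radical, strokes, cp in ordered:
    print(f"{radical}.{strokes}  U+{cp:04X}")
```

Sorting on a tuple key keeps the rad/str grouping intact and only changes
the order inside each group, so the page layout logic would not need to
know about the blocks at all.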

One thing: having 5-digit Ext B numbers in the index is going to throw
off your neat grid tabulation. Perhaps the numbers for Ext B can be set
in a smaller typeface?

On a related note, I have 9000 word/char frequencies from Hanyu Pinlu
Cidian (a mainland text; I typed the entries in back in the early '90s,
and this is the freq data currently used in Wenlin). I'd be happy to
give the Consortium access to this data for the purpose of sorting
characters with identical rad/str numbers by frequency.
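
As a rough sketch of how such a frequency sort might work, assuming the
frequency data is available as a mapping from code point to a count: the
names below, the entry layout, and the sample counts are all hypothetical
and are not taken from Hanyu Pinlu Cidian or Wenlin. Characters with no
frequency entry fall to the end of their rad/str group in code point
order.

```python
# Sketch only: data layout, names, and counts are invented for illustration.

def frequency_key(entry, freq):
    """Sort by radical, residual strokes, descending frequency, code point."""
    radical, strokes, cp = entry
    # Negate the count so more frequent characters sort first; a missing
    # entry counts as zero and therefore sorts after every attested one.
    return (radical, strokes, -freq.get(cp, 0), cp)


def sort_by_frequency(entries, freq):
    """Return the entries ordered within each rad/str group by frequency."""
    return sorted(entries, key=lambda e: frequency_key(e, freq))


# Placeholder entries and counts.
sample_entries = [
    (9, 2, 0x4EC0),
    (9, 2, 0x4EC1),
    (9, 2, 0x3430),
    (9, 2, 0x4ECA),
]
sample_freq = {0x4ECA: 500, 0x4EC0: 120, 0x4EC1: 80}

by_frequency = sort_by_frequency(sample_entries, sample_freq)
for radical, strokes, cp in by_frequency:
    print(f"{radical}.{strokes}  U+{cp:04X}  freq={sample_freq.get(cp, 0)}")
```

Since the rare Ext A and Ext B characters are mostly absent from a
9000-entry list, this key also tends to push them to the end of each
group without any explicit block test.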

You're the one most sensitive to what its failings might be, but I think
you did a very good job on the 3.0 index; very nicely done indeed. I
really do look forward to the next version.

-Richard


