Re: CJK Parsing Techniques

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Wed Jul 02 1997 - 11:51:14 EDT


    David> I know this question is a bit off the topic of Unicode, but this
    David> group seems very aware of the latest in text processing.

    David> The parsing of CJK text to find meaning tokens (word equivalents)
    David> seems to be a daunting problem due to lack of word boundaries. Are
    David> there any techniques, tools or algorithms (free or licensable) that
    David> do a good job of parsing "words" out of a CJK string.

Unfortunately, we may be commercializing our Chinese segmenter, so I can't
give anything out right now, but I can point you at a pretty effective, freely
available Japanese segmenter done at Kyoto U. called JUMAN.

  ftp://pine.kuee.kyoto-u.ac.jp/pub/

One thing I want to emphasize: the definition of "word" depends on who you
talk to (at least in Chinese it does). What we do is allow prioritization of
dictionaries, each reflecting what should be considered a segment for
activities like cursor movement and selection.
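
To make the dictionary-priority idea a bit more concrete, here is a minimal
sketch in Python. It is not our segmenter and says nothing about how JUMAN
works; it simply assumes a greedy longest-match strategy, and the function
name `segment`, the `max_len` parameter, and the sample dictionaries are all
hypothetical.

```python
def segment(text, dictionaries, max_len=4):
    """Greedy segmentation; earlier dictionaries take priority over later ones.

    `dictionaries` is a list of sets of strings, highest priority first.
    This is an illustrative assumption, not any real segmenter's algorithm.
    """
    segments = []
    i = 0
    while i < len(text):
        match = None
        for words in dictionaries:  # highest priority first
            # Try the longest candidate this dictionary could hold at position i.
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if candidate in words:
                    match = candidate
                    break
            if match is not None:
                break
        if match is None:
            match = text[i]  # unknown character: emit it as its own segment
        segments.append(match)
        i += len(match)
    return segments


# Hypothetical dictionaries: one treats the compound as a single unit,
# the other lists only its parts.
coarse = {"北京大学"}
fine = {"北京", "大学"}

coarse_first = segment("北京大学", [coarse, fine])
fine_first = segment("北京大学", [fine, coarse])
```

With the coarse dictionary first, the whole string comes back as one segment;
with the fine dictionary first, it is split in two. Which ordering is "right"
depends on what you want a cursor movement or a selection to cover.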
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"A designer knows he has achieved perfection
not when there is nothing left to add, but
when there is nothing left to take away."
    -- Antoine de Saint-Exupéry


