Re: CJK Parsing Techniques

From: Mark Leisher
Date: Wed Jul 02 1997 - 11:51:14 EDT

    David> I know this question is a bit off the topic of Unicode, but this
    David> group seems very aware of the latest in text processing.

    David> The parsing of CJK text to find meaning tokens (word equivalents)
    David> seems to be a daunting problem due to lack of word boundaries. Are
    David> there any techniques, tools or algorithms (free or licensable) that
    David> do a good job of parsing "words" out of a CJK string.

Unfortunately, we may be commercializing our Chinese segmenter, so I can't
give anything out right now, but I can point you at a pretty effective, freely
available Japanese segmenter done at Kyoto U. called JUMAN.

One thing I want to emphasize: the definition of "word" depends on who you
talk to (at least in Chinese it does). Our approach is to allow dictionaries
to be prioritized, with each dictionary reflecting what should be considered
a segment for activities like cursor movement and selection.
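
To make the idea of prioritized dictionaries concrete, here is a minimal
sketch in Python. It is not our segmenter and not how JUMAN works; it is a
plain greedy longest-match pass, and the function name, the max_len
parameter, and the two tiny dictionaries are all made up for illustration.

```python
def segment(text, dictionaries, max_len=4):
    """Greedy forward longest-match segmentation (illustrative only).

    `dictionaries` is a list of sets, ordered from highest to lowest
    priority. At each position the highest-priority dictionary that has
    any match wins, and within it the longest match is taken. Characters
    that match nothing become one-character segments.
    """
    segments = []
    i = 0
    while i < len(text):
        match = None
        for words in dictionaries:
            # Try the longest candidate first, then shrink.
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if candidate in words:
                    match = candidate
                    break
            if match is not None:
                break
        if match is None:
            match = text[i]  # fallback: single character
        segments.append(match)
        i += len(match)
    return segments


# Hypothetical dictionaries: one coarse (say, for selection) and one
# fine-grained (say, for cursor movement).
coarse = {"北京大学"}
fine = {"北京", "大学", "大学生", "学生"}

selection_segments = segment("北京大学生", [coarse, fine])
cursor_segments = segment("北京大学生", [fine, coarse])
```

With the coarse dictionary first, the string comes out as 北京大学 / 生;
with the fine dictionary first, it comes out as 北京 / 大学生. Same text,
different "words", which is the point about the definition depending on
who you ask.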
