Re: CJK Parsing Techniques

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Wed Jul 02 1997 - 11:51:14 EDT

Next message: Mark Davis: "Re: MES as an ISO standard?"
Previous message: Pierre Lewis: "re: Java and UTF"
Maybe in reply to: David C. Brauer: "CJK Parsing Techniques"
Next in thread: Misha Wolf: "Re: CJK Parsing Techniques"

    David> I know this question is a bit off the topic of Unicode, but this
    David> group seems very aware of the latest in text processing.

    David> The parsing of CJK text to find meaning tokens (word equivalents)
    David> seems to be a daunting problem due to lack of word boundaries. Are
    David> there any techniques, tools or algorithms (free or licensable) that
    David> do a good job of parsing "words" out of a CJK string.

Unfortunately, we may be commercializing our Chinese segmenter, so I can't
give anything out right now, but I can point you at a pretty effective, freely
available Japanese segmenter developed at Kyoto University called JUMAN.

ftp://pine.kuee.kyoto-u.ac.jp/pub/

One thing I want to emphasize: the definition of "word" depends on who you
talk to (at least in Chinese it does). What we do is allow prioritization of
dictionaries that reflect what sort of things should be considered segments
for activities like cursor movement and selections.
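
To make the idea of dictionary prioritization concrete, here is a minimal
sketch in Python. It is not the CRL segmenter and not how JUMAN works; it
assumes a plain greedy longest-match strategy, and the function name
`segment`, the `max_len` parameter, and the two toy dictionaries are all
hypothetical, chosen only for illustration. Dictionaries are tried in
priority order at each position, and the first one that yields a match
decides the segment.

```python
def segment(text, dictionaries, max_len=7):
    """Greedy longest-match segmentation over prioritized dictionaries.

    `dictionaries` is a list of sets of words; earlier sets have higher
    priority. This is an illustrative sketch, not a production segmenter.
    """
    segments = []
    i = 0
    while i < len(text):
        match = None
        for words in dictionaries:  # earlier dictionaries win
            # Try the longest candidate first, then shorter ones.
            for length in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + length]
                if candidate in words:
                    match = candidate
                    break
            if match is not None:
                break
        if match is None:
            match = text[i]  # unknown character: emit it as its own segment
        segments.append(match)
        i += len(match)
    return segments


# Toy dictionaries (hypothetical): one coarse-grained, one fine-grained.
coarse = {"中华人民共和国"}
fine = {"中华", "人民", "共和国"}
text = "中华人民共和国"

coarse_first = segment(text, [coarse, fine])
fine_first = segment(text, [fine, coarse])
```

With the coarse dictionary first, the whole string comes back as one
segment; with the fine dictionary first, it splits into three. Which one
is "right" depends on the activity, which is the point about "word" above.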
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher
Computing Research Lab
New Mexico State University
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

"A designer knows he has achieved perfection
not when there is nothing left to add, but
when there is nothing left to take away."
-- Antoine de Saint-Exupéry


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:35 EDT