RE: FW: extracting words

From: Mike Lischke (
Date: Sun Feb 11 2001 - 14:39:38 EST

> If you are willing to give up precision, then you can use heuristics.
> It's ugly but perhaps ok for a simple editor. You can improve the
> precision with better heuristics and more data, so you get to decide
> how much is good enough...

So using whitespace for general word breaking and ideograph boundaries for CJK would be an acceptable
approach? What I wonder about is how to handle all those languages I don't speak or understand (in
fact almost all :-)). Can I use this simple approach for, say, the Cherokee and Arabic scripts too? I
don't even know which scripts use whitespace and which do not.
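
To make the question concrete, here is a rough sketch of the heuristic being discussed, written in
Python purely for illustration. The names heuristic_words, is_ideograph and IDEOGRAPH_RANGES are
made up for this example, and the range table is an assumption that covers only the main CJK
ideograph blocks. It breaks at whitespace and treats every ideograph as a word of its own; it does
nothing about punctuation, kana, hangul, or scripts that do not put spaces between words.

```python
# Sketch only: whitespace breaking plus one-ideograph-per-word for CJK.
# All names here are hypothetical and the range table is deliberately incomplete.
IDEOGRAPH_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def is_ideograph(ch):
    """Return True if ch falls in one of the ideograph ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in IDEOGRAPH_RANGES)


def heuristic_words(text):
    """Split text at whitespace; emit each ideograph as a separate word."""
    words = []
    current = []  # characters of the non-ideograph word being collected

    def flush():
        if current:
            words.append("".join(current))
            current.clear()

    for ch in text:
        if ch.isspace():
            flush()           # whitespace ends the current word
        elif is_ideograph(ch):
            flush()           # an ideograph ends any word before it
            words.append(ch)  # and stands as a word by itself
        else:
            current.append(ch)
    flush()
    return words
```

Calling heuristic_words("abc 中文 def") gives the four items "abc", "中", "文" and "def". Punctuation
stays glued to the neighbouring word, which is one of the places where precision is given up.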

Ciao, Mike

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT