Re: FW: extracting words

From: Tex Texin (
Date: Sun Feb 11 2001 - 13:46:56 EST

If you are willing to give up precision, then you can use heuristics.

The grossest heuristics are not really word breaking at all, but
give users who do not know the language a compatible way of working
with the text. For example, some vendors whose western European
software broke words at spaces simply extended it to break after
each ideograph when moving that software to CJK markets. Although
this is in no way "word" breaking, it gives users predictable
behavior for "control-right-arrow" functions that execute
"next word".

Although it gives some kind of upward and "global" compatibility,
it does mean that next character and next word do pretty much the
same thing for ideographs.
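
To make the heuristic concrete, here is a minimal Python sketch of a
"next word" function along these lines. It is only an illustration,
not any particular product's implementation. The function names are
made up for this example, and the ideograph test is an assumption
that covers just the CJK Unified Ideographs block and Extension A.

```python
def is_ideograph(ch: str) -> bool:
    """Rough test (an assumption): CJK Unified Ideographs and Extension A only."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF


def next_word_boundary(text: str, pos: int) -> int:
    """Return the index where a "next word" command starting at pos should land."""
    n = len(text)
    if pos >= n:
        return n
    if is_ideograph(text[pos]):
        # Each ideograph is treated as its own "word".
        pos += 1
    else:
        # Western-style rule: run to the next space, or stop early at an ideograph.
        while pos < n and not text[pos].isspace() and not is_ideograph(text[pos]):
            pos += 1
    # Skip whitespace so the cursor lands on the start of the next unit.
    while pos < n and text[pos].isspace():
        pos += 1
    return pos


def split_units(text: str) -> list[str]:
    """Split text into the units that repeated "next word" commands would visit."""
    units = []
    pos = 0
    while pos < len(text) and text[pos].isspace():
        pos += 1
    while pos < len(text):
        end = next_word_boundary(text, pos)
        units.append(text[pos:end].strip())
        pos = end
    return units


print(split_units("hello world 中文字 ok"))
```

For a run of ideographs, next_word_boundary advances exactly one
character at a time, which is the "next character and next word do
the same thing" effect described above.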

It's ugly but perhaps OK for a simple editor. You can improve the
results with better heuristics and more data, so you get to decide
how much is good enough...


Mike Lischke wrote:
> >
> > Yes, we have had it for a long time; no, nobody has solved it
> > entirely; and yes, this approach is wrong. Breaking a string into
> > words may require a thorough understanding of the vocabulary and
> > grammar of the language, and even that may not be enough.
> But how can we then ever have a reliable word-break algorithm? It
> cannot be that, say, for a simple editor (be it written in Java or
> whatever) you have to supply a database with language specific
> details just to do automatic word wrap.
> Ciao, Mike

According to Murphy, nothing goes according to Hoyle.
Tex Texin                      Director, International Business      +1-781-280-4271 Fax:+1-781-280-4655
Progress Software Corp.        14 Oak Park, Bedford, MA 01730 #1 Embedded Database

Globalization Program
