Re: extracting words

From: Mark Davis
Date: Sun Feb 11 2001 - 14:19:06 EST

Word break is *very* different from line break; see Chapter 5 of TUS, and the
Linebreak TR. For line break the only tricky language is Thai, since it
requires a dictionary lookup (much like hyphenation in English). Java (and
ICU) supply line-break mechanisms as part of the standard API. They also
supply word break, but it is recognized that these are purely heuristic for
languages such as Chinese and Japanese; the APIs are intended for functions
like double-click selection, not for dividing text into terms for searching.
The latter is a very complex problem, since doing it well requires both
division into words and extraction of roots: e.g. "go" from "went" and
"gone". It is important to keep these very different processes straight:

- line break (wrapping lines on the screen)
- word break (for selection)
- word/root extraction (for search)
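
As a rough sketch of the first two, assuming the standard
java.text.BreakIterator class: the class name BreakDemo and its helper
methods below are made up for illustration and are not part of any API.
Root extraction is not covered here; BreakIterator does not attempt it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; only BreakIterator itself is a real API.
public class BreakDemo {

    // Collects the substrings between consecutive boundaries reported
    // by the given iterator.
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    // Pieces between legal line-wrap opportunities.
    static List<String> lineBreakSegments(String text, Locale locale) {
        return segments(BreakIterator.getLineInstance(locale), text);
    }

    // Pieces between word boundaries, as used for selection.
    static List<String> wordBreakSegments(String text, Locale locale) {
        return segments(BreakIterator.getWordInstance(locale), text);
    }

    public static void main(String[] args) {
        String text = "She went home.";
        System.out.println(lineBreakSegments(text, Locale.ENGLISH));
        System.out.println(wordBreakSegments(text, Locale.ENGLISH));
    }
}
```

For English text like this, the line iterator keeps a trailing space with
the word before it, while the word iterator reports words, spaces and
punctuation as separate pieces. Neither one turns "went" into "go".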

BTW, someone on this thread made this topic out to be even more complex than
it is, claiming that Devanagari and Korean are written without spaces. While
that may have been the case historically, I believe that modern text does use
spaces. Chinese, Japanese and Thai are the main languages written without
spaces.

----- Original Message -----
From: "Mike Lischke"
To: "Unicode List"
Sent: Sunday, February 11, 2001 09:47
Subject: FW: extracting words

> Yes, we have had it for a long time; no, nobody has solved it
> entirely; and yes, this approach is wrong. Breaking a string into
> words may require a thorough understanding of the vocabulary and
> grammar of the language, and even that may not be enough.

But how can we then ever have a reliable word-break algorithm? It cannot be
that, say, for a simple editor (be it written in Java or whatever) you have
to supply a database with language-specific details just to do automatic
word wrap.

Ciao, Mike
