Re: extracting words

From: Edward Cherlin
Date: Sat Feb 10 2001 - 19:22:25 EST

At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
>Hello all,
>I'm writing a C program called Blacklist. Its purpose is to accept
>a string (Unicode) and extract words from it, then hash the found words
>according to a hashing algorithm and see if each word is in the blacklist.
>This is all very straightforward, but the problem is extracting
>words from this string.
>How do I determine what a word is in Japanese or Korean or whatever other
>language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and
they are not required in Korean.
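
To see why a space test fails, here is a minimal sketch in C. It
assumes the input is UTF-8 (the original program's encoding is not
stated) and the function name is only illustrative. It counts
space-delimited tokens the way a naive splitter would.

```c
#include <stddef.h>

/* Naive tokenizer: counts maximal runs of bytes that are not an ASCII
   space, tab, or newline. Assumes UTF-8 input, which is an assumption
   made for this sketch only. */
size_t count_space_delimited_tokens(const char *s)
{
    size_t count = 0;
    int in_token = 0;

    for (; *s != '\0'; s++) {
        int is_space = (*s == ' ' || *s == '\t' || *s == '\n');
        if (is_space) {
            in_token = 0;          /* a separator ends the current token */
        } else if (!in_token) {
            in_token = 1;          /* first byte of a new token */
            count++;
        }
    }
    return count;
}
```

For an English phrase such as "the cat sat" this finds three tokens,
but an unspaced Japanese sentence comes back as a single token no
matter how many words it contains.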

In Devanagari and related scripts a consonant at the end of a word
can join with a vowel at the beginning of the next word in a single
symbol, so you can't just divide the string into segments. There are
other complications in other writing systems.

The problem is not trivial in Latin-alphabet writing, either.
Hyphenated expressions can be quasi-unified words in which one or more
components are not separate words, or they can be ad hoc, even
one-time-only phrases. The definition of a word in a language also
changes over time. "Cannot" is currently one word, but used to be two.
"An adder" used to be "a nadder".

>I think somebody must have had this problem and solved it, or maybe my
>approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it
entirely; and yes, this approach is wrong. Breaking a string into
words may require a thorough understanding of the vocabulary and
grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange
isseoyo (Father is in the bag)?
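
The ambiguity can be reproduced mechanically. The following is a
minimal sketch in C, not a real segmenter: the five-entry lexicon of
romanized forms and the function names are hypothetical, chosen only
to match the example above. It tries every lexicon entry as a prefix
and recurses on the remainder, printing each complete segmentation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy lexicon of romanized forms. A real system would
   need a full dictionary plus grammatical knowledge. */
static const char *LEXICON[] = {
    "abeoji", "abeojiga", "bange", "gabange", "isseoyo"
};
#define LEXICON_SIZE (sizeof LEXICON / sizeof LEXICON[0])
#define MAX_WORDS 16

/* Try every lexicon entry as a prefix of `rest` and recurse on what
   follows. Prints each complete segmentation, one per line, and
   returns how many were found. */
static int segment(const char *rest, const char *words[], int depth)
{
    int found = 0;
    size_t i;

    if (*rest == '\0') {               /* whole string consumed */
        for (int k = 0; k < depth; k++)
            printf("%s%s", k ? " " : "", words[k]);
        printf("\n");
        return 1;
    }
    if (depth == MAX_WORDS)            /* guard the fixed-size buffer */
        return 0;

    for (i = 0; i < LEXICON_SIZE; i++) {
        size_t len = strlen(LEXICON[i]);
        if (strncmp(rest, LEXICON[i], len) == 0) {
            words[depth] = LEXICON[i];
            found += segment(rest + len, words, depth + 1);
        }
    }
    return found;
}

/* Returns the number of ways `text` splits into lexicon entries. */
int count_segmentations(const char *text)
{
    const char *words[MAX_WORDS];
    return segment(text, words, 0);
}
```

Called on "abeojigabangeisseoyo", this finds two segmentations,
"abeoji gabange isseoyo" and "abeojiga bange isseoyo". Both are valid
against the lexicon, so a dictionary lookup alone cannot choose
between them.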

>I hope somebody can give me some good pointers, directions or suggestions.
>Thanks for your time,
>Brahim Mouhdi


Edward Cherlin, Spamfighter
"It isn't what you don't know that hurts you, it's what you know that ain't so."
--Mark Twain, or else some other prominent 19th century humorist and wit
