RE: extracting words

From: Makarand Gadre (makag@microsoft.com)
Date: Sun Feb 11 2001 - 02:29:20 EST


Like Edward said, getting words from a string is nontrivial. You get similar
issues in Thai. Thai does not have any spaces between words, but the script
is Indic based (phonetic). You have to keep looking candidate words up in the
speller's dictionary, and even then it can't be correct for all cases. E.g.

Sunday or therapist could each be interpreted as two words (sun & day, for
instance) when the user meant the single word. In Sanskrit, you can create
new words by doing a "sandhi" or conjunction.
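
To make the Sunday case concrete, here is a minimal C sketch of greedy
dictionary lookup. The three-word dictionary, the function name
segment_greedy and the prefer_longest flag are all assumptions made up for
illustration; a real speller would be far larger and would work on Unicode
text rather than ASCII.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy dictionary, an assumption made for this sketch. */
static const char *const TOY_WORDS[] = { "sun", "day", "sunday" };
static const size_t TOY_WORD_COUNT = sizeof TOY_WORDS / sizeof TOY_WORDS[0];

/*
 * Greedy dictionary segmentation of `text` into `out`, with the words
 * separated by single spaces. At each position the longest matching
 * dictionary word is taken if prefer_longest is nonzero, otherwise the
 * shortest. Returns the number of words written, or -1 if no dictionary
 * word matches at some position or `out` is too small.
 */
int segment_greedy(const char *text, int prefer_longest,
                   char *out, size_t out_size)
{
    size_t pos = 0, used = 0, text_len;
    int words = 0;

    assert(text != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    text_len = strlen(text);

    while (pos < text_len) {
        size_t best = 0;   /* length of the chosen word, 0 = none yet */

        for (size_t i = 0; i < TOY_WORD_COUNT; i++) {
            size_t len = strlen(TOY_WORDS[i]);
            if (strncmp(text + pos, TOY_WORDS[i], len) != 0)
                continue;
            if (best == 0 || (prefer_longest ? len > best : len < best))
                best = len;
        }
        if (best == 0)
            return -1;     /* no dictionary word fits here */

        /* Room for an optional separator, the word, and the terminator. */
        if (used + (words > 0 ? 1 : 0) + best + 1 > out_size)
            return -1;
        if (words > 0)
            out[used++] = ' ';
        memcpy(out + used, text + pos, best);
        used += best;
        out[used] = '\0';
        pos += best;
        words++;
    }
    return words;
}
```

With this toy dictionary, calling segment_greedy on "sunday" gives "sunday"
when the longest match is preferred and "sun day" when the shortest is
preferred. Both are valid according to the dictionary, so the lookup by
itself cannot tell which one the user meant.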

Makarand

-----Original Message-----
From: Edward Cherlin [mailto:edward.cherlin.sy.67@aya.yale.edu]
Sent: Sunday, 11 February, 2001 05:34
To: Unicode List
Subject: Re: extracting words

At 1:03 AM -0800 1/29/01, Brahim Mouhdi wrote:
>Hello all,
>
>I'm writing a C program called Blacklist. Its purpose is to
>accept a string (Unicode) and extract words from it, then hash the
>found words according to a hashing algorithm and see if each word is in
>the blacklist hash table.
>
>This is all very straightforward, but the problem is the extracting of
>words from this string. How do I determine what a word is in Japanese or
>Korean or whatever other language? { a space ? }

No. Chinese and Japanese almost never have spaces between words, and
they are not required in Korean.
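
As a rough illustration of why the space test fails, here is a minimal C
sketch of the approach described in the question: split on ASCII spaces,
hash each token, and compare against the blacklist. Everything named here is
an assumption for the sake of the example: the function names, the choice of
FNV-1a as the hash, and the linear scan that stands in for a real hash table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a over `len` bytes; the choice of hash is an assumption. */
static uint32_t fnv1a(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Split `text` on ASCII spaces and return 1 if any token equals one of
 * the `n` entries of `blacklist`, else 0. Hashes are compared first and
 * the bytes are compared only when the hashes agree.
 */
int contains_blacklisted(const char *text,
                         const char *const *blacklist, size_t n)
{
    const char *p = text;

    assert(text != NULL);
    while (*p != '\0') {
        while (*p == ' ')
            p++;                          /* skip runs of spaces */
        size_t len = strcspn(p, " ");     /* length of the next token */
        if (len == 0)
            break;                        /* only trailing spaces were left */

        uint32_t h = fnv1a(p, len);
        for (size_t i = 0; i < n; i++) {
            size_t blen = strlen(blacklist[i]);
            if (h == fnv1a(blacklist[i], blen) &&
                blen == len && memcmp(p, blacklist[i], len) == 0)
                return 1;
        }
        p += len;
    }
    return 0;
}
```

With a one-entry blacklist containing "spam", the text "buy spam now" is
caught, but "buyspamnow" is treated as a single token and passes. Text in a
language written without spaces behaves like the second case all the time.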

In Devanagari and related scripts a consonant at the end of a word
can join with a vowel at the beginning of the next word in a single
symbol, so you can't just divide the string into segments. There are
other complications in other writing systems.

The problem is not trivial in Latin alphabet writing, either.
Hyphenated expressions can be quasi-unified words where one or more
components is not a separate word, or ad-hoc, even one-time-only
phrases. The definition of words in a language is also changing.
"Cannot" is currently one word, but used to be two. "An adder" used
to be "a nadder".

>I think somebody must have had this problem and solved it, or maybe my
>approach to the problem is wrong.

Yes, we have had it for a long time; no, nobody has solved it
entirely; and yes, this approach is wrong. Breaking a string into
words may require a thorough understanding of the vocabulary and
grammar of the language, and even that may not be enough.

An example from Korean: Abeojigabangeisseoyo. Should this be segmented as
Abeojiga bange isseoyo (Father is in the room), or as Abeoji gabange
isseoyo (Father is in the bag)?
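
The ambiguity above can be reproduced mechanically. The following C sketch
enumerates every way to cover the romanized, lowercased string with entries
from a small word list. The word list, the function names and the buffer
size are assumptions for illustration only; real Korean text would be in
Hangul and would need a real lexicon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Toy list of romanized surface forms, an assumption for this sketch. */
static const char *const KO_WORDS[] = {
    "abeoji", "abeojiga", "bange", "gabange", "isseoyo"
};
static const size_t KO_WORD_COUNT = sizeof KO_WORDS / sizeof KO_WORDS[0];

/*
 * Try every dictionary word at the start of `rest`. `line` holds the
 * words chosen so far (`used` bytes, NUL-terminated, capacity `cap`).
 * Prints each complete segmentation and returns how many were found.
 */
static int enumerate_from(const char *rest, char *line,
                          size_t used, size_t cap)
{
    int found = 0;

    if (*rest == '\0') {
        printf("%s\n", line);             /* whole input is covered */
        return 1;
    }
    for (size_t i = 0; i < KO_WORD_COUNT; i++) {
        size_t len = strlen(KO_WORDS[i]);
        if (strncmp(rest, KO_WORDS[i], len) != 0)
            continue;
        size_t sep = used > 0 ? 1 : 0;    /* space before all but the first */
        if (used + sep + len + 1 > cap)
            continue;                     /* would not fit; skip this branch */
        if (sep)
            line[used] = ' ';
        memcpy(line + used + sep, KO_WORDS[i], len);
        line[used + sep + len] = '\0';
        found += enumerate_from(rest + len, line, used + sep + len, cap);
    }
    return found;
}

/* Print all segmentations of `text` and return their number. */
int count_segmentations(const char *text)
{
    char line[256];

    assert(text != NULL);
    line[0] = '\0';
    return enumerate_from(text, line, 0, sizeof line);
}
```

Calling count_segmentations("abeojigabangeisseoyo") prints both
"abeoji gabange isseoyo" and "abeojiga bange isseoyo" and returns 2. The
word list accepts both readings, so choosing between them takes knowledge
of grammar and context that no word list supplies.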

>I hope somebody can give me some good pointers, directions or
>suggestions.
>
>Thanks for your time,
>
>
>Brahim Mouhdi
>
>{42.}

-- 

Edward Cherlin, Spamfighter <http://www.cauce.org>
"It isn't what you don't know that hurts you, it's what you know
that ain't so."--Mark Twain, or else some other prominent 19th
century humorist and wit



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT