RE: extracting words

From: Christopher John Fynn (cfynn@druknet.net.bt)
Date: Mon Jan 29 2001 - 20:06:10 EST


You may have to apply different rules depending on the script. In Indic scripts there are often no explicit word boundary markers, and you may have to look for grammatical particles. In Tibetan, a string of letters and vowel signs between two tsheg characters [U+0F0B / U+0F0C] (or other "punctuation") is a morpheme, which is not that different from a word, but there are many complex words consisting of two or more such morphemes. (I don't know any CJK languages, but I suspect that most individual characters in that block are morphemes as well.)
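As a rough illustration of the Tibetan case, here is a minimal sketch that splits a string into the runs found between tsheg marks. It is written in Python for brevity; the same loop should carry over to C over an array of code points. The function name `split_tibetan_syllables` and the delimiter set are assumptions made for this example, and the set is certainly not exhaustive. Note that this only yields tsheg-delimited units, so joining them into multi-morpheme words would still need a dictionary or grammatical rules.

```python
# Sketch only: the names and the delimiter set below are illustrative assumptions.
TSHEG = "\u0F0B"      # TIBETAN MARK INTERSYLLABIC TSHEG
TSHEG_NB = "\u0F0C"   # TIBETAN MARK DELIMITER TSHEG BSTAR (non-breaking tsheg)
SHAD = "\u0F0D"       # TIBETAN MARK SHAD, one example of "other punctuation"
DELIMITERS = {TSHEG, TSHEG_NB, SHAD, " "}


def split_tibetan_syllables(text):
    """Return the non-empty runs of characters found between delimiters."""
    syllables = []
    current = []
    for ch in text:
        if ch in DELIMITERS:
            # A delimiter closes the run collected so far, if any.
            if current:
                syllables.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Keep a trailing run that is not followed by a delimiter.
    if current:
        syllables.append("".join(current))
    return syllables


if __name__ == "__main__":
    # "bod yig" written as two tsheg-separated units, followed by a shad.
    sample = "\u0F56\u0F7C\u0F51\u0F0B\u0F61\u0F72\u0F42\u0F0D"
    # ascii() keeps the output printable on terminals without Tibetan fonts.
    print(ascii(split_tibetan_syllables(sample)))
```

Each returned unit could then be hashed and looked up individually, but a blacklist entry that spans several such units would need to be matched against sequences of them.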

BTW, without determining the language as well as the script, how do you propose to determine whether a particular string actually matches a word in your "blacklist" (in terms of meaning)? The same string of characters might mean completely different things in two languages that share the same script (or Unicode block).
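One way to picture the problem is to key the blacklist by language rather than by script. The sketch below is in Python and is purely hypothetical: the `BLACKLIST` table, its language tags, its single entry, and the `is_blacklisted` helper are made up for illustration, and it assumes the language of the input has already been determined by some other means.

```python
# Hypothetical data: the language tags and entries are invented for illustration.
BLACKLIST = {
    "de": {"gift"},   # German "Gift" means "poison"
    "en": set(),      # English "gift" is harmless in this example
}


def is_blacklisted(word, language):
    """Look the word up only in the blacklist for the given language."""
    # casefold() gives a caseless comparison; unknown languages match nothing.
    return word.casefold() in BLACKLIST.get(language, set())


if __name__ == "__main__":
    print(is_blacklisted("Gift", "de"))
    print(is_blacklisted("gift", "en"))
```

With a single script-wide table, the two lookups above could not be told apart.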

- Chris

> -----Original Message-----
> From: Brahim Mouhdi [mailto:brahim.mouhdi@cmg.nl]
> Sent: Monday, January 29, 2001 1:03 AM
> To: Unicode List
> Subject: extracting words
>
>
>
> Hello all,
>
> I'm writing a C program called Blacklist. Its purpose is to accept
> a string (Unicode) and extract words from it, then hash the found words
> according to a hashing algorithm and see if each word is in the blacklist
> hashtable.
>
> This is all very straightforward, but the problem is extracting the
> words from this string.
> How do I determine what a word is in Japanese or Korean or whatever other
> language? { a space ? }
> I think somebody must have had this problem and solved it, or maybe my
> approach to the problem is wrong.
>
> I hope somebody can give me some good pointers, directions or suggestions.
>
> Thanks for your time,
>
>
> Brahim Mouhdi
>
> {42.}


