Re: extracting words

From: Lukas Pietsch (pietsch@mail.uni-freiburg.de)
Date: Mon Jan 29 2001 - 07:49:37 EST


Christopher Fynn wrote:

>BTW without determining the language as well as the script, how do you
>propose to determine if a particular string actually matches a word in
>your "blacklist" (in terms of meaning) or not? The same string of
>characters might mean completely different things in two languages that
>share the same script (/Unicode block).

This is assuming that what we want is not just a matching of
*orthographical* words (character strings), but of *lexicographical* words
(aka lexemes). Which of course brings with it even more problems. If you
want to filter out all occurrences of, say, a particular verb, you'll have
to look out for all possible grammatical forms of that verb. Five forms at
most for nearly every English verb (go, goes, went, gone, going; only "be"
has more), but maybe several hundred in a heavily inflectional or
agglutinative language. In some
languages the set of possible forms of a lexeme may even be open-ended. No
way of doing that without a full-blown morphological parser (which of
course would have to be language-specific). Looks like this goes a bit
beyond what Brahim is planning to do.
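
To make the difference concrete, here is a minimal Python sketch of the two kinds of matching. It is only an illustration under loud assumptions: the table FORMS_BY_LEXEME is a hypothetical, hand-written list for a single English verb, and the function names are made up for this example. It is not Brahim's design, and it is exactly the part that would not scale to a heavily inflecting language, where a language-specific morphological parser would have to take the place of the table.

```python
import re

# Hypothetical, hand-written table of inflected forms for one English lexeme.
# A real system would need a language-specific morphological parser instead.
FORMS_BY_LEXEME = {
    "go": {"go", "goes", "went", "gone", "going"},
}

# Invert the table: orthographic word -> lexeme it belongs to.
LEXEME_BY_FORM = {
    form: lexeme
    for lexeme, forms in FORMS_BY_LEXEME.items()
    for form in forms
}


def tokenize(text):
    """Very naive tokenizer: lowercase runs of ASCII letters only."""
    return re.findall(r"[a-z]+", text.lower())


def orthographic_hits(text, blacklist):
    """Return tokens that equal a blacklisted character string."""
    return [t for t in tokenize(text) if t in blacklist]


def lexeme_hits(text, blacklist):
    """Return tokens whose lexeme (per the toy table) is blacklisted."""
    return [t for t in tokenize(text) if LEXEME_BY_FORM.get(t) in blacklist]


text = "She went home, and he goes tomorrow."
print(orthographic_hits(text, {"go"}))  # []
print(lexeme_hits(text, {"go"}))        # ['went', 'goes']
```

Plain string matching misses "went" and "goes" entirely; the lexeme lookup catches them, but only because every form was listed by hand in advance.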

Lukas



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT