Re: extracting words

From: John Cowan (jcowan@reutershealth.com)
Date: Mon Jan 29 2001 - 11:13:58 EST


Lukas Pietsch wrote:

> This is assuming that what we want is not just a matching of
> *orthographical* words (character strings), but of *lexicographical* words
> (aka lexemes).

But that is impossible in general in fully cross-linguistic situations.
There is simply nothing to be done about the fact that "such" is a very
common, perfectly harmless word in English, whereas
in Nootka (an Amerindian language of the Pacific
Northwest) it is a vulgarism for the external female genitalia.
A properly multilingual vulgarism-remover would have to
determine whether the document was English or Nootka before
deciding whether to block "such".
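
As a rough sketch of what that means in practice (the language labels,
the table, and the function names are all hypothetical, and the only
entry is the example above), the check has to be keyed on the language
as well as on the string:

```python
# Hypothetical per-language blocklists. The single entry is the example
# from this message; a real filter would need far more data per language.
BLOCKLISTS = {
    "english": set(),
    "nootka": {"such"},
}

def is_blocked(word, language):
    """Return True if `word` is on the blocklist for `language`."""
    return word.lower() in BLOCKLISTS.get(language, set())

def filter_words(words, language):
    """Mask blocked words with asterisks.

    The caller must already know the document's language; nothing here
    tries to detect it.
    """
    return ["*" * len(w) if is_blocked(w, language) else w for w in words]
```

The same string gets a different answer depending on the language
argument, so the hard part (deciding which language the document is in)
is left entirely to the caller.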

-- 
There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein


