Re: extracting words

From: John Cowan
Date: Mon Jan 29 2001 - 11:13:58 EST

Lukas Pietsch wrote:

> This is assuming that what we want is not just a matching of
> *orthographical* words (character strings), but of *lexicographical* words
> (aka lexemes).

But in general that is impossible in fully cross-linguistic situations.
There is simply nothing to do about the fact that "such" is a very
common word, perfectly harmless, in the English language; whereas
in the Nootka language (an Amerindian language of the Pacific
Northwest) it is a vulgarism for the external female genitalia.
A properly multilingual vulgarism-remover would have to
determine whether the document was English or Nootka before
deciding whether to block "such".
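
A minimal sketch of that decision in Python, assuming the document's
language has already been determined somehow and is passed in as a plain
label. The names BLOCKLISTS and blocked_words and the labels "english"
and "nootka" are hypothetical, and the only blocklist entry is the
example from this message, not a real word list.

```python
import re

# Hypothetical per-language blocklists. The single Nootka entry is the
# example from the message above; English deliberately does not list it.
BLOCKLISTS = {
    "english": set(),
    "nootka": {"such"},
}

# Orthographic words only: runs of word characters.
WORD = re.compile(r"\w+")


def blocked_words(text, language):
    """Return the words in `text` that are blocked for `language`."""
    # An unknown language gets an empty blocklist, so nothing is blocked.
    blocklist = BLOCKLISTS.get(language, set())
    return [w for w in WORD.findall(text) if w.lower() in blocklist]
```

The same string is judged differently depending on the language label,
which is the whole point: without that label the filter cannot decide.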

There is / one art             || John Cowan
no more / no less              ||
to do / all things             ||
with art- / lessness           \\ -- Piet Hein
