Lukas Pietsch wrote:

> This is assuming that what we want is not just a matching of
> *orthographical* words (character strings), but of *lexicographical* words
> (aka lexemes).

But in general that is impossible in fully cross-linguistic situations.
There is simply nothing to do about the fact that "such" is a very
common word, perfectly harmless, in the English language; whereas
in the Nootka language (an Amerindian language of the Pacific
Northwest) it is a vulgarism for the external female genitalia.
A properly multilingual vulgarism-remover would have to
determine whether the document was English or Nootka before
deciding whether to block "such".
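
To make the point concrete, here is a rough sketch in Python of what
"decide the language first, then filter" might look like. Everything in
it is an assumption for illustration: the BLOCKLISTS table, the language
labels "en" and "nootka", and the censor() function are all made up, and
the caller is assumed to already know the document's language, which is
of course the hard part.

```python
import re

# Hypothetical per-language blocklists. The only entry taken from the
# discussion above is "such" for Nootka; the labels are illustrative.
BLOCKLISTS = {
    "en": set(),
    "nootka": {"such"},
}

WORD = re.compile(r"\w+")


def censor(text, language):
    """Mask words that are blocked in the given language only."""
    # Unknown languages get an empty blocklist, so nothing is masked.
    blocked = BLOCKLISTS.get(language, set())

    def mask(match):
        word = match.group(0)
        # Compare case-insensitively; keep the word's length when masking.
        return "*" * len(word) if word.lower() in blocked else word

    return WORD.sub(mask, text)


print(censor("such a thing", "en"))
print(censor("such a thing", "nootka"))
```

The same string is left alone under one language label and masked under
the other, so a filter without the language decision cannot get both
cases right.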

