Re: extracting words

From: John Cowan (jcowan@reutershealth.com)
Date: Mon Jan 29 2001 - 11:13:58 EST

Next message: John Cowan: "Re: Benefits of Unicode"
Previous message: Richard, Francois M: "RE: Benefits of Unicode"
Maybe in reply to: Brahim Mouhdi: "extracting words"
Next in thread: Edward Cherlin: "Re: extracting words"

Lukas Pietsch wrote:

> This is assuming that what we want is not just a matching of
> *orthographical* words (character strings), but of *lexicographical* words
> (aka lexemes).

But in a fully cross-linguistic setting that is impossible in general.
There is simply nothing to be done about the fact that "such" is a very
common word, perfectly harmless, in the English language, whereas
in the Nootka language (an Amerindian language of the Pacific
Northwest) it is a vulgarism for the external female genitalia.
A properly multilingual vulgarism-remover would have to
determine whether the document was English or Nootka before
deciding whether to block "such".
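
To make the point concrete, here is a minimal sketch of a filter whose
blocklist is selected by the document's language. It assumes the language
is already known and passed in by the caller; detecting it is the hard
part and is not attempted here. The names BLOCKLISTS and redact, and the
language keys, are hypothetical and only for illustration.

```python
import re

# Hypothetical per-language blocklists. The keys and contents are
# illustrative assumptions, not a real lexicon.
BLOCKLISTS = {
    "english": set(),
    "nootka": {"such"},
}

def redact(text, language):
    """Mask words that appear on the blocklist for the given language only."""
    # An unknown language gets an empty blocklist, so nothing is masked.
    blocked = BLOCKLISTS.get(language, set())

    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in blocked else word

    # \w+ is a crude stand-in for orthographic word segmentation.
    return re.sub(r"\w+", mask, text)
```

The same string is left alone when the caller says the document is
English and masked when it says Nootka, so the decision depends entirely
on a language label that the text itself does not supply.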

-- 
There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT