Re: Diacritic and similar foldings and spam filtering

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Jul 09 2004 - 02:59:52 CDT


    On 09/07/2004 00:01, Kenneth Whistler wrote:

    >Peter Kirk said:
    >
    >>I made a serious point, not apparently made in the UTR draft, that
    >>diacritic folding may be useful for spam filtering and similar
    >>applications including finding misleading URIs.
    >
    >This seems like a reasonable point to make and to add to the discussion
    >of folding in UTR #30.
    >
    >>António suggested a
    >>serious point that for more comprehensive spam filtering an enhanced
    >>folding might be useful, including such foldings as | > I (capital i)
    >>and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be
    >>feasible and useful?
    >
    >Well, someone could try, I suppose, but this stuff tails out pretty
    >rapidly into mind-boggling complexity, ...
    >

    Indeed. I wouldn't suggest going beyond clearly shape-based foldings.
    But it is hard to know where to draw the line, which is another reason,
    in addition to /|/|ike's good ones, for not trying to standardise this.
    Still, this kind of approach based on UTR #30 may be helpful to
    developers of spam filters.
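
    For illustration only, here is a rough sketch of how a filter might
    combine the two kinds of folding discussed above. It is not taken from
    UTR #30: the diacritic step just strips combining marks after canonical
    decomposition, and the LOOKALIKES table and all function names are
    hypothetical, built only from the examples in this thread.

```python
import unicodedata

# Hypothetical shape-based table, limited to the examples in this thread.
# Multi-character sequences come first so they are matched before the
# single-character "|" rule can break them up.
LOOKALIKES = [
    ("|\\/|", "M"),
    ("/|/|", "M"),
    ("|", "I"),
    ("0", "O"),
]


def fold_diacritics(text):
    """Decompose, drop combining marks, then recompose what is left."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def fold_lookalikes(text):
    """Apply the shape-based substitutions in table order."""
    for source, target in LOOKALIKES:
        text = text.replace(source, target)
    return text


def fold_for_matching(text):
    """Fold diacritics, then lookalikes, then case, for comparison only."""
    return fold_lookalikes(fold_diacritics(text)).casefold()


if __name__ == "__main__":
    for sample in ["António", "/|/|ike", "M0NEY"]:
        print(sample, "->", fold_for_matching(sample))
```

    Note that the folded text is only suitable as a matching key, not for
    display: the zero rule also rewrites genuine digits (2004 becomes 2OO4
    before case folding), which is a small taste of the complexity
    mentioned above.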

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

