Re: Diacritic and similar foldings and spam filtering

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 08 2004 - 18:01:12 CDT


    Peter Kirk said:

    > I made a serious point, not apparently made in the UTR draft, that
    > diacritic folding may be useful for spam filtering and similar
    > applications including finding misleading URIs.

    This seems like a reasonable point to make and to add to the discussion
    of folding in UTR #30.
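
    As a rough illustration of what diacritic folding could look like for
    a filter, here is a minimal sketch in Python. It assumes that
    "folding" means decomposing to NFD and dropping combining marks, which
    is only an approximation of the foldings discussed in UTR #30; the
    function name fold_diacritics is made up for this example.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Approximate diacritic folding: decompose, then drop combining marks.

    This is an assumption-laden sketch, not the UTR #30 definition.
    """
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a nonzero class for combining marks such as accents.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

print(fold_diacritics("Vïágrà"))
print(fold_diacritics("António"))
```

    A filter would compare the folded text, not the original, against its
    word list. Letters with no canonical decomposition (for example "ø")
    pass through unchanged, so a real folding needs more than this.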

    > António suggested a
    > serious point that for more comprehensive spam filtering an enhanced
    > folding might be useful, including such foldings as | > I (capital i)
    > and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be
    > feasible and useful?

    Well, someone could try, I suppose, but this stuff tails out pretty
    rapidly into mind-boggling complexity, because leetspeak (1337) is
    deliberately obscurantist in its own right, let alone as a
    spoofing technique to fool spam filters:

    http://en.wikipedia.org/wiki/Leet

    You can't just fold this stuff into "English" by some kind of
    set of transliteration tables -- it really requires an elaborate
    system of lexical replacement. It's a *cant* as well as an
    obscurantist orthography.

    And the leetspeak interleaves with another entire set of conventions
    for chatroom abbreviations ("cya l8r"), and it also grades off
    into Gangsta.

    > They would have to be part of a general similar
    > shapes folding.

    I think it goes way beyond that. The first level of similar
    shapes folding appropriate to Unicode is simply the normal,
    shape-based confusion that the well-meaning user of the
    characters may have to deal with.

    But 1337 can treat "><" as equivalent to "x" and "xXoRs" as
    equivalent to "x". The first is somewhat shape-based, but
    the latter is just lexical conventions at work.
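
    To make the distinction concrete, here is a small sketch of a purely
    shape-based fold in Python. The table SHAPE_FOLDS and the function
    fold_shapes are hypothetical, and the entries are just a handful of
    illustrative lookalikes, not a proposed standard mapping.

```python
# Hypothetical lookalike table; entries are illustrative only.
SHAPE_FOLDS = {
    "|\\/|": "m",  # the four characters | \ / |
    "><": "x",
    "0": "o",
    "1": "l",
    "3": "e",
    "7": "t",
}

def fold_shapes(text: str) -> str:
    """Replace lookalike sequences, trying the longest keys first."""
    keys = sorted(SHAPE_FOLDS, key=len, reverse=True)
    lowered = text.lower()
    out = []
    i = 0
    while i < len(lowered):
        for key in keys:
            if lowered.startswith(key, i):
                out.append(SHAPE_FOLDS[key])
                i += len(key)
                break
        else:
            # No lookalike starts here; keep the character as-is.
            out.append(lowered[i])
            i += 1
    return "".join(out)

print(fold_shapes("1337"))
print(fold_shapes("|\\/|ail"))
print(fold_shapes("xXoRs"))
```

    The table handles "1337" and "><" well enough, but "xXoRs" only gets
    lowercased: nothing about its shape points to the intended word, so a
    lexicon would be needed. The same table also turns an innocent "100"
    into "loo", which is the kind of false positive a filter would have
    to live with.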

    >
    > Could something like this be defined within the framework of UTR #30?

    I think it's out of scope.

    > Should it be defined within the UTR? I suspect it would be better left
    > to the discretion of individual developers, who could then rapidly
    > tailor their foldings to any new lookalikes exploited by spammers.

    This particular war is currently being won by the spammers,
    by the way.

    --Ken


