From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 08 2004 - 18:01:12 CDT
Peter Kirk said:
> I made a serious point, not apparently made in the UTR draft, that
> diacritic folding may be useful for spam filtering and similar
> applications including finding misleading URIs.
This seems like a reasonable point to make and to add to the discussion
of folding in UTR #30.
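A minimal sketch of what such diacritic folding could look like, assuming Python's standard unicodedata module. The function name fold_diacritics is hypothetical, and this is not the folding defined in the UTR #30 draft, only one simple approximation of it:

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Strip combining marks so that accented letters compare as base letters."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop nonspacing marks (general category Mn); keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose whatever is left into NFC for stable comparison.
    return unicodedata.normalize("NFC", stripped)

print(fold_diacritics("Vìágrá"))
```

Letters that have no canonical decomposition (for example "ø") pass through unchanged, so a real filter would presumably need extra mappings beyond this.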
> António suggested a
> serious point that for more comprehensive spam filtering an enhanced
> folding might be useful, including such foldings as | > I (capital i)
> and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be
> feasible and useful?
Well, someone could try, I suppose, but this stuff tails out pretty
rapidly into mind-boggling complexity, because leetspeak (1337) is
deliberately obscurantist in its own right, let alone as a
spoofing technique to fool spam filters:
http://en.wikipedia.org/wiki/Leet
You can't just fold this stuff into "English" by some kind of
set of transliteration tables -- it really requires an elaborate
system of lexical replacement. It's a *cant* as well as an
obscurantist orthography.
And the leetspeak interleaves with another entire set of conventions
for chatroom abbreviations ("cya l8r"), and it also grades off
into Gangsta.
> They would have to be part of a general similar
> shapes folding.
I think it goes way beyond that. The first level of similar
shapes folding appropriate to Unicode is simply the normal,
shape-based confusion that the well-meaning user of the
characters may have to deal with.
But 1337 can treat "><" as equivalent to "x" and "xXoRs" as
equivalent to "x". The first is somewhat shape-based, but
the latter is just lexical conventions at work.
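To make the distinction concrete, here is a rough sketch of a table-driven similar-shapes folding, using the examples from this thread. The table LOOKALIKES and the function fold_lookalikes are hypothetical names, the entries are illustrative rather than taken from any standard, and the choice of "l" as the single representative for | / I / 1 is an arbitrary assumption:

```python
# Hypothetical lookalike table: each key folds to one representative.
# Multi-character sequences such as |\/| sit alongside single characters.
LOOKALIKES = {
    "|\\/|": "m",
    "><": "x",
    "|": "l",
    "I": "l",
    "1": "l",
    "0": "o",
}

def fold_lookalikes(text: str) -> str:
    """Fold shape-based lookalikes, trying the longest sequences first."""
    keys = sorted(LOOKALIKES, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(LOOKALIKES[key])
                i += len(key)
                break
        else:
            # No table entry matched here: just lowercase the character.
            out.append(text[i].lower())
            i += 1
    return "".join(out)

# Both sides of a comparison have to be folded the same way.
print(fold_lookalikes("PayPaI") == fold_lookalikes("paypal"))
print(fold_lookalikes("xXoRs"))
```

The shape-based cases fold as intended, but "xXoRs" only comes out lowercased: no table of shapes says what it stands for, which is the lexical part that a folding cannot reach.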
>
> Could something like this be defined within the framework of UTR #30?
I think it's out of scope.
> Should it be defined within the UTR? I suspect it would be better left
> to the discretion of individual developers, who could then rapidly
> tailor their foldings to any new lookalikes exploited by spammers.
This particular war is currently being won by the spammers,
by the way.
--Ken