Diacritic and similar foldings and spam filtering

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jul 08 2004 - 16:35:32 CDT

Next message: Doug Ewell: "Re: UTF Magic Pocket Encoders"

Previous message: Donald Z. Osborn: "Re: alphabetic sorting of IPA and other derived letters"
Next in thread: Doug Ewell: "Re: Diacritic and similar foldings and spam filtering"
Reply: Doug Ewell: "Re: Diacritic and similar foldings and spam filtering"
Maybe reply: Kenneth Whistler: "Re: Diacritic and similar foldings and spam filtering"
Maybe reply: Mike Ayers: "RE: Diacritic and similar foldings and spam filtering"
Maybe reply: Mike Ayers: "RE: Diacritic and similar foldings and spam filtering"

As Sarasvati points out, the thread "Looking for transcription or
transliteration standards latin- >arabic" had gone way off topic; also I
understand that some might find António's examples inappropriate. But
the discussion of diacritic and similar foldings is an important one,
relevant to Unicode and specifically to the UTR #30 draft. The public
review period for this has now finished, but in the version to be
reviewed, http://www.unicode.org/reports/tr30/tr30-3.html, the data file
for DiacriticRemoval is still "TBD". Is there in fact now a released
data file, or a draft of one, for this folding?

I made a serious point, not apparently made in the UTR draft, that
diacritic folding may be useful for spam filtering and similar
applications, including finding misleading URIs. António made the
serious point that for more comprehensive spam filtering an enhanced
folding might be useful, including such foldings as | > I (capital i)
and l (small L), 0 (zero) > O, |\/| > M. Would such foldings in fact be
feasible and useful? They would have to be part of a general
similar-shapes folding. Such a folding would also need to handle cases
like Cyrillic A and Greek capital alpha > A, since, with the whole of
Unicode available, spammers could very easily write ЅРАМ (Cyrillic) or
SΡΑΜ (mostly Greek) instead of SPAM in an attempt to defeat spam filtering.
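
To make the idea concrete, here is a minimal sketch in Python of the two
steps discussed above: diacritic removal followed by a similar-shapes
folding. The LOOKALIKES table and the function names are hypothetical and
invented for illustration only; they are not taken from the UTR #30 draft
or its data files, and a usable table would need to be far larger. The
sketch also ignores the ambiguity noted above, folding | to capital I only.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (an assumption for
# illustration; not data from UTR #30). Each key is a single character.
LOOKALIKES = {
    "\u0405": "S",  # CYRILLIC CAPITAL LETTER DZE
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u041c": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u03a1": "P",  # GREEK CAPITAL LETTER RHO
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u039c": "M",  # GREEK CAPITAL LETTER MU
    "0": "O",       # DIGIT ZERO
    "|": "I",       # VERTICAL LINE
}


def remove_diacritics(text):
    """Decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_lookalikes(text):
    """Replace each character that has a table entry; keep the rest."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def fold_for_matching(text):
    """Diacritic removal first, then the similar-shapes folding."""
    return fold_lookalikes(remove_diacritics(text))


if __name__ == "__main__":
    # All-Cyrillic and mostly-Greek spellings both fold to Latin "SPAM".
    print(fold_for_matching("\u0405\u0420\u0410\u041c"))
    print(fold_for_matching("S\u03a1\u0391\u039c"))
```

A filter would apply fold_for_matching to both the incoming text and its
own keyword list before comparing them, so that the folded forms, not the
original strings, are what get matched.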

Could something like this be defined within the framework of UTR #30?
Should it be defined within the UTR? I suspect it would be better left
to the discretion of individual developers, who could then rapidly
tailor their foldings to any new lookalikes exploited by spammers.
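
As a rough sketch of what such developer tailoring might look like, the
Python below builds a folding function from a rule table that a developer
can extend at any time, including multi-character sources such as
|\/| > M. The names make_folder, BASE_RULES and TAILORED_RULES are
hypothetical, and the longest-match-first strategy is just one simple
choice assumed here, not something UTR #30 prescribes.

```python
def make_folder(rules):
    """Build a folding function from a {source: target} rule table.

    Sources may be longer than one character but must be non-empty.
    Longer sources are tried first, so "|\\/|" wins over a bare "|".
    """
    ordered = sorted(rules.items(), key=lambda kv: len(kv[0]), reverse=True)

    def fold(text):
        out = []
        i = 0
        while i < len(text):
            for src, dst in ordered:
                if text.startswith(src, i):
                    out.append(dst)
                    i += len(src)
                    break
            else:
                # No rule matches at this position: copy the character.
                out.append(text[i])
                i += 1
        return "".join(out)

    return fold


# Hypothetical starting table (illustrative only).
BASE_RULES = {"0": "O", "|": "I"}

# Tailoring: copy the base table and add a newly observed lookalike.
TAILORED_RULES = dict(BASE_RULES)
TAILORED_RULES["|\\/|"] = "M"

fold = make_folder(TAILORED_RULES)

if __name__ == "__main__":
    print(fold("SPA|\\/|"))
```

Because the rules live in ordinary data rather than in a published
standard, a new lookalike can be handled by adding one table entry and
rebuilding the folder.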

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Thu Jul 08 2004 - 16:37:22 CDT