Diacritic and similar foldings and spam filtering

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jul 08 2004 - 16:35:32 CDT


    As Sarasvati points out, the thread "Looking for transcription or
    transliteration standards latin->arabic" had gone way off topic; also I
    understand that some might find António's examples inappropriate. But
    the discussion of diacritic and similar foldings is an important one,
    relevant to Unicode and specifically to the UTR #30 draft. The public
    review period for this has now finished, but in the version put up for
    review, http://www.unicode.org/reports/tr30/tr30-3.html, the data file
    for DiacriticRemoval is still "TBD". Is there in fact now a released
    data file, or a draft, for this folding?

    I made a serious point, not apparently made in the UTR draft, that
    diacritic folding may be useful for spam filtering and similar
    applications, including detecting misleading URIs. António made the
    further serious point that, for more comprehensive spam filtering, an
    enhanced folding might be useful, including such foldings as | > I
    (capital i) and l (small L), 0 (zero) > O, and |\/| > M. Would such
    foldings in fact be feasible and useful? They would have to be part of
    a general "similar shapes" folding. Such a folding would also need to
    handle cases like Cyrillic A and Greek capital alpha > A, because, with
    the whole of Unicode available, spammers could very easily write ЅРАМ
    (Cyrillic) or SΡΑΜ (mostly Greek) instead of SPAM in an attempt to
    defeat spam filtering.
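
    To make the idea concrete, here is a minimal Python sketch of the two
    foldings discussed above. It is only an illustration under stated
    assumptions: the LOOKALIKES table and the function names are
    hypothetical and are not taken from UTR #30, the table covers just the
    characters in the examples, and folding | to I rather than l is an
    arbitrary choice. Stripping combining marks after NFD decomposition is
    only an approximation of diacritic removal, since letters such as ø
    have no decomposition.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (not from UTR #30).
LOOKALIKES = {
    "\u0405": "S",  # CYRILLIC CAPITAL LETTER DZE
    "\u0420": "P",  # CYRILLIC CAPITAL LETTER ER
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u041C": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u03A1": "P",  # GREEK CAPITAL LETTER RHO
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
    "\u039C": "M",  # GREEK CAPITAL LETTER MU
    "0": "O",       # digit zero
    "|": "I",       # vertical bar (could equally fold to small L)
}


def remove_diacritics(text):
    """Approximate diacritic folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_lookalikes(text):
    """Replace each character that has a table entry with its Latin lookalike."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)


def fold(text):
    """Apply diacritic removal first, then the lookalike folding."""
    return fold_lookalikes(remove_diacritics(text))
```

    With this table, fold() maps both the all-Cyrillic and the mostly-Greek
    spellings above to plain SPAM, so a filter can compare folded text
    against a single keyword list.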

    Could something like this be defined within the framework of UTR #30?
    Should it be defined within the UTR? I suspect it would be better left
    to the discretion of individual developers, who could then rapidly
    tailor their foldings to any new lookalikes exploited by spammers.
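
    As a rough sketch of what developer-side tailoring could look like,
    the Python class below keeps the folding rules in an ordinary
    dictionary that can be extended at run time. The class name, its
    methods, and the sample rules are all hypothetical. Matching the
    longest source first is one possible design choice, made here so that
    a multi-character lookalike such as |\/| takes precedence over the
    single | rule.

```python
class TailorableFolder:
    """Folds lookalike sequences to a canonical form; longest match wins."""

    def __init__(self, rules=None):
        self.rules = {}
        for source, target in (rules or {}).items():
            self.add_rule(source, target)

    def add_rule(self, source, target):
        """Register a new folding, e.g. for a newly exploited lookalike."""
        if not source:
            raise ValueError("source sequence must not be empty")
        self.rules[source] = target

    def fold(self, text):
        # Try longer sources first so that "|\/|" beats the single "|".
        sources = sorted(self.rules, key=len, reverse=True)
        out = []
        i = 0
        while i < len(text):
            for src in sources:
                if text.startswith(src, i):
                    out.append(self.rules[src])
                    i += len(src)
                    break
            else:
                # No rule matched at this position: keep the character.
                out.append(text[i])
                i += 1
        return "".join(out)


# Start with a few ASCII lookalikes, then tailor as new tricks appear.
folder = TailorableFolder({"|": "I", "0": "O", "|\\/|": "M"})
before = folder.fold("SP\u0391M")   # Greek alpha is not yet covered
folder.add_rule("\u0391", "A")      # GREEK CAPITAL LETTER ALPHA
after = folder.fold("SP\u0391M")
```

    Because the rules are plain data, they could just as well be loaded
    from a file that is updated independently of the filtering code.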

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

