Re: mapping characters with visual similarities

From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Mar 07 2008 - 17:42:11 CST


    Take a look at http://www.unicode.org/reports/tr36/ and then
    http://www.unicode.org/reports/tr39/

    Mark
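
    The second report linked above describes comparing strings by reducing
    each one to a "skeleton" in which visually confusable characters are
    replaced by a common prototype. The following is only a rough sketch of
    that idea, not the algorithm as specified in the report: it swaps in NFKC
    normalization plus case folding, and the CONFUSABLES table, skeleton() and
    is_blacklisted() are hypothetical names with a handful of hand-picked
    mappings. A real implementation would load the published confusables data
    instead.

```python
import unicodedata

# Hypothetical, hand-picked mappings for illustration only.
# Real data would come from the confusables file published with the report.
CONFUSABLES = {
    "\u043C": "m",  # CYRILLIC SMALL LETTER EM
    "\u041C": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u039C": "M",  # GREEK CAPITAL LETTER MU
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
}

# Sample entries taken from the quoted message.
BLACKLIST = {"microsoft", "wal-mart", "apple"}


def skeleton(text):
    """Reduce text to a comparison key (simplified; an assumption, see above)."""
    # NFKC folds compatibility forms such as fullwidth Latin to plain Latin.
    normalized = unicodedata.normalize("NFKC", text)
    # Replace each known lookalike with its Latin prototype.
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in normalized)
    # Case folding makes the comparison case-insensitive.
    return mapped.casefold()


BLACKLIST_SKELETONS = {skeleton(word) for word in BLACKLIST}


def is_blacklisted(word):
    """Return True if the word's skeleton matches a blacklisted skeleton."""
    return skeleton(word) in BLACKLIST_SKELETONS
```

    With this table, is_blacklisted("\u043C\u0456\u0441r\u043Esoft") matches
    'microsoft' even though five of its letters are Cyrillic. Multi-character
    lookalikes and font-dependent cases are not covered by a one-to-one table
    like this.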

    On Fri, Mar 7, 2008 at 2:30 PM, Chris Weber (Casaba Security) <chris@casabasecurity.com> wrote:

    > Hi group, I thought this might be the right place to ask this question,
    > and I apologize if it has been answered in the past.
    >
    > How can I blacklist a large set of words from a wordlist when all Unicode
    > blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)? The scenario
    > would be a web application written in .NET supporting UTF-8. It consumes a
    > string of input, then compares the string against a wordlist of
    > disallowed, or blacklisted, words.
    >
    > Background:
    > A sample of the blacklist includes profanity and trademark names like:
    > - microsoft
    > - wal-mart
    > - apple
    >
    > The word 'microsoft' in its UCN form would be:
    > \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
    >
    > The core of the problem seems to be that any one of these letters can be
    > visually represented using another code point. For example, here are just
    > some of the different ways the letter 'm' can be visually represented:
    >
    > м \u043C
    > М \u041C
    > Ｍ \uFF2D
    > ｍ \uFF4D
    > ʍ \u028D
    > Μ \u039C
    >
    > So the ideal solution might map every letter against all possible visual
    > representations of that letter. I know that's a really tricky business, as
    > even something like 'fn' might look like an 'm' in some fonts, and two
    > v's ('vv') could look like a 'w'. Fonts play a part in this of course,
    > and the problem starts to look unsolvable.
    >
    >
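
    To make the list of 'm' lookalikes in the quoted message concrete, here is
    a small sketch using Python's standard unicodedata module. It reports each
    character's code point and Unicode name, and whether NFKC normalization
    alone turns it into ASCII. The describe() helper and CANDIDATES list are
    made up for illustration.

```python
import unicodedata

# The UCN escapes from the quoted message spell the plain ASCII word.
assert "\u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074" == "microsoft"

# Code points listed in the quoted message as lookalikes for 'm'.
CANDIDATES = ["\u043C", "\u041C", "\uFF2D", "\uFF4D", "\u028D", "\u039C"]


def describe(ch):
    """Return (code point, Unicode name, whether NFKC yields ASCII)."""
    folded = unicodedata.normalize("NFKC", ch)
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), folded.isascii())


if __name__ == "__main__":
    for ch in CANDIDATES:
        code, name, folds_to_ascii = describe(ch)
        print(f"{code}  {name}  NFKC->ASCII: {folds_to_ascii}")
```

    Only the two fullwidth forms fold to ASCII under NFKC. The Cyrillic,
    Greek and IPA characters are distinct letters, so normalization alone
    leaves them unchanged and a separate lookalike mapping is still needed.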


