mapping characters with visual similarities
From: Chris Weber (Casaba Security) (firstname.lastname@example.org)
Date: Fri Mar 07 2008 - 16:30:30 CST
Hi group,
I thought this might be the right place to ask this question, and I
apologize if it has been answered in the past.
How can I blacklist a large set of words from a wordlist when all Unicode
blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)? The scenario
would be a web application written in .NET supporting UTF-8. It consumes a
string of input, then compares the string against a wordlist of disallowed, or
A sample of the blacklist includes profanity and trademark names like:
The word 'microsoft' in its UCN form would be:
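The UCN spelling of a word can be generated mechanically. A minimal Python sketch follows; the helper name to_ucn is made up for illustration and is not from the original post.

```python
def to_ucn(text: str) -> str:
    # Spell each code point as a universal character name:
    # a backslash, then u + 4 hex digits (BMP) or U + 8 hex digits (beyond).
    parts = []
    for ch in text:
        cp = ord(ch)
        parts.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
    return "".join(parts)


print(to_ucn("microsoft"))
# prints \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
```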
The core of the problem seems to be that any one of these letters can be
glyphically (visually) represented using another code point. For example, just
look at some of the different ways the letter 'm' can be visually represented:
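One partial way to enumerate such variants is to scan every code point and keep those whose NFKC compatibility form, after case folding, is the plain letter. This is only a sketch under that assumption: it finds compatibility variants (fullwidth, mathematical, Roman-numeral forms and the like) but not cross-script look-alikes, and the function name compatibility_variants is hypothetical.

```python
import sys
import unicodedata


def compatibility_variants(letter: str) -> list[str]:
    # Collect every code point that NFKC-normalizes and case-folds to `letter`.
    target = letter.casefold()
    found = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogates, which are not characters
        ch = chr(cp)
        if unicodedata.normalize("NFKC", ch).casefold() == target:
            found.append(ch)
    return found


for ch in compatibility_variants("m"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

The exact list depends on the Unicode version bundled with the Python interpreter.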
So the ideal solution might map every letter against all possible visual
representations of that letter. I know that's really tricky business, as
something like 'fn' might look like an 'm' in some fonts, and two v's ('vv')
could look like a 'w'. Fonts play a part in this, of course, and the
problem starts to look unsolvable.
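A minimal sketch of the mapping idea, in Python rather than .NET for brevity: reduce both the input and each blacklisted word to a common "skeleton" (NFKC normalization, case folding, then a look-alike table) and compare skeletons. The tables CONFUSABLES and SEQUENCES and the function names are assumptions made for illustration. They cover only a handful of Cyrillic letters plus the 'vv' case mentioned above, so this is nowhere near a complete solution.

```python
import unicodedata

# Hypothetical, hand-picked look-alike table: NOT a complete confusables list.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
}

# Multi-character look-alikes; 'vv' for 'w' is the example from the text.
SEQUENCES = {"vv": "w"}


def skeleton(text: str) -> str:
    # NFKC folds compatibility forms such as fullwidth Latin to plain letters.
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Replace single look-alike characters, then look-alike sequences.
    folded = "".join(CONFUSABLES.get(ch, ch) for ch in folded)
    for seq, repl in SEQUENCES.items():
        folded = folded.replace(seq, repl)
    return folded


def find_blacklisted(text: str, blacklist: set[str]) -> list[str]:
    # Return the blacklisted words whose skeleton occurs in the input's skeleton.
    skel = skeleton(text)
    return sorted(word for word in blacklist if skeleton(word) in skel)


blacklist = {"microsoft"}
print(find_blacklisted("\uFF4D\uFF49\uFF43\uFF52\uFF4F\uFF53\uFF4F\uFF46\uFF54", blacklist))
print(find_blacklisted("mi\u0441r\u043Es\u043Eft", blacklist))
```

Both calls report 'microsoft': the first input is fullwidth Latin, the second mixes in Cyrillic letters. Sequence rules such as 'vv' to 'w' also rewrite innocent words, and substring matching flags any word that merely contains a blacklisted one, so false positives are the price of this approach.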