From: Mark Davis (email@example.com)
Date: Fri Mar 07 2008 - 17:42:11 CST
Take a look at http://www.unicode.org/reports/tr36/ and then
On Fri, Mar 7, 2008 at 2:30 PM, Chris Weber (Casaba Security) <
> Hi group, I thought this might be the right place to ask this question;
> apologies if this has been answered in the past.
> How can I blacklist a large set of words from a wordlist when all Unicode
> blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)? The scenario
> would be a web application written in .Net supporting UTF-8. It consumes
> a string of input, then compares the string against a wordlist of
> disallowed, or blacklisted, words.
> A sample of the blacklist includes profanity and trademark names like:
> - microsoft
> - wal-mart
> - apple
> The word 'microsoft' in its UCN form would be:
> The core of the problem seems to be that any one of these letters can be
> visually represented using another code point. For example, look at some
> of the different ways the letter 'm' can be visually represented:
> м \u043C
> М \u041C
> Ｍ \uFF2D
> ｍ \uFF4D
> ʍ \u028D
> Μ \u039C
> So the ideal solution might map every letter against all possible visual
> representations of that letter. I know that's really tricky business, as
> something like 'fn' might look like an 'm' in some fonts, and two v's 'vv'
> could look like a 'w'. Fonts play a part in this of course, and the
> problem starts to look unsolvable.
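
The mapping idea in the quoted message can be sketched roughly as follows. This is only an illustration, written in Python rather than .Net for brevity, and it is not a complete solution. The names CONFUSABLES, BLACKLIST, skeleton and is_blacklisted are made up for this example, and the table holds only the lookalikes for 'm' listed above. A real table would presumably be built from the confusables data that accompanies UTS #39. Multi-character lookalikes such as 'vv' for 'w' are not handled, and only whole-string matches are checked.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (an assumption for this
# sketch). A production table would be far larger.
CONFUSABLES = {
    "\u043C": "m",  # CYRILLIC SMALL LETTER EM
    "\u041C": "m",  # CYRILLIC CAPITAL LETTER EM
    "\u028D": "m",  # LATIN SMALL LETTER TURNED W
    "\u039C": "m",  # GREEK CAPITAL LETTER MU
}

# Sample entries taken from the question.
BLACKLIST = {"microsoft", "wal-mart", "apple"}


def skeleton(text):
    """Reduce text to a rough comparison form."""
    # NFKC folds compatibility characters, e.g. fullwidth U+FF2D and U+FF4D,
    # to their ordinary Latin equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # Replace known lookalikes one character at a time.
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in normalized)
    # Case folding removes upper/lower case differences.
    return mapped.casefold()


# Precompute skeletons so each input needs only one set lookup.
BLACKLIST_SKELETONS = {skeleton(word) for word in BLACKLIST}


def is_blacklisted(text):
    """Return True if the whole input matches a blacklisted word."""
    return skeleton(text) in BLACKLIST_SKELETONS
```

With this table, is_blacklisted("\uFF4Dicrosoft") and is_blacklisted("\u041Cicrosoft") both return True, while an unrelated word such as "banana" returns False. Comparing skeletons rather than raw strings keeps the wordlist itself small, at the cost of occasionally rejecting legitimate words that happen to share a skeleton with a blacklisted one.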