From: Kenneth Whistler (email@example.com)
Date: Wed Oct 04 2006 - 12:49:44 CST
Steve Summit wrote:
> I think what Jefsey was asking about was whether anyone has
> done any real work on what we might call the "next phase" of
> normalization, namely that which considers all pairs and sets
> of likely visually-similar glyphs (what RFC 4690 calls
> "confusables"), across all languages and scripts.
This is not actually the "next phase" of normalization, but
a rather different problem.
Normalization converts Unicode text into a known form that can
be compared reliably for equality under the terms of that
The issue of visual confusability in host names (and IRIs in
general) relates even to such ASCII-derived confusable pairs
as O/0 and I/l/1, which no normalization algorithm is going
to equate without destruction of the interpretation of the text.
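
To make the distinction concrete, here is a minimal sketch, assuming Python's standard unicodedata module. The helper name equal_under is made up for illustration. It shows normalization equating two canonically equivalent spellings of the same text while leaving O/0 and I/l/1 apart.

```python
import unicodedata

def equal_under(form, a, b):
    """Compare two strings after applying the same normalization form.

    `equal_under` is a hypothetical helper, not part of any standard API.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Precomposed U+00E9 versus "e" followed by U+0301 COMBINING ACUTE ACCENT:
# different code point sequences, canonically equivalent text.
composed = "\u00e9"
decomposed = "e\u0301"

print(composed == decomposed)                     # raw comparison
print(equal_under("NFC", composed, decomposed))   # comparison after NFC

# ASCII look-alikes are distinct characters with distinct meanings,
# so no normalization form maps one onto another.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equal_under(form, "O", "0"), equal_under(form, "I", "l"),
          equal_under(form, "l", "1"))
```

The raw comparison fails and the NFC comparison succeeds, while every look-alike comparison fails under all four forms.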
> To cite the simplest example: everyone knows that U+0041 Latin
> Capital Letter A, U+0391 Greek Capital Letter Alpha, and U+0410
> Cyrillic Capital Letter A are likely to be visibly very similar,
> if not identical. But, as far as I know, the existing
> normalization algorithms don't touch them.
Nor should they.
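
The cited example can be checked directly. This is a short sketch, again assuming Python's standard unicodedata module, with unchanged_by_normalization as a made-up helper name. Each of the three letters comes through every normalization form as itself, so the forms never unify them.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# The three look-alike capitals from the quoted example.
LOOKALIKES = ("\u0041", "\u0391", "\u0410")

def unchanged_by_normalization(ch):
    """Return True if `ch` is left as-is by all four normalization forms.

    Hypothetical helper for illustration only.
    """
    return all(unicodedata.normalize(form, ch) == ch for form in FORMS)

for ch in LOOKALIKES:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch),
          unchanged_by_normalization(ch))

# The set of normalized results still holds three distinct characters.
for form in FORMS:
    distinct = {unicodedata.normalize(form, ch) for ch in LOOKALIKES}
    print(form, len(distinct))
```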
> Are there any
> established, respected, comprehensive repositories of such