Re: Unicode and RFC 4690

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 04 2006 - 12:49:44 CST


    Steve Summit wrote:

    > I think what Jefsey was asking about was whether anyone has
    > done any real work on what we might call the "next phase" of
    > normalization, namely that which considers all pairs and sets
    > of likely visually-similar glyphs (what RFC 4690 calls
    > "confusables"), across all languages and scripts.

    This is not actually the "next phase" of normalization, but
    a rather different problem.

    Normalization converts Unicode text into a known form that can
    be compared reliably for equality under the terms of that
    normalization.
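As a minimal sketch of what "compared reliably for equality" means here, the snippet below uses Python's standard `unicodedata` module. The helper name `nfc_equal` and the sample strings are illustrative assumptions, not anything from the thread.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings for equality after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Two encodings of the same text: precomposed e-acute versus
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
composed = "\u00e9"
decomposed = "e\u0301"

raw_equal = composed == decomposed                   # code-point comparison
normalized_equal = nfc_equal(composed, decomposed)   # comparison under NFC
```

The raw comparison treats the two strings as different, while the comparison under NFC treats them as equal. That is the whole job of normalization: it settles on one representation of the same text, and says nothing about what looks alike.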

    The issue of visual confusability in host names (and IRIs in
    general) extends even to confusable sets within ASCII itself,
    such as O/0 and I/l/1, which no normalization algorithm could
    equate without destroying the interpretation of the text.

    > To cite the simplest example: everyone knows that U+0041 Latin
    > Capital Letter A, U+0391 Greek Capital Letter Alpha, and U+0410
    > Cyrillic Capital Letter A are likely to be visibly very similar,
    > if not identical. But, as far as I know, the existing
    > normalization algorithms don't touch them.

    Nor should they.
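A quick check of this point, again assuming Python's standard `unicodedata` module (the names `LOOKALIKES`, `FORMS`, and `distinct_after` are made up for illustration): the three look-alike capitals stay distinct under all four standard normalization forms.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: visually similar, but distinct characters.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]
FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def distinct_after(form: str, chars: list[str]) -> bool:
    """Return True if every character is still distinct after normalization."""
    normalized = {unicodedata.normalize(form, c) for c in chars}
    return len(normalized) == len(chars)


# One boolean per normalization form.
results = {form: distinct_after(form, LOOKALIKES) for form in FORMS}
```

Every entry in `results` should come out `True`, since none of these characters has a canonical or compatibility decomposition onto another. The same holds for the ASCII cases O/0 and I/l/1.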

    > Are there any
    > established, respected, comprehensive repositories of such
    > equivalences?

    http://www.unicode.org/reports/tr39/
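For readers who want a feel for how such a repository is used, here is a rough sketch of a confusability test in the style of the "skeleton" comparison described in that report. The `PROTOTYPES` table is a tiny hand-picked subset assumed for illustration only; a real implementation would load the full confusables data published alongside the report, and the function names are hypothetical.

```python
import unicodedata

# Illustrative subset only (an assumption, not the real data file):
# each character maps to a prototype it may be confused with.
PROTOTYPES = {
    "\u0391": "A",  # Greek capital alpha
    "\u0410": "A",  # Cyrillic capital A
    "0": "O",       # digit zero
    "1": "l",       # digit one
    "I": "l",       # Latin capital I
}


def skeleton(text: str) -> str:
    """Decompose, map each character to its prototype, then decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a: str, b: str) -> bool:
    """Two strings are treated as confusable if their skeletons match."""
    return skeleton(a) == skeleton(b)


latin = "ABC"
mixed = "\u0410BC"  # Cyrillic A followed by Latin B and C
```

With this table, `confusable(latin, mixed)` holds even though `latin == mixed` does not. Note that the skeleton is only a comparison key: it is not a normalized form of the text and should not replace the original string.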

    --Ken


