Re: Unicode and RFC 4690

From: Steve Summit (scs@eskimo.com)
Date: Wed Oct 04 2006 - 11:48:27 CST

    Stephane Bortzmeyer wrote:
    > On Tue, Oct 03, 2006 at 10:22:33PM +0200,
    > Jefsey_Morfin <jefsey@jefsey.com> wrote
    >> RFC 4690 documents a certain number of difficulties resulting of the
    >> choice of Unicode as the reference table of the punycode process.
    >
    > Not at all... The RFC is available here:
    > http://www.ietf.org/rfc/rfc4690.txt
    > and does not discuss the choice of Unicode.

    Well, it does discuss some significant difficulties.
    They're not surprising or new, and they pretty much all revolve
    around normalization. The problem, as I read that RFC, is that
    DNS requires perhaps the highest levels of accuracy, reliability,
    and security when it comes to normalizing and comparing Unicode
    strings, but has relatively little context available to it
    to assist in the various mapping and matching tasks.

    As we know, normalization is a hard problem. The existing
    normalization algorithms go a long way towards solving it, but of
    course they do not solve it completely. However, what is "good
    enough" for some users of the existing normalization algorithms
    is probably not good enough for DNS.
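
    To illustrate the part that the existing algorithms do handle,
    here is a minimal sketch using Python's standard unicodedata
    module. The sample strings are my own choice for illustration,
    not anything taken from the RFC.

```python
import unicodedata

# Two encodings of the same visible character "é".
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A raw code point comparison treats them as different strings.
raw_equal = composed == decomposed

# After canonical normalization (NFC) they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# Compatibility normalization (NFKC) also folds presentation forms,
# for example the "fi" ligature U+FB01 into the two letters "fi".
ligature_folded = unicodedata.normalize("NFKC", "\ufb01")

print(raw_equal, nfc_equal, ligature_folded)
```

    This is the sort of equivalence that is "good enough" for many
    string comparison tasks.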

    I think what Jefsey was asking about was whether anyone has
    done any real work on what we might call the "next phase" of
    normalization, namely that which considers all pairs and sets
    of likely visually-similar glyphs (what RFC 4690 calls
    "confusables"), across all languages and scripts.

    To cite the simplest example: everyone knows that U+0041 Latin
    Capital Letter A, U+0391 Greek Capital Letter Alpha, and U+0410
    Cyrillic Capital Letter A are likely to be visually very similar,
    if not identical. But, as far as I know, the existing
    normalization algorithms don't touch them. Are there any
    established, respected, comprehensive repositories of such
    equivalences? (I've got my own ad-hoc attempt, but I couldn't
    call it "respected", nor even particularly comprehensive.)
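
    The claim about the three capital A's can be checked with a
    short sketch, again using Python's standard unicodedata module.
    The folding table and the fold_lookalikes helper below are
    hypothetical, hand-made illustrations of the kind of ad-hoc
    mapping I mean, not an established repository.

```python
import unicodedata

# The three look-alike capitals from the example above.
LOOKALIKES = {
    "latin": "\u0041",     # LATIN CAPITAL LETTER A
    "greek": "\u0391",     # GREEK CAPITAL LETTER ALPHA
    "cyrillic": "\u0410",  # CYRILLIC CAPITAL LETTER A
}

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Check whether any of the four standard normalization forms changes
# any of the three characters.
unchanged = all(
    unicodedata.normalize(form, ch) == ch
    for ch in LOOKALIKES.values()
    for form in FORMS
)

# Hypothetical ad-hoc folding table: maps look-alikes to one chosen
# representative. A real table would need to cover far more pairs.
CONFUSABLE_FOLD = {
    "\u0391": "A",
    "\u0410": "A",
}

def fold_lookalikes(text):
    """Normalize to NFKC, then replace known look-alikes."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLE_FOLD.get(ch, ch) for ch in normalized)

print(unchanged)
print(fold_lookalikes("\u0391\u0410\u0041"))
```

    If unchanged comes out true, none of the standard forms unify
    the three letters, so any such matching has to come from a
    separate table like the toy one here.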

    [P.S. I don't know nearly as much about the existing
    normalization algorithms as I ought to. Apologies in advance
    if I'm utterly wrong in my statement that they don't address
    the A/Α/А issue.]
