Re: Unicode and RFC 4690

From: Neil Harris (neil@tonal.clara.co.uk)
Date: Thu Oct 05 2006 - 18:51:29 CST


    Jefsey_Morfin wrote:
    > There is a confusion between the need (IDN) and a solution (IETF
    > IDNA). Regulating the need will not correct the shortcomings of the
    > solution. The solution must be foolproof. How? By having all the
    > confusive strings converted into the same ACE ("xn--" ASCII
    > equivalence).
    >
    > Is it possible? Yes. A grapheme is a graphic concept which can be
    > mathematically documented. The problem is that Unicode can assign
    > several numbers to the same concept. So we need a Unicode/grapheme
    > table, built either by comparing the characters' mathematical
    > descriptions through their integrals (graphemes) or by capitalizing
    > on experience, to obtain a table of graphic families of characters
    > (another way to list graphemes).
    >
    > A "super punycode" version will use this table to transcode all the
    > characters of the same family into the same ASCII sequence. This will
    > remove none of the possibilities of the current solution, but it will
    > prevent two different ACEs from being seen in the same way, because
    > only one ACE will be possible. This will not reduce the ability of
    > that ACE to fully support all the existing confusive labels. The
    > disadvantages of not using that "super punycode" function will
    > probably lead to its quick adoption. The drawback is that some
    > existing names may be confused with other names. This is why the need
    > is urgent (there is a limited number of IDNs and not many confusive
    > ones [confusive labels are at higher levels]). If this were a real
    > difficulty, the proposed solution is to use a prefix other than
    > "xn--" (this would also help address another type of problem).
    >
    > jfc
    >
    >

    I've actually written code to try to work out homograph resemblances.
    It's harder than you might think, and brings up a huge range of problems
    related to visual perception.
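
    To make the quoted proposal concrete, here is a minimal sketch of
    family folding ahead of ACE encoding. It is not the code mentioned
    above. The FAMILY table and the names fold, ace and super_ace are
    hypothetical, chosen only for illustration; the table covers five
    Cyrillic/Latin lookalikes, and the nameprep step of real IDNA is
    skipped. Only Python's built-in "punycode" codec is used.

```python
# Hypothetical, hand-picked family table: each key is folded to one
# representative character. A real table would need far more entries,
# and deciding what belongs in it is the hard part.
FAMILY = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
}


def fold(label: str) -> str:
    """Replace every character by its family representative."""
    return "".join(FAMILY.get(ch, ch) for ch in label.lower())


def ace(label: str) -> str:
    """Simplified ACE form: Punycode plus the "xn--" prefix.

    Assumption: nameprep is skipped, so this is not full IDNA ToASCII.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def super_ace(label: str) -> str:
    """Fold to family representatives first, then encode."""
    return ace(fold(label))


latin = "paypal"
mixed = "p\u0430yp\u0430l"  # both "a" letters are Cyrillic U+0430

print(ace(latin) == ace(mixed))              # the plain ACE forms differ
print(super_ace(latin) == super_ace(mixed))  # the folded forms coincide
```

    The folding step is trivial once the table exists. Whether two
    characters belong in the same family depends on font, script and
    reader, and the table cannot settle that.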

    Graphemes are actually rather hard to define, and if anything harder
    to define than characters. Consider the huge differences in
    letterforms between fonts that are in widespread use, and then imagine
    comparing them across hundreds of font variants in dozens of writing
    systems -- and that's before you even start to consider Chinese, where
    confusables can occur for cultural, not visual, reasons. Douglas
    Hofstadter devoted a lot of time to thinking about and demonstrating the
    possibilities of this in his book "Fluid Concepts and Creative Analogies".

    However, there's a bigger issue: even if your approach were to be the
    One True Way for doing IDN, I wouldn't hold your breath waiting for it
    to happen. The IDN process has already taken more than five years,
    there are over twenty live IDN-enabled domains already operating [1],
    and the big three browsers, and increasingly other software, now all
    support the current IDN implementation. Changing it now would make
    turning a supertanker around look easy by comparison. [2]

    -- Neil

    [1] See
    http://www.mozilla.org/projects/security/tld-idn-policy-list.html for a
    partial list of IDN-enabled domains

    [2] not to mention the other exciting technical issues that would be
    involved in ever changing the ACE prefix


