Re: IDN problem.... :(

From: Neil Harris
Date: Mon Feb 14 2005 - 06:25:59 CST

    Asmus Freytag wrote:

    > At 06:29 PM 2/12/2005, Christopher Fynn wrote:
    >> If there were a list of homographs maybe they could be treated as
    >> aliases for the purpose of URLs and domain name registration - so
    >> IRAQ.COM with a Latin Q and IRAQ.COM with a Kurdish Q would point to
    >> the same address.
    >> Registering a name containing a character or characters in the
    >> homograph list would automatically get you all the variants too.
    > We discussed this issue during a break at the UTC last week, and I
    > suggested pretty much the same thing. Rather than a true *homograph*
    > mapping, what's needed is a *confusables folding*.
    > If registration authorities could be convinced to use that to block
    > all 'look-alike' registrations, the playground for phishers would
    > shrink dramatically.
    > Instead of having all variant spellings go to the same domain, it
    > might be cleaner to simply have the incorrect variants fail.
    > For ASCII, the confusables folding would fold 0 and O, 1 and l, as
    > well as l and I, which can be confused in many fonts.
    > For Latin / Greek / Cyrillic the set of confusables includes many true
    > homographs, but there are many more for the other scripts.
    > If someone is interested in creating a draft for such a folding, I'd
    > be happy to maintain it as part of my set of draft foldings for future
    > addition to UTS#30. (Speaking of the latter, once 4.1 is out the door,
    > I'll have the time to release the approved version of that UTS).
    > A./
    Based on some of the work that's been going on at , you might want to
    consider a format something like this, using a "confusion metric":

    00ec LATIN SMALL LETTER I WITH GRAVE spoofs i distance 1
    00ed LATIN SMALL LETTER I WITH ACUTE spoofs i distance 1
    012f LATIN SMALL LETTER I WITH OGONEK spoofs i distance 2
    0131 LATIN SMALL LETTER DOTLESS I spoofs i distance 1
    017f LATIN SMALL LETTER LONG S spoofs f distance 1
    0192 LATIN SMALL LETTER F WITH HOOK spoofs f distance 0
    01b6 LATIN SMALL LETTER Z WITH STROKE spoofs z distance 2
    0225 LATIN SMALL LETTER Z WITH HOOK spoofs z distance 2
    0269 LATIN SMALL LETTER IOTA spoofs i distance 1
    03af GREEK SMALL LETTER IOTA WITH TONOS spoofs i distance 1
    03b1 GREEK SMALL LETTER ALPHA spoofs a distance 2
    03b9 GREEK SMALL LETTER IOTA spoofs i distance 1
    03bd GREEK SMALL LETTER NU spoofs v distance 1
    03bf GREEK SMALL LETTER OMICRON spoofs o distance 0
    03c1 GREEK SMALL LETTER RHO spoofs p distance 2
    03c2 GREEK SMALL LETTER FINAL SIGMA spoofs s distance 2
    03c5 GREEK SMALL LETTER UPSILON spoofs v distance 2
    03f2 GREEK LUNATE SIGMA SYMBOL spoofs c distance 0
    03f3 GREEK LETTER YOT spoofs j distance 0
    0430 CYRILLIC SMALL LETTER A spoofs a distance 0
    0433 CYRILLIC SMALL LETTER GHE spoofs r distance 1
    0435 CYRILLIC SMALL LETTER IE spoofs e distance 0
    043e CYRILLIC SMALL LETTER O spoofs o distance 0
    0440 CYRILLIC SMALL LETTER ER spoofs p distance 0
    0441 CYRILLIC SMALL LETTER ES spoofs c distance 0
    0443 CYRILLIC SMALL LETTER U spoofs y distance 0
    0445 CYRILLIC SMALL LETTER HA spoofs x distance 0
    0455 CYRILLIC SMALL LETTER DZE spoofs s distance 0
    0458 CYRILLIC SMALL LETTER JE spoofs j distance 0
    0461 CYRILLIC SMALL LETTER OMEGA spoofs w distance 1
    0491 CYRILLIC SMALL LETTER GHE WITH UPTURN spoofs r distance 1
    04b3 CYRILLIC SMALL LETTER HA WITH DESCENDER spoofs x distance 0
    04bb CYRILLIC SMALL LETTER SHHA spoofs h distance 0

    where the "confusion distance" in the last column is defined as:

    0 => exact homograph
    1 => almost exact homograph
    2 => easily confused, particularly at small font sizes
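
To make the format concrete, here is a rough sketch of how a registry might read lines in this format and use them as a confusables folding to block look-alike registrations. It is only an illustration under stated assumptions: the names `Spoof`, `parse_table`, `build_folding`, `fold` and `is_blocked` are hypothetical, the sample rows are copied from the table above, and the choice of distance threshold is a policy decision that the table itself does not settle.

```python
import re
from typing import NamedTuple


class Spoof(NamedTuple):
    codepoint: int   # code point of the confusable character
    name: str        # Unicode character name as written in the table
    target: str      # ASCII letter it can be mistaken for
    distance: int    # 0 = exact, 1 = almost exact, 2 = easily confused


# A few rows copied from the table above (not the full list).
SAMPLE = """\
03bf GREEK SMALL LETTER OMICRON spoofs o distance 0
0430 CYRILLIC SMALL LETTER A spoofs a distance 0
0131 LATIN SMALL LETTER DOTLESS I spoofs i distance 1
03b1 GREEK SMALL LETTER ALPHA spoofs a distance 2
"""

# hex code point, character name, "spoofs", target letter, "distance", number
LINE_RE = re.compile(r"^([0-9a-fA-F]{4,6}) (.+) spoofs (\S) distance (\d+)$")


def parse_table(text):
    """Parse 'XXXX NAME spoofs c distance n' lines into Spoof records."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError("unrecognised line: %r" % line)
        cp, name, target, dist = match.groups()
        entries.append(Spoof(int(cp, 16), name, target, int(dist)))
    return entries


def build_folding(entries, max_distance):
    """Map each confusable character to its target, up to a chosen distance."""
    return {chr(e.codepoint): e.target
            for e in entries if e.distance <= max_distance}


def fold(label, folding):
    """Lowercase the label, then replace every confusable with its target."""
    return "".join(folding.get(ch, ch) for ch in label.lower())


def is_blocked(candidate, registered, folding):
    """True if the candidate folds to the same string as a different,
    already registered label (hypothetical registry policy)."""
    folded = fold(candidate, folding)
    return any(candidate != name and fold(name, folding) == folded
               for name in registered)
```

With a threshold of 1, a label spelled with CYRILLIC SMALL LETTER A in place of "a" folds to the same string as the all-ASCII label and would be refused, while GREEK SMALL LETTER ALPHA (distance 2) would only be caught with a threshold of 2.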

    There are clearly many more examples that could be added to this table.
    I have generated some seed sets by parsing the Unicode table cross-reference
    data, but the results are too bulky and contain too many false positives to
    upload at the moment.

    Ideally, these and other data sets can be used to generate pages of candidate
    equivalence sets as HTML, which can then be checked by eye in any
    Unicode-aware Web browser.
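
As a sketch of what such a page generator might look like: the snippet below groups a handful of rows from the table by the letter they spoof and writes one HTML table row per candidate, so each character can be compared by eye against its target. The function names `equivalence_sets` and `render_html` and the page layout are assumptions made for illustration, not an existing tool.

```python
import html
from collections import defaultdict

# (code point, spoofed letter, distance) triples taken from the table above.
ROWS = [
    (0x03BF, "o", 0),  # GREEK SMALL LETTER OMICRON
    (0x043E, "o", 0),  # CYRILLIC SMALL LETTER O
    (0x0430, "a", 0),  # CYRILLIC SMALL LETTER A
    (0x03B1, "a", 2),  # GREEK SMALL LETTER ALPHA
]


def equivalence_sets(rows):
    """Group candidates by the letter they spoof."""
    sets = defaultdict(list)
    for codepoint, target, distance in rows:
        sets[target].append((codepoint, distance))
    return dict(sets)


def render_html(rows):
    """Return an HTML page with one table row per candidate character."""
    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        "<title>Candidate equivalence sets</title></head>",
        '<body><table border="1">',
        "<tr><th>Target</th><th>Candidate</th>"
        "<th>Code point</th><th>Distance</th></tr>",
    ]
    for target, members in sorted(equivalence_sets(rows).items()):
        for codepoint, distance in sorted(members):
            # Numeric character references keep the page itself pure ASCII.
            parts.append(
                "<tr><td>{}</td><td>&#x{:04X};</td>"
                "<td>U+{:04X}</td><td>{}</td></tr>".format(
                    html.escape(target), codepoint, codepoint, distance))
    parts.append("</table></body></html>")
    return "\n".join(parts)
```

Writing the result of `render_html(ROWS)` to a file and opening it in a browser puts each candidate next to its target letter, which is the kind of visual check described above.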

    -- Neil
