nameprep, IDN spoofing and the registries

From: Erik van der Poel (erik@vanderpoel.org)
Date: Mon Feb 21 2005 - 23:18:07 CST

    OK, I think this latest flurry of emails is starting to form a picture
    of What We Must Do (re: nameprep, IDN spoofing and the registries):

    Basically, the work to install filters at the registries, and the work
    to write the next version of the nameprep spec can proceed in parallel,
    pretty much independently.

    As George points out, the registries are going to have to start
    filtering IDN lookalikes; otherwise, they will eventually face lawsuits
    from the "big boys" (as George so delightfully puts it). The ccTLDs will
    have a relatively easy task, while the gTLDs like .com will have the
    difficult task of deciding which subset of Unicode to allow. They will
    also have to go through their database, looking for lookalikes, and
    deleting them or transferring them to new owners, probably paying their
    previous owners back. The registrars might have to be involved in the
    money transaction too. What a mess. I don't envy the gTLDs. Maybe the
    Unicode Consortium could help them out by providing homograph tables.

    One possible approach for the gTLDs is to halt IDN registration now.
    Then they can work on their filters, starting with a small subset of
    Unicode. After reopening IDN registration, they can grow the subset if
    there is enough demand for characters outside the initial subset.
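
A minimal sketch of such a subset filter, in Python. The names INITIAL_SUBSET, EXTENDED_SUBSET, outside_subset and can_register are hypothetical, and the particular characters chosen are only an illustration; the real repertoire would be a policy decision for each registry.

```python
import string

# Hypothetical starting repertoire: letters, digits, hyphen, plus a few
# precomposed accented Latin letters (a-grave, a-umlaut, e-acute, e-grave,
# o-umlaut, u-umlaut). The actual subset is a registry policy choice.
INITIAL_SUBSET = frozenset(
    string.ascii_lowercase + string.digits + "-"
    + "\u00e0\u00e4\u00e9\u00e8\u00f6\u00fc"
)


def outside_subset(label, allowed=INITIAL_SUBSET):
    """Return the sorted characters of `label` that are not in the allowed subset."""
    return sorted({ch for ch in label.lower() if ch not in allowed})


def can_register(label, allowed=INITIAL_SUBSET):
    """Accept a label only if every character is in the allowed subset."""
    return not outside_subset(label, allowed)


# Growing the subset later is a set union (here: a-ring, o-slash, ae),
# so labels accepted under the initial subset remain acceptable.
EXTENDED_SUBSET = INITIAL_SUBSET | frozenset("\u00e5\u00f8\u00e6")
```

Because the extended subset is a superset of the initial one, reopening registration with more characters never invalidates a name that was accepted earlier.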

    If the gTLDs are going to do some serious subsetting, then they will
    probably also need to provide software to the registrars that will map
    users' characters into the subset, e.g. by converting a user's local
    charset to the subset of Unicode. Then again, this might be an area
    where registrars could compete with each other, to provide the most
    friendly software to the end-user (registrant).
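
As a rough illustration of the conversion step, here is a Python sketch that decodes a registrant's input from a local charset into Unicode. The function name to_unicode_label is hypothetical, and the choice of NFC normalization and lowercasing is an assumption about what a registrar tool might do, not something specified anywhere.

```python
import unicodedata


def to_unicode_label(raw, charset):
    """Decode registrant input from its local charset and tidy it up.

    `raw` is the byte string the user typed, `charset` a Python codec name
    such as "koi8_r" or "iso8859_1". Invalid bytes raise UnicodeDecodeError.
    """
    text = raw.decode(charset)
    # Assumption: compose to NFC and lowercase before any registry checks.
    return unicodedata.normalize("NFC", text).lower()


# Example: Latin-1 bytes for a label containing u-umlaut (0xFC).
label = to_unicode_label(b"B\xfccher", "iso8859_1")
```

The resulting string could then be checked against whatever subset the registry allows.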

    On the other side, we have the nameprep spec, and the work required to
    rev it. As John Klensin points out in another email, nameprep will
    eventually have to be updated to include new Unicode characters.
    Nameprep specifies Unicode 3.2, but Unicode itself is already at 4.0.1,
    and may be even further along by the time we finish discussing and
    drafting nameprep bis (new version). Call it nameprep2.
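
To make the version gap concrete: Python's standard stringprep module carries the RFC 3454 tables, which are pinned to Unicode 3.2, so it can report which characters the current nameprep treats as unassigned. The helper name unassigned_in_nameprep is hypothetical; this is only a sketch of how one might look for such characters.

```python
import stringprep


def unassigned_in_nameprep(label):
    """Return the characters of `label` that fall in RFC 3454 table A.1,
    i.e. code points that are unassigned in Unicode 3.2."""
    return [ch for ch in label if stringprep.in_table_a1(ch)]


# U+0221 (LATIN SMALL LETTER D WITH CURL) was added in Unicode 4.0,
# so the 3.2-based table still lists it as unassigned.
newer = unassigned_in_nameprep("a\u0221b")
```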

    Now, one item that is clearly on nameprep2's table is the new version of
    Unicode. Another item that could be considered is the banning of slash
    '/' homographs and others. This type of spoofing was recently discussed
    on the IDN list. Certain Unicode blocks, like the math characters, might
    also be banned instead of mapped as they are now. And I'm sure we would
    discuss mapping or banning the homographs, such as Cyrillic small 'a'. A
    lot of this is likely to be controversial, and some people might suggest
    that we leave the subsetting to the registries, since they have to do it
    anyway. So, instead of shrinking the character set, nameprep2 might just
    grow it (for the new version of Unicode). I don't know. We'll see.
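
The difference between "mapped" and "left alone" can be seen with the compatibility normalization (NFKC) that nameprep applies. The Python sketch below uses the interpreter's current Unicode data rather than Unicode 3.2 and shows only the normalization step, not the full nameprep profile; the constant and function names are made up for the example.

```python
import unicodedata


def nfkc(text):
    """Apply NFKC normalization only (one step of nameprep, not all of it)."""
    return unicodedata.normalize("NFKC", text)


MATH_BOLD_A = "\U0001d41a"     # MATHEMATICAL BOLD SMALL A
CYRILLIC_A = "\u0430"          # CYRILLIC SMALL LETTER A
FRACTION_SLASH = "\u2044"      # FRACTION SLASH

# The math letter is folded to plain ASCII "a" ...
math_is_mapped = nfkc(MATH_BOLD_A) == "a"
# ... while the Cyrillic homograph and the slash lookalike pass through
# unchanged, because they have no compatibility decomposition.
cyrillic_unchanged = nfkc(CYRILLIC_A) == CYRILLIC_A
slash_unchanged = nfkc(FRACTION_SLASH) == FRACTION_SLASH
```

So normalization alone does nothing about homographs such as Cyrillic 'a'; banning or mapping them would need an explicit rule.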

    I'm not sure whether we would need a new ACE prefix if we are only
    adding characters (and not removing any). I'm too tired right now to
    think about it.
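
For reference, the ACE prefix is just a tag in front of the Punycode form of a label, which is why a second prefix could in principle distinguish names encoded under a different mapping. The Python sketch below shows only that mechanism: to_ace is a hypothetical helper, it skips nameprep entirely, and "zq--" is an invented prefix used purely for illustration.

```python
def to_ace(label, prefix="xn--"):
    """Punycode-encode one label and prepend an ACE prefix.

    Assumption: the label is already prepared (no nameprep is applied here).
    ASCII-only labels are returned unchanged.
    """
    if label.isascii():
        return label
    return prefix + label.encode("punycode").decode("ascii")


current = to_ace("b\u00fccher")                       # existing IDNA prefix
hypothetical = to_ace("b\u00fccher", prefix="zq--")   # invented prefix
```

The same Unicode label yields two distinct DNS names, so entries under the old and new prefixes would not collide in a registry database.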

    Erik

    George W Gerrity wrote:
    > The two references below summarise much that has been said about the
    > difficulty of dealing with the internationalisation of Domain Names. Let
    > us agree once and for all:
    >
    > 1. The completely general problem is mathematically */and/*
    > computationally intractable, even if we use fuzzy mapping;
    >
    > 2. The problem is a typical engineering challenge to find a workable
    > solution — future-proofed as much as possible — which is minimally complex;
    >
    > 3. If the engineers (us?) don't solve it, the lawyers will have a
    > field day, the courts will find expensive solutions, the cost of running
    > the web will blow out, and all of us will have mud all over our faces.
    >
    > 4. Now is the time — when there are only a very few registered names
    > with possible clashes — to do it before we */have/* to go through the
    > painful process of unregistering names and upgrading TLD machine codes.
    >
    > So let's sketch out an approach, using <.com.ru> as an example.
    >
    > a) The <.com.ru> registrar only accepts Latin characters for that domain
    > name, or only accepts Cyrillic characters, */no mix/*, and maps the two
    > as equivalent. Case-equivalence mapping */may/* also be allowed, at a
    > cost of more complexity. Let the registrar decide that, and let's be
    > sure that as far as possible, the issuing authority licensing the TLD to
    > the registrar ensures legal protection for these */arbitrary/*, but
    > fixed decisions.
    >
    > b) The first filter selects, for special attention, those name tags whose
    > codes (including diacritics, etc.) are neither all in the Cyrillic block
    > nor all in the Latin block(s).
    >
    > My guess is that at this point, only a few percent will require special
    > attention.
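
A sketch of this first filter in Python. The standard library has no Script property lookup, so the helper below guesses the script from the first word of the Unicode character name; that heuristic, and the names script_of and needs_attention, are assumptions made for the example rather than anything proposed in the thread.

```python
import unicodedata


def script_of(ch):
    """Rough script guess from the character name (heuristic only)."""
    if ch.isascii() and (ch.isdigit() or ch == "-"):
        return "COMMON"        # digits and hyphen belong to no script
    if unicodedata.combining(ch):
        return "INHERITED"     # combining marks follow their base letter
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def needs_attention(label):
    """True unless the label is all Latin or all Cyrillic."""
    scripts = {script_of(ch) for ch in label} - {"COMMON", "INHERITED"}
    return not (scripts <= {"LATIN"} or scripts <= {"CYRILLIC"})


flagged = [
    name
    for name in ("example", "\u043f\u0440\u0438\u043c\u0435\u0440", "p\u0430ypal")
    if needs_attention(name)
]
```

Only the mixed label (Latin letters with a Cyrillic 'a') ends up in the flagged list; the pure Latin and pure Cyrillic labels pass straight through.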
    >
    > c) At this point, the <.com.ru> registrar will need to exercise some
    > common sense. For instance, it seems unreasonable that this domain
    > should accept codes outside the Latin and Cyrillic code blocks, and if
    > they do, then mixes should be strongly discouraged. Certainly, the use
    > of, say, Hebrew vowel pointing with Latin Codes, while perhaps
    > acceptable in the Israel TLD, should be unacceptable in the Russia TLD. In
    > fact, as a general rule, mixes of diacritics from one code block with
    > code points from another, should never be allowed.
    >
    > Further rules can limit legal sequences of the allowed mixes. For
    > instance, in alphabetic scripts such as Latin and Cyrillic, isolated
    > code points from one script found in another make no sense unless
    > spoofing is intended. Earlier, I suggested that a code-point string of a
    > single script found mixed with strings of other scripts, should be of
    > minimum length 2. One can also limit the number of separate substrings
    > of an alternate script found interspersed with a dominant (national?)
    > script.
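
The two run rules above (minimum run length, and a cap on the number of alternate-script substrings) can be sketched in Python as follows. Script detection again uses the first word of the Unicode character name as a stand-in for the real Script property, non-letters are ignored, and the default limits min_run=2 and max_alt_runs=1 are illustrative assumptions.

```python
import unicodedata
from itertools import groupby


def script_of(ch):
    """Rough script guess from the character name (heuristic only)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def script_runs(label):
    """Split the letters of a label into (script, run_length) pairs."""
    letters = (ch for ch in label if ch.isalpha())
    return [(s, len(list(g))) for s, g in groupby(letters, key=script_of)]


def passes_run_rules(label, min_run=2, max_alt_runs=1):
    runs = script_runs(label)
    totals = {}
    for script, length in runs:
        totals[script] = totals.get(script, 0) + length
    if len(totals) <= 1:
        return True  # single-script labels are not affected by these rules
    # The dominant script is the one with the most letters (first seen on ties).
    dominant = max(totals, key=totals.get)
    alt_runs = [length for script, length in runs if script != dominant]
    # In a mixed label every run must reach the minimum length, and the
    # number of alternate-script runs is capped.
    return len(alt_runs) <= max_alt_runs and all(n >= min_run for _, n in runs)
```

With these defaults a single foreign letter dropped into a word is rejected, while one two-letter run of a second script is still accepted.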
    >
    > These sorts of common-sense rules can be easily implemented and the
    > computational overhead is minimal. Of course, owners of ridiculous trade
    > marks (such as <U+004B U+0049 U+039B>, “KIΛ”, for the brand name of the
    > automobile “KIA”) will disagree, but realism has to intrude somewhere
    > into the free market economy.
    >
    > The problems for universal TLDs (<.com>, <.net>) are far more complex,
    > because they are required to accept all language scripts. At the TLD
    > itself, one can allow a limited number of character strings
    > to be equivalent, including the rule that script mixtures are
    > inadmissible, but maybe case folding will be allowed.
    >
    > Once again, however, application of some judicious sieve filters and
    > rules about how mixed scripts may be composed, can simplify the handling
    > of the name tags. There are also sieve rules that can immediately throw
    > out most inadmissible combinations, such as the string length rule
    > mentioned above. Those strings remaining can be tossed to a human, who
    > will be required to be an expert in orthography (nice new line of
    > business for many on the Unicode list?).
    >
    > Now, it doesn't make sense for these rules to be part of a standard on
    > how to extend Domain names to use scripts other than Latin: they are
    > much better handled as (algorithmic where possible) regulations
    > specified by the authority for a given TLD, or set of TLDs, in the case
    > of the universal TLDs.
    >
    > By using this approach, and starting off with a set of rules that
    > disallow most forms of script mixes, then where appeals to common sense
    > and the wishes of a reasonable number of potential clients suggest a
    > loosening of the rules, this can be done with little disruption to the
    > existing state of affairs.
    >
    > George
    > ------
    >
    > On 22 Feb 2005, at 08:40, Doug Ewell wrote:
    >
    > Hans Aberg <haberg at math dot su dot se> wrote:
    >
    > > The suggestion I made was to use a function to detect confusables by
    > > declaring them equivalent, but retaining the full Unicode character
    > > set for representing the IDNs. If this is used at the registration
    > > level only, the only thing that happens when somebody enters a
    > > confusable is that it is rejected. There is a problem only when an
    > > authority admits parallel, confusable names to be registered.
    >
    >
    > Granted. The problem, as I have said so often, is determining what the
    > set of "confusables" is. Don't just say a/а and o/ο, either; that's
    > only the tip of the iceberg.
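
The registration-time check being discussed can be sketched in Python as below. The CONFUSABLE table is a tiny hand-picked sample made up for this example, which is exactly the hard part: a real table would have to be far larger and agreed upon. The names skeleton and Registry are likewise hypothetical.

```python
# Illustrative sample only: maps a few lookalikes to a representative letter.
CONFUSABLE = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(label):
    """Map each character to its equivalence-class representative."""
    return "".join(CONFUSABLE.get(ch, ch) for ch in label.lower())


class Registry:
    """Stores names as entered, but rejects any new name whose skeleton
    matches an already registered one."""

    def __init__(self):
        self._by_skeleton = {}

    def register(self, label):
        key = skeleton(label)
        if key in self._by_skeleton:
            return False  # confusable with an existing registration
        self._by_skeleton[key] = label
        return True
```

The full Unicode form of each accepted name is kept; the skeleton is used only as a lookup key when a new registration arrives.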
    >
    >
    > On 22 Feb 2005, at 07:03, Erik van der Poel wrote:
    >
    > Hans Aberg wrote:
    >
    > > Sure you can change it: one can make the equivalence classes
    > > smaller whenever one wants.
    >
    >
    > As a mathematician, one might be inclined to think that way. But
    > here, we're not talking about theoretical mathematics. We're talking
    > about network engineering. A totally different way of thinking.
    >
    > You can't just change the mapping whenever you want because there
    > are many (client and server) installations out there that can't all be
    > changed overnight (such a simultaneous switch is known in network
    > parlance as a "flag day").
    >
    > For example, even if a registry were to change their mapping, go
    > through their entire database, and delete the names that are
    > determined to be duplicates (however one might accomplish that),
    > there will be people with the old version of the app, which uses the
    > old mapping, and will not be able to find the name (since it has
    > been deleted).
    >
    > Now, this might be a good thing if the name is an evil spoof, but
    > what about innocent registrations? What if two separate parties have
    > an equally legitimate claim on a particular name? This happens a lot
    > in the ASCII DNS, and basically, whoever got there first (or is
    > willing to pay a lot of money) wins.
    >
    > One way to continue to support these innocent duplicates is to use a
    > different prefix (i.e. something other than xn--) in the new
    > mapping, and keep the old names (with the old prefix) in the
    > database (instead of deleting them). This way, the old clients
    > continue to find the old innocent names.
    >
    > But what about the new clients? Now they will suddenly end up on a
    > different Web site when the user clicks on a link. I suppose the
    > user will just have to update their client, or the domain name owner
    > will have to register a different name and update all the Web pages
    > to point to the different name (assuming that they even have control
    > over *all* of the Web pages that might contain a link to their site).
    >
    > And so on. Do you get it now? You can't just change the mapping
    > "whenever" you want. If you do this at all, you do it as few times
    > as possible.
    >
    > Now, you may point out that we are just getting started with IDN and
    > that not very many names have been registered (and I may even agree
    > with you), but it would still take a while to come up with a better
    > mapping (and reach consensus on it -- shudder), and in the meantime,
    > more names would be registered.
    >
    > And this still would not negate my main point, which is that you
    > can't do this "whenever" you want.
    >
    > Erik


