From: George W Gerrity (firstname.lastname@example.org)
Date: Mon Feb 21 2005 - 20:33:41 CST
The two references below summarise much that has been said about the
difficulty of dealing with the internationalisation of Domain Names.
Let us agree once and for all:
1. The completely general problem is mathematically and computationally
intractable, even if we use fuzzy mapping;
2. The problem is a typical engineering challenge to find a workable
solution -- future-proofed as much as possible -- which is minimally
3. If the engineers (us?) don't solve it, the lawyers will have a field
day, the courts will find expensive solutions, the cost of running
the web will blow out, and all of us will have mud all over our faces.
4. Now is the time -- when there are only a very few registered names
with possible clashes -- to do it, before we have to go through the
painful process of unregistering names and upgrading TLD machine codes.
So let's sketch out an approach, using <.com.ru> as an example.
a) The <.com.ru> registrar accepts only Latin characters for that
domain name, or only Cyrillic characters, no mix, and maps the
two as equivalent. Case-equivalence mapping may also be allowed, at a
cost of more complexity. Let the registrar decide that, and let's be
sure that as far as possible, the issuing authority licensing the TLD
to the registrar ensures legal protection for these arbitrary, but
b) The first filter selects for special attention those name tags whose
codes (including diacritics, etc.) are neither all in the Cyrillic block
nor all in the Latin block(s).
My guess is that at this point, only a few percent will require special
[attention].
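The first filter in (b) might be sketched as follows. This is only a rough illustration, not a registrar's actual procedure: Python's standard library does not expose the Unicode Script property, so the helper rough_script (a hypothetical name) guesses the script from the first word of each character name, and digits and hyphens are simply ignored.

```python
import unicodedata

def rough_script(ch):
    """Crude script guess: the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def first_filter(label):
    """Return 'latin', 'cyrillic', or 'special' for one name tag."""
    scripts = {rough_script(ch) for ch in label
               if not ch.isdigit() and ch != "-"}
    # Generic combining marks ("COMBINING ...") belong to no single
    # script in this crude scheme, so they are ignored here.
    scripts.discard("COMBINING")
    if scripts == {"LATIN"}:
        return "latin"
    if scripts == {"CYRILLIC"}:
        return "cyrillic"
    return "special"  # mixed or other scripts: needs special attention
```

A label such as "p\u0430ypal" (Latin letters with a Cyrillic "а") falls into the "special" bin, while pure Latin or pure Cyrillic labels pass straight through.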
c) At this point, the <.com.ru> registrar will need to exercise some
common sense. For instance, it seems unreasonable that this domain
should accept codes outside the Latin and Cyrillic code blocks, and if
it does, then mixes should be strongly discouraged. Certainly, the use
of, say, Hebrew vowel pointing with Latin codes, while perhaps
acceptable in the Israel TLD, should be unacceptable in the Russia TLD.
In fact, as a general rule, mixes of diacritics from one code block with
code points from another should never be allowed.
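As a rough sketch of the diacritic rule, the check below flags a script-specific combining mark that follows a base character of a different script. It assumes, crudely, that the first word of the Unicode character name identifies the script, and it treats the generic "COMBINING ..." marks as belonging to no script, so they are never flagged. The function names are hypothetical.

```python
import unicodedata

def name_prefix(ch):
    """First word of the Unicode character name, used as a crude script tag."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def has_foreign_diacritic(label):
    """True if a script-specific mark follows a base from another script."""
    base_script = None
    for ch in label:
        if unicodedata.category(ch).startswith("M"):
            mark_script = name_prefix(ch)
            # Generic combining marks are not tied to a script here.
            if mark_script != "COMBINING" and mark_script != base_script:
                return True
        else:
            base_script = name_prefix(ch)
    return False
```

With this sketch, a Latin "a" followed by U+05B8 HEBREW POINT QAMATS is flagged, while the same point on a Hebrew letter is not.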
Further rules can limit legal sequences of the allowed mixes. For
instance, in alphabetic scripts such as Latin and Cyrillic, isolated
code points from one script found in another make no sense unless
spoofing is intended. Earlier, I suggested that a code-point string of
a single script found mixed with strings of other scripts should have a
minimum length of 2. One can also limit the number of separate substrings
of an alternate script found interspersed with a dominant (national?)
[script].
These sorts of common-sense rules can be easily implemented and the
computational overhead is minimal. Of course, owners of ridiculous
trade marks (such as <U+004B U+0049 U+039B>, “KIΛ”, for the brand name
of the automobile “KIA”) will disagree, but realism has to intrude
somewhere into the free market economy.
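To show how cheap such rules are, here is a minimal sketch of the two sequence rules: a minimum run length of 2 in mixed labels, and a cap on the number of alternate-script substrings. The names script_runs and passes_run_rules and the default max_alt_runs=1 are my own illustrative choices. The script is again guessed crudely from the character name, and the sketch assumes labels made of letters only.

```python
import unicodedata
from itertools import groupby

def rough_script(ch):
    """Crude script guess: the first word of the Unicode character name."""
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def script_runs(label):
    """Split a label into (script, substring) runs of one script each."""
    return [(s, "".join(g)) for s, g in groupby(label, key=rough_script)]

def passes_run_rules(label, min_run=2, max_alt_runs=1):
    runs = script_runs(label)
    if len(runs) <= 1:
        return True  # empty or single-script label: nothing to check
    # Rule 1: in a mixed label, every run must have at least min_run code points.
    if any(len(text) < min_run for _, text in runs):
        return False
    # Rule 2: limit the runs that are not in the dominant script,
    # taken here to be the script with the most code points.
    totals = {}
    for script, text in runs:
        totals[script] = totals.get(script, 0) + len(text)
    dominant = max(totals, key=totals.get)
    alt_runs = [text for script, text in runs if script != dominant]
    return len(alt_runs) <= max_alt_runs
```

Under these rules "KI\u039B" is rejected because its Greek run has length 1, while "KIA" passes.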
The problems for universal TLDs (<.com>, <.net>) are far more complex,
because they are required to accept all language scripts. At the TLD
itself, one can allow a limited, finite number of character strings
to be equivalent, including the rule that script mixtures are
inadmissible, but maybe case folding will be allowed.
Once again, however, application of some judicious sieve filters and
rules about how mixed scripts may be composed, can simplify the
handling of the name tags. There are also sieve rules that can
immediately throw out most inadmissible combinations, such as the
string length rule mentioned above. Those strings remaining can be
tossed to a human, who will be required to be an expert in orthography
(nice new line of business for many on the Unicode list?).
Now, it doesn't make sense for these rules to be part of a standard on
how to extend Domain names to use scripts other than Latin: they are
much better handled as (algorithmic where possible) regulations
specified by the authority for a given TLD, or set of TLDs, in the
case of the universal TLDs.
By using this approach, and starting off with a set of rules that
disallow most forms of script mixes, the rules can later be loosened
with little disruption to the existing state of affairs, wherever
appeals to common sense and the wishes of a reasonable number of
potential clients suggest it.
On 22 Feb 2005, at 08:40, Doug Ewell wrote:
> Hans Aberg <haberg at math dot su dot se> wrote:
>> The suggestion I made, was to use a function to detect confusables by
>> declaring them equivalent, but retaining the full Unicode character
>> set for representing the IDN's. If this is used at the registration
>> level only, the only thing that happens when somebody enters a
>> confusable, is that it is rejected. There is a problem only when an
>> authority admits parallel, confusable names to be registered.
> Granted. The problem, as I have said so often, is determining what the
> set of "confusables" is. Don't just say a/а and o/ο, either; that's
> only the tip of the iceberg.
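The registration-level check Hans describes could look roughly like the sketch below. The two-entry table is purely illustrative: it holds only the a/а and o/ο pairs named above, and, as Doug says, the real set of confusables is far larger and is the hard part. The names skeleton and Registry are hypothetical.

```python
# Illustrative only: the two pairs mentioned in the thread, not a real table.
CONFUSABLE_TO_CANONICAL = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}

def skeleton(label):
    """Map each character to the representative of its equivalence class."""
    return "".join(CONFUSABLE_TO_CANONICAL.get(ch, ch) for ch in label)

class Registry:
    """Keeps the full Unicode label, but rejects confusable duplicates."""

    def __init__(self):
        self._by_skeleton = {}

    def register(self, label):
        key = skeleton(label)
        if key in self._by_skeleton:
            return False  # confusable with an existing name: rejected
        self._by_skeleton[key] = label  # the label itself is stored unchanged
        return True
```

Registering "paypal" and then "p\u0430ypal" (with a Cyrillic "а") accepts the first and rejects the second, and the stored name keeps its original code points.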
On 22 Feb 2005, at 07:03, Erik van der Poel wrote:
> Hans Aberg wrote:
>> Sure you can change it: One can make the equivalence classes smaller,
>> whenever one wants.
> As a mathematician, one might be inclined to think that way. But here,
> we're not talking about theoretical mathematics. We're talking about
> network engineering. A totally different way of thinking.
> You can't just change the mapping whenever you want because there are
> many (client and server) installations out there that can't be changed
> overnight (what is known in network parlance as a "flag day").
> For example, even if a registry were to change their mapping, go
> through their entire database, and delete the names that are
> determined to be duplicates (however one might accomplish that), there
> will be people with the old version of the app, which uses the old
> mapping, and will not be able to find the name (since it has been
> [deleted]).
> Now, this might be a good thing if the name is an evil spoof, but what
> about innocent registrations? What if two separate parties have an
> equally legitimate claim on a particular name? This happens a lot in
> the ASCII DNS, and basically, whoever got there first (or is willing
> to pay a lot of money) wins.
> One way to continue to support these innocent duplicates is to use a
> different prefix (i.e. something other than xn--) in the new mapping,
> and keep the old names (with the old prefix) in the database (instead
> of deleting them). This way, the old clients continue to find the old
> innocent names.
> But what about the new clients? Now they will suddenly end up on a
> different Web site when the user clicks on a link. I suppose the user
> will just have to update their client, or the domain name owner will
> have to register a different name and update all the Web pages to
> point to the different name (assuming that they even have control over
> *all* of the Web pages that might contain a link to their site).
> And so on. Do you get it now? You can't just change the mapping
> "whenever" you want. If you do this at all, you do it as few times as
> [possible].
> Now, you may point out that we are just getting started with IDN and
> that not very many names have been registered (and I may even agree
> with you), but it would still take a while to come up with a better
> mapping (and reach consensus on it -- shudder), and in the meantime,
> more names would be registered.
> And this still would not negate my main point, which is that you can't
> do this "whenever" you want.
This archive was generated by hypermail 2.1.5 : Mon Feb 21 2005 - 20:34:40 CST