Comments on Unicode Consortium's Draft UTR #36 from Felix Sasaki on 2005-06-02

From: Felix Sasaki (W3)
Date: Thu, 02 Jun 2005 18:22:44 +0900
Message-ID: <429ECFE4.4000901@w3.org>
All,

these are my comments on the Unicode Consortium's Draft UTR#36 document, 
Revision 1.16 (2005/05/09). They are posted here since this document is in 
the review radar of the i18n-core wg and apparently I'm not allowed to 
post to public-i18n-core@w3.org.

All in all it is a good document (modulo certain recommendations, IMHO; 
I'll address those later), but its structure is sometimes not respected: 
for instance, though there's a whole section on IDNs (2.1), IDN issues keep 
popping up through the rest of the doc. I am sorry if the following list 
is something of a mixture of core issues and editorial nits:

Section 1, 4th paragraph: "; and according to what you see it is". Is 
part of the sentence missing there?

Section 1, 8th paragraph: "While some browsers prevent this spoof by 
lowercasing domain names, but others don't". I am not a native speaker, 
but I guess it should be "domain names, others don't".

Section 2.1, 2nd paragraph: It's not actually about IDNs so it shouldn't 
be placed here. Maybe directly under Section 2.

Section 2.1, 3rd paragraph: "using a process called compatibility 
normalization (NFKC)". I guess that a direct reference to RFC 3491 
(Nameprep) would be better placed here, since Nameprep = NFKC + a little 
bit of something else.

Section 2.1, 4th paragraph: ", while the IDNA column shows the IDNA format 
used to represent the string internally in International Domain Names". 
First, the term IDNA is introduced here for the first time without further 
explanation. Second, the column is actually called "IDN Internal", which 
is an unfortunate name; I was expecting the term ACE ("ASCII Compatible 
Encoding") to appear somewhere here. The term "International Domain Names" 
is somewhat unfortunate as well (all domain names are an international good 
;-); the correct term is "Internationalized". My proposal for this whole 
sentence is thus: ", while the ACE ("ASCII Compatible Encoding") column 
shows the result of applying the ToASCII() operation (cf. RFC 3490) to the 
original IDN, which is the way this IDN is stored and queried in the DNS".

Section 2.1, 7th paragraph: "The IDN processing also removes case 
distinctions by performing a case folding to reduce characters to a 
lowercase form. [...] That means that we can focus on just the lowercase 
characters". While I don't know whether it is relevant for the 
conclusion "we can focus on just lowercase", two remarks need to be made:
* First, the IDNA operation ToASCII() will map to lowercase iff the label 
contains some non-ASCII character. Thus ToASCII("DENIC.DE") = "DENIC.DE", 
because it is all ASCII: the IDN processing leaves the string unchanged.
* Second, domain names are case insensitive, but RFC 1034 and 1035, as 
clarified by 
http://www.ietf.org/internet-drafts/draft-ietf-dnsext-insensitive-05.txt, 
introduce the concept of case preservation. To put it plainly: if I query 
the DNS for "WWW.DENIC.DE", and the DNS contains information for 
"www.denic.de", I will get exactly that information delivered, but the 
answer will be titled "WWW.DENIC.DE".
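
The first remark can be checked with a small sketch. It assumes Python's 
built-in "idna" codec, which applies the RFC 3490 ToASCII() operation 
label by label; the helper name to_ace and the sample name "BÜCHER.DE" 
are only illustrations and appear in neither the draft nor the RFC.

```python
# Sketch: ToASCII() leaves all-ASCII labels untouched, including their case.
# Assumption: Python's built-in "idna" codec (IDNA as defined in RFC 3490).

def to_ace(domain: str) -> bytes:
    """Apply ToASCII() to each label of a domain name (hypothetical helper)."""
    return domain.encode("idna")

# Every label is ASCII, so nothing is lowercased.
print(to_ace("DENIC.DE"))

# Only the non-ASCII label is case-folded and converted to its ACE form;
# the all-ASCII label "DE" keeps its case.
print(to_ace("BÜCHER.DE"))
```

The second call shows the "iff" at work: lowercasing happens only inside 
the label that contains a non-ASCII character.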

Section 2.1, 9th paragraph: "two domain names would need to be 
registered". It's a little unclear what is meant: Why would that be 
needed? By whom should they be registered? Since this is not a technical 
issue, I'd leave this note to the recommendations for the user 
(where it can already be found: 2.10.1.B).

Section 2.1, 9th paragraph: The word "registry" appears for the first time 
without further introduction. For somebody unfamiliar with domain names 
and the ICANN terminology, it may be unclear. I'd drop the sentence 
anyway, because the statement "a registry may want to pay attention 
to this" is more confusing than clarifying.

Section 2.1, 10th paragraph: s/international domain 
names/internationalized domain names/

Section 2.1, 10th paragraph: "the registry can easily determine if a 
proposed registration conflicts". I'd gently drop the qualifier "easily": 
given an input label of 63 characters (the maximal length of a label), 
each of which could be the source of an entry in the "confusables" table, 
and assuming that there's always only a single target for the same 
input (is that always the case?), the potential 2^63 lookups in 
the registration database, to be done in real time in order to work out a 
possible conflict, require more computing power than most of the world's 
domain registries can afford today.
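
To make the arithmetic concrete, here is a rough sketch of the brute-force 
enumeration described above. The CONFUSABLES table and the function name 
variants are made up for illustration (one look-alike per character, as 
assumed in the comment); this is not how the draft or any registry 
necessarily does it.

```python
from itertools import product

# Hypothetical confusables table: each character has exactly one look-alike.
CONFUSABLES = {"1": "l", "l": "1", "o": "0", "0": "o"}

def variants(label: str, table: dict[str, str] = CONFUSABLES) -> list[str]:
    """Enumerate every label obtained by swapping characters for look-alikes."""
    # Each position offers two choices if it is confusable, one otherwise.
    choices = [(ch, table[ch]) if ch in table else (ch,) for ch in label]
    return ["".join(combo) for combo in product(*choices)]

# Three confusable positions give 2**3 candidate labels to look up.
print(len(variants("lol")))

# Worst case for a 63-character label in which every position is confusable.
print(2 ** 63)
```

Every additional confusable position doubles the number of candidate 
labels, which is where the 2^63 figure comes from.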

Section 2.1, 11th paragraph: I'd add a fourth bullet "Due to the 
decentralized nature of DNS, registries do not control subdomains being 
established beyond the domain name registered". This fact is relevant. 
Together with problems like the one described in RFC 1535 (and God knows 
how many more to come) this issue could open the door to new kinds of scams.

Section 2.5, 1st example: "to pretend to be a subdomain in" is not 
correct. Better: "to pretend to be a URL under the domain"

Section 2.5, 1st paragraph after the example: "are disallowed by 
StringPrep". Stringprep (no capital P) is introduced for the first time 
without explaining in which way it is relevant to the IDNA standard. I'd 
actually like to stick to a reference to Nameprep (as introduced before), 
which -although just a profile of Stringprep- is directly relevant to 
domain names.

Section 2.5, last but one paragraph: "to always visually distinguish the 
second-level domain". That's a common gotcha: some registries actually 
register at the third level (greetings to my nominet.org.uk colleagues 
from here :-), and there's no rule that forbids a TLD from registering at 
the fourth, fifth, ... level. You just can't carve the second level in stone.
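
A tiny sketch of the gotcha, assuming a user agent that simply takes the 
last two labels as "the" registered domain. The function name 
naive_second_level and the host "www.example.org.uk" are hypothetical.

```python
def naive_second_level(hostname: str) -> str:
    """Return the last two labels, assuming registration at the second level."""
    return ".".join(hostname.split(".")[-2:])

# Works where registrations happen directly under the TLD.
print(naive_second_level("www.denic.de"))

# Under a registry that registers at the third level, the same rule picks
# out the registry's own suffix rather than the registrant's domain.
print(naive_second_level("www.example.org.uk"))
```

Highlighting the result of such a rule would emphasise "org.uk" for every 
name under it, which tells the user nothing about who runs the site.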

Section 2.8: This section is itself very difficult to understand for a 
non-native speaker. Since I didn't get it, I can't make any suggestion for 
improvement. There are a lot of pronouns ("this", "both", ...) for 
which I can't unambiguously find the referent.

Section 2.9: The security levels are a good idea, the names are 
problematic though. I wouldn't like to claim that my registry assigns 
domain names at Unicode's "security level minimal", though it's supposed 
to be the second highest in the rank :-). Further: what is the "minimal +" 
or "moderate +" supposed to mean? Please clarify.

Section 2.9, 1st paragraph after the security levels: "characters outside 
of XID_Continue". This can't be understood by non-insiders. Please 
clarify.

Section 2.9, 2nd paragraph after the security levels: That is probably 
well-meant, but I wonder whether those suggestions wouldn't be better left 
to usability experts.

Section 2.10: The recommendations are too domain-centric; I would have 
expected to see recommendations for identifiers here.

Section 2.10.1, point A: s/browsers/browsers, mail clients and software in 
general/

Section 2.10.1, point B: "Use the same IP address for both". This 
recommendation is based on the belief that a registered domain name always 
has an IP address (and promulgates the idea that the Internet is the web), 
but that's not always the case: it could be a domain with only MX records 
(for mail exchange), or even a domain which is blocked at the registry 
(and thus can't be found in the DNS). But even if all domains had an IP 
address and a webserver running, I would find this a bad recommendation: 
maybe I'd like my, let's call them, whole-script confusable domains to point 
to another website with a different message from the original one.

Section labelled "General Programmer Recommendations": incorrectly 
numbered as 2.10.1. Please renumber the following sections, too.

Section 2.10.2, point B.3: "display the domain name with a visually 
highlighted domain name". Unintelligible.

Section 2.10.2, point C.1: "excluding the TLD". Please, don't carve in 
stone that TLDs won't contain characters beyond ASCII in the future.

Section 2.10.2, point D.2: "If the domain has a whole-script confusable, 
verify that both point to the same IP address". While moving this 
requirement from the registry to the user agent would be an improvement 
in line with the end-to-end design principle of the Internet, how 
should that be performed in practice? By the client calculating 2^63 label 
permutations and then issuing that number of DNS queries? That is not 
practicable; also consider the previous comments on 2.10.1.B. Please drop 
this.

Section 2.10.3: Strange. The "User recommendations" in section 2.10.1 give 
the impression that this document is encouraging the user (here: domain 
name registrant) to take responsibility for the protection of their 
trademark rights/IPR/security of their domains/etc. I would embrace that, 
and so did the previous version 2 of UTR#36. But suddenly this new draft 
takes a turn that is inconsistent with itself and includes these new 
points B.2 and B.3. Frankly: I don't think it's the task of a domain 
registry to check whether certain domain names belong to the same 
registrant. Rules which recommend that the domains "111.com" and "lll.com" 
(and "11l.com", and "1l1.com", etc.) should belong to the same person 
weren't followed in ASCII times and are not destined to succeed with the 
advent of IDN. More input from the TLD registry community would be needed 
here.

My 0.02 Euros.

Marcos Sanz
DENIC eG
L2/05-NNN
