L2/09-038 - Mark Davis

IDNAbis - Implementation Questions

In looking at how to implement IDNA2008 (and maintain backward compatibility), we are wrestling with some practical questions that we'd appreciate feedback on.

Scenario

Look at the following scenario, where we have four processes that handle an IRI (perhaps just passing it through), with the final one using it to access the DNS. (We'll use the term IRI whether the domain name labels are in punycode or in Unicode. They aren't necessarily known to be A-Labels or U-Labels at any given point.)

P1 => P2 => P3 => P4 => DNS

Variables and Background

These processes may be within the same system, or they may be passing IRIs across the web (embodied in an HTML5 document, email, XLink, etc.) to other systems or operating systems. For example, P1 could be a web server hosting a web page, P2 could be a search engine indexer, P3 could be a search engine results server, and P4 could be a browser. Or these could all be cooperating processes within a search engine indexer.

There are a lot of variables here:

  • Each of P1..P4 could convert an IRI to punycode before sending it on.
  • For that matter, any of them could convert back from punycode to Unicode for use internally, or pass that Unicode form on (IRIs with Unicode are recommended by the W3C in their protocols).
  • Each of the processes could be doing validity checks to determine whether the domain name is valid or not. Such a check may be partial (as in the current protocol spec, which doesn't require checking CONTEXT or BIDI), or full. (The check for validity is orthogonal to whether the form is Unicode or punycode.)
  • Each of the processes may be on IDNA2003, or on IDNA2008, or on some hybrid for compatibility.
  • For IDNA2008 implementations, each might be on a different version of Unicode.
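As a rough illustration of the first two variables, the following Python sketch converts a domain name between its Unicode and punycode forms. It relies on Python's built-in "idna" codec, which follows IDNA2003 (RFC 3490), so it stands in for only one of the behaviors a process in the chain might have. The function names and the domain name are made up for the example.

```python
# Sketch only: Python's built-in "idna" codec implements IDNA2003,
# so this models just one possible behavior of P1..P4.

def to_punycode(domain: str) -> str:
    """Convert each label of a domain name to its ASCII (xn--) form."""
    return domain.encode("idna").decode("ascii")

def to_unicode(domain: str) -> str:
    """Convert xn-- labels of an ASCII domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")

name = "bücher.example"          # hypothetical domain name
ascii_form = to_punycode(name)
print(ascii_form)                # xn--bcher-kva.example
print(to_unicode(ascii_form))    # bücher.example
```

Any process in the chain could apply either function before passing the IRI on, which is why the form that reaches P4 says little about what earlier processes did.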

Examples: IE6 only handles punycode, and won't do any validity checking. IE7 handles both punycode and Unicode. It checks the punycode, so a valid IDNA2008 IRI with a ZWJ will fail. There are still enough IE6 implementations around that we (and others) need to handle them, and for years to come there will be IE7 implementations around, not to mention other browsers, emailers, word processors, etc. that handle URLs/IRIs based on IDNA2003.
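The ZWJ failure can be reproduced with any IDNA2003-based check of a punycode label. The sketch below is an assumption-laden stand-in, not a model of what IE7 actually does: it uses Python's built-in "idna" codec, whose decoder re-encodes the decoded label under IDNA2003 rules and raises an error if the result differs from the input. Since IDNA2003 maps ZWJ (U+200D) to nothing, a label whose punycode form carries a ZWJ cannot round-trip. The function name and the sample label (a ZWJ following a virama, the kind of context IDNA2008 is designed to permit) are chosen for illustration.

```python
def passes_idna2003_punycode_check(label: str) -> bool:
    """Return True if an ASCII label survives Python's IDNA2003 decoding.

    The "idna" decoder performs the IDNA2003 round-trip check itself and
    raises UnicodeError when the label does not re-encode to the same form.
    """
    try:
        label.encode("ascii").decode("idna")
        return True
    except UnicodeError:
        return False

# Illustrative label: Sinhala letters with ZWJ (U+200D) after a virama.
zwj_text = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"
# Encode with raw punycode (no IDNA2003 mapping), so the ZWJ is preserved.
zwj_label = "xn--" + zwj_text.encode("punycode").decode("ascii")

print(passes_idna2003_punycode_check("xn--bcher-kva"))  # True
print(passes_idna2003_punycode_check(zwj_label))        # False
```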

Note: even if validity checking is done on an IRI, non-registries don't need to include the tests for BIDI or CONTEXT, so there is no guarantee that a punycode form is an A-Label or that a Unicode form is a U-Label.

Questions


1. Suppose that P2 is on Unicode 5.1, and the others are on Unicode 6.0. If P2 does a validity check, then it could prevent a perfectly valid IRI from being correctly looked up. To prevent this problem, does that mean that the best practice is for only P4 to do validity checking? Or should the others do some weaker form of validity checking, like skipping a check for UNASSIGNED?
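To make the trade-off in question 1 concrete, here is a minimal Python sketch of a validity check with an option to skip the UNASSIGNED test. Everything in it is an assumption for illustration: the function name, the `skip_unassigned` flag, the stand-in "other checks" (rejecting control characters and spaces), and the set of code points that the older process is assumed not to know about. It is not the IDNA2008 validity procedure.

```python
import unicodedata

# Hypothetical: code points that P2's older Unicode tables do not yet
# contain. U+0527 is used purely as an example of a later addition.
UNKNOWN_TO_OLD_P2 = frozenset({0x0527})

def check_label(label, unknown=frozenset(), skip_unassigned=False):
    """Partial validity check, parameterized by what this process knows."""
    for ch in label:
        category = unicodedata.category(ch)
        if ord(ch) in unknown or category == "Cn":
            if not skip_unassigned:
                return False   # strict: unassigned (to this process) is invalid
            continue           # lenient: defer the decision to a later process
        if category == "Cc" or category.startswith("Z"):
            return False       # stand-in for the remaining checks
    return True

label = "ab\u0527"
print(check_label(label))                                           # newer process
print(check_label(label, UNKNOWN_TO_OLD_P2))                        # older, strict
print(check_label(label, UNKNOWN_TO_OLD_P2, skip_unassigned=True))  # older, lenient
```

With the strict setting, the older process blocks a label that the newer processes accept; with the lenient setting it passes the label on but still applies the checks that do not depend on its Unicode version.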

2. Suppose P3 is a non-IDNA-aware process, so IRIs should be converted to punycode by P2 before sending. Should one do a validity check in P2? How do we avoid problem #1 in that case?

3. The current protocol spec appears to only require validity checking when converting to punycode. So when an IRI is already in punycode (which could have come from an IDNA2003 application), it might not undergo any checking at all when going from P1 to the DNS; so everything depends on the registry's doing the right thing. Is it best to check anyway, or does that run into problem #1?

4. If P2 accepts an IRI in Unicode and passes it on to P3 in Unicode (never converting to punycode), should it do any validity checking?

5. When a search engine does indexing, it has to map together IRIs that are "equivalent" (resolving to the same logical location). When it provides an IRI to the user for a page, that IRI should go to the indexed page. However, because IDNA2003 and IDNA2008 browsers may go to different places with the same IRI, which do we provide? Trying to detect which browser the user has is clumsy and error-prone.
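The divergence behind question 5 can be seen with a label containing "ß", which IDNA2003 maps to "ss" while IDNA2008 treats it as a distinct character. The Python sketch below compares the two resulting DNS names. The IDNA2003 side uses Python's built-in "idna" codec; the other side is only an approximation of IDNA2008 lookup (raw punycode encoding with no mapping and no validity checks), and both function names and the domain name are made up for the example.

```python
def idna2003_lookup_name(domain: str) -> str:
    """DNS name an IDNA2003 client would look up (built-in codec)."""
    return domain.encode("idna").decode("ascii")

def unmapped_lookup_name(domain: str) -> str:
    """Approximation of an IDNA2008 lookup: punycode with no mapping.

    Assumes the labels are already lowercase and valid; performs no checks.
    """
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)

iri_host = "straße.example"              # hypothetical domain name
print(idna2003_lookup_name(iri_host))    # strasse.example
print(unmapped_lookup_name(iri_host))    # an xn-- name, distinct from the above
```

If the two names are registered to different parties, the same IRI in a search result leads to different pages depending on the client.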