L2/07-185

Source: Mark Davis
Date: Mar 30, 2007 6:14 PM
Subject: Re: IDNAbis compatibility

We had a bit more time to look at IDNAbis compatibility, and here are some better (and hopefully clearer) results. In a large sample of the web, there were about 800,000 cases where an HTML document contained an href="..." whose host name was valid under IDNA2003. We tested those host names to see whether they would also be valid under IDNAbis (based on the current working proposals). About 85% were valid, about 8% more would be valid if IDNAbis were changed to also do case and width folding, and about 6% would still be invalid even if case and width folding were applied. (Width folding here means applying NFKC to just the half-width and full-width characters, mapping them to their normal counterparts.)
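As a rough illustration of the folding steps and how a host name would land in one of the categories below, here is a minimal Python sketch. It is not the code used for the study. The names `width_fold` and `classify` are made up for this example, `passes_idnabis` stands in for whatever validator implements the current IDNAbis proposals (none is assumed to exist here), and two choices are assumptions: identifying half-width and full-width characters by their East Asian Width property (H or F), and using Python's `str.casefold()` for case folding.

```python
import unicodedata
from typing import Callable


def width_fold(label: str) -> str:
    """Apply NFKC only to half-width (H) and full-width (F) characters.

    Assumption: "half-width and full-width characters" is read as
    East_Asian_Width in {H, F}. NFKC is applied one character at a time,
    so other compatibility characters (ligatures, etc.) are left alone.
    """
    return "".join(
        unicodedata.normalize("NFKC", ch)
        if unicodedata.east_asian_width(ch) in ("F", "H")
        else ch
        for ch in label
    )


def classify(host: str, passes_idnabis: Callable[[str], bool]) -> str:
    """Return the first of the disjoint categories A0-A4 that applies.

    passes_idnabis is a hypothetical validator supplied by the caller.
    """
    if passes_idnabis(host):
        return "A0"  # passes as-is
    if passes_idnabis(host.casefold()):
        return "A1"  # passes after case folding
    folded = width_fold(host)
    if passes_idnabis(folded):
        return "A2"  # passes after width folding
    if passes_idnabis(folded.casefold()):
        return "A3"  # passes after width folding, then case folding
    return "A4"      # still fails after all three steps
```

With a toy validator that accepts only lowercase ASCII, "Example" would fall into A1 and a full-width lowercase "ｅｘａｍｐｌｅ" into A2. Because the categories are tested in order, each host name is counted exactly once.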

Here are some more details, where A0-A4 are disjoint categories.

 

Category                                                      Count   Percent
A0: Passes IDNAbis                                          708,760    85.26%
A1: Passes IDNAbis after case folding                        22,714     2.73%
A2: Passes IDNAbis after width folding                       47,312     5.69%
A3: Passes IDNAbis after width folding, then case folding         4     0.00%
A4: Failed to pass IDNAbis after 3 steps                     52,456     6.31%

A5: Passes IDNA = sum(A0-A4)                                831,246   100.00%
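As a quick arithmetic check on the figures above, the short Python snippet below recomputes the total and the percentage shares from the five category counts. The counts are copied from the table; the variable names are only for illustration.

```python
# Category counts as reported in the table.
counts = {
    "A0": 708_760,
    "A1": 22_714,
    "A2": 47_312,
    "A3": 4,
    "A4": 52_456,
}

# The A5 row is the sum of the disjoint categories A0 through A4.
total = sum(counts.values())

# Share of each category, in percent, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)   # 831246
print(shares)
```

The shares for A1, A2 and A3 together come to roughly 8.4%, which matches the "about 8% more" recoverable by case and width folding.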


This differs from some of our previous data, because we are explicitly testing IDNA vs IDNAbis (not just approximating the latter), and also filtering out invalid URLs. I will be out next week, but we'll try to follow up with more of a breakdown of A4.

Mark