From: Erik van der Poel (erik@vanderpoel.org)
Date: Fri Mar 11 2005 - 13:23:59 CST
Hello Andrew,
Thank you for calculating the percentages of characters that changed
their General Category Value from one version of UCD to the next.
Your numbers made me think about it some more, and I'm now wondering
whether stability of the General Category Value is really such a concern
in the area that I'm focussing on currently, i.e. IDNs
(Internationalized Domain Names).
Recently, someone on the IETF IDN mailing list said that it is possible
to create a URI with a Unicode character that looks like a slash (/),
placed in such a way as to trick the user, making them think that they
are accessing a site other than the real one:
http://good.com/lgkjdslkfjsdlkfowiuroewirtlwjflsdkjrewewui.evil.biz/
In the actual spoof, the slash after good.com would be a fake slash, say
U+2215 DIVISION SLASH. The IDNA RFCs allow this character, and many
others, many of which might be dangerous in a similar way.
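To make the trick concrete, here is a minimal Python sketch. It assumes the example URL above with the slash after good.com replaced by U+2215. The naive_host helper is a hypothetical name written only for this illustration; it is not how any particular browser parses URIs.

```python
import unicodedata

FAKE_SLASH = "\u2215"  # DIVISION SLASH, visually close to "/"

# The example URL above, with the slash after good.com replaced by U+2215.
spoof = ("http://good.com" + FAKE_SLASH +
         "lgkjdslkfjsdlkfowiuroewirtlwjflsdkjrewewui.evil.biz/")


def naive_host(url: str) -> str:
    """Return the text between '://' and the first real ASCII slash."""
    rest = url.split("://", 1)[1]
    return rest.split("/", 1)[0]


host = naive_host(spoof)
print(unicodedata.name(FAKE_SLASH))   # DIVISION SLASH
print(host.endswith(".evil.biz"))     # True
```

Because U+2215 is not the ASCII slash, the host part runs all the way to the first real slash, so the name that gets resolved ends in evil.biz even though the display starts with good.com.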
In the long run, applications will probably separate the domain name out
from the URI and try to educate users about security issues, and Mozilla
has already started to do so:
http://www.gerv.net/security/stay-safe/
However, in the short term, this is clearly a concern. Microsoft does
not even support IDNs. (Smart move. Or rather, smart lack of move.)
Mozilla started to support IDNs, but then turned that support off for other
reasons (the paypal.com homograph spoof). Opera checks for dangerous
characters like U+2215.
The IDN Working Group did discuss limiting IDNs to Letters, Digits and
Hyphen, just like RFC 952 did for ASCII. But for some reason, they just
went ahead and let almost all of Unicode in. So now there is some
discussion about limiting IDNs to the Unicode characters categorized as
Letters, Numbers and Marks (or just nonspacing marks), and the Hyphen
(or a very small number of Punctuation characters).
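A rough sketch of such a restriction in Python, assuming the simplest reading of the proposal: every character of a label must have a General Category starting with L, N or M, or be the ASCII hyphen. The function name label_is_allowed is made up for this example, and the result depends on the UCD version bundled with the Python build in use.

```python
import unicodedata


def label_is_allowed(label: str) -> bool:
    """Accept only Letters (L*), Numbers (N*), Marks (M*) and '-'."""
    for ch in label:
        if ch == "-":
            continue
        # The first letter of the two-letter General Category is the major class.
        if unicodedata.category(ch)[0] not in ("L", "N", "M"):
            return False
    return True


print(label_is_allowed("good-example"))   # True
print(label_is_allowed("good\u2215x"))    # False
```

U+2215 DIVISION SLASH has General Category Sm (Symbol, math), so a filter along these lines would reject it. The sketch checks one label at a time and does not handle empty labels, length limits or Nameprep mapping.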
There is some evidence that the IDN Working Group had a concern that the
General Category Value in the UCD was not very stable, and that it might
not be a good idea to base an Internet standard on something like that.
However, I'm now thinking that the real concern is that phishers are
taking advantage of the way that domain names and URIs are presented to
the user and of the way that some Unicode or even ASCII characters look:
As it turns out, Unicode includes U+16C1, a Runic Letter that looks like
the vertical bar (|). This would argue that IDNs should not just be
limited to Unicode's Letter, Number and Mark categories. They should
also disallow certain Unicode blocks, such as the Runic block, *for
now*. Maybe apps will improve their domain name displays to the point
that we feel safe enough to include Runic, but I don't feel we're there yet.
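As a sketch of what combining the two rules might look like in Python: the category test is the Letter/Number/Mark/hyphen idea, and the block test excludes the Runic block, U+16A0..U+16FF. The names BLOCKED_RANGES and char_is_allowed are hypothetical, and the list of blocked ranges is only an illustration, not a proposed policy.

```python
import unicodedata

# Runic block, U+16A0..U+16FF. Other blocks could be added the same way.
BLOCKED_RANGES = [(0x16A0, 0x16FF)]


def in_blocked_range(ch: str) -> bool:
    """True if the character's code point falls in a disallowed block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in BLOCKED_RANGES)


def char_is_allowed(ch: str) -> bool:
    """Letter, Number, Mark or hyphen, and not in a blocked range."""
    if in_blocked_range(ch):
        return False
    return ch == "-" or unicodedata.category(ch)[0] in ("L", "N", "M")


runic = "\u16c1"
print(unicodedata.category(runic))   # Lo
print(char_is_allowed(runic))        # False
print(char_is_allowed("a"))          # True
```

U+16C1 is a Letter (Lo), so the category test alone would let it through; only the block test stops it.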
This email is getting rather long. Better stop now.
Thanks again,
Erik
PS For those that are interested, more info at nameprep.org.
Andrew C. West wrote:
> According to my calculations, the number of characters which changed their
> General Category from one version of Unicode to the next is :
>
> 1.1.5 -> 2.0.14 = 474 (1.384%)
> 2.0.14 -> 2.1.2 = 1 (0.0025%)
> 2.1.2 -> 2.1.5 = 16 (0.0410%)
> 2.1.5 -> 2.1.8 = 18 (0.0462%)
> 2.1.8 -> 2.1.9 = 3 (0.0077%)
> 2.1.9 -> 3.0.0 = 85 (0.2182%)
> 3.0.0 -> 3.0.1 = 0 (0%)
> 3.0.1 -> 3.1.0 = 3 (0.0061%)
> 3.1.0 -> 3.2.0 = 7 (0.0074%)
> 3.2.0 -> 4.0.0 = 16 (0.0168%)
> 4.0.0 -> 4.0.1 = 1 (0.0010%)
> 4.0.1 -> 4.1.0 = 12 (0.0124%)
>
> I don't know what this tells you about the stability of the UCD data though.
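For readers who want to reproduce counts of this kind, here is a minimal Python sketch of one way to compare two versions of UnicodeData.txt. It is not necessarily the method used for the numbers above. It relies on the file format (semicolon-separated fields, code point in field 0, General Category in field 2), treats range markers such as "First"/"Last" entries as single records, and only counts code points present in both versions. The two sample strings are invented records, not real UCD data.

```python
from typing import Dict, Tuple


def parse_categories(text: str) -> Dict[int, str]:
    """Map code point -> General Category from UnicodeData.txt-style text."""
    cats = {}
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cats[int(fields[0], 16)] = fields[2]
    return cats


def changed_categories(old: Dict[int, str],
                       new: Dict[int, str]) -> Dict[int, Tuple[str, str]]:
    """Code points present in both versions whose category differs."""
    return {cp: (old[cp], new[cp])
            for cp in old
            if cp in new and old[cp] != new[cp]}


# Invented sample records, for illustration only.
OLD = "E000;SAMPLE ONE;Lu;0;L;;;;;N;;;;;\nE001;SAMPLE TWO;So;0;ON;;;;;N;;;;;"
NEW = ("E000;SAMPLE ONE;Lu;0;L;;;;;N;;;;;\n"
       "E001;SAMPLE TWO;Sm;0;ON;;;;;N;;;;;\n"
       "E002;SAMPLE THREE;Ll;0;L;;;;;N;;;;;")

changes = changed_categories(parse_categories(OLD), parse_categories(NEW))
print(len(changes))   # 1
```

To run it on real data, read two UnicodeData.txt files into strings in place of OLD and NEW.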