Re: UCD stability

From: Erik van der Poel
Date: Fri Mar 11 2005 - 13:23:59 CST

    Hello Andrew,

    Thank you for calculating the percentages of characters that changed
    their General Category Value from one version of UCD to the next.

    Your numbers made me think about it some more, and I'm now wondering
    whether stability of the General Category Value is really such a concern
    in the area that I'm focussing on currently, i.e. IDNs
    (Internationalized Domain Names).

    Recently, someone on the IETF IDN mailing list said that it is possible
    to create a URI with a Unicode character that looks like a slash (/),
    placed in such a way as to trick the user, making them think that they
    are accessing a site other than the real one:

    In the actual spoof, the slash after would be a fake slash, say
    U+2215 DIVISION SLASH. The IDNA RFCs allow this character along with
    many others, a number of which might be dangerous in a similar way.
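
    To make the trick concrete, here is a rough Python sketch of how a
    fake slash ends up inside the host part of a URI. The host_of helper
    and the two .example names are made up for illustration; the helper
    ignores userinfo and ports and is not a real URI parser.

```python
def host_of(uri):
    """Return everything between '://' and the first real '/', '?' or '#'."""
    rest = uri.split("://", 1)[1]
    for i, ch in enumerate(rest):
        if ch in "/?#":
            return rest[:i]
    return rest

FAKE_SLASH = "\u2215"  # DIVISION SLASH, not the ASCII '/'

# Hypothetical spoof: the user reads "www.trusted.example/login..." but the
# registered domain is really attacker.example.
spoof = ("http://www.trusted.example" + FAKE_SLASH
         + "login.attacker.example/index.html")

host = host_of(spoof)
print(host)
print(FAKE_SLASH in host)
```

    The fake slash does not end the host, so everything up to the first
    real slash is treated as one domain name.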

    In the long run, applications will probably separate the domain name out
    from the URI and try to educate users about security issues, and Mozilla
    has already started to do so:

    However, in the short term, this is clearly a concern. Microsoft does
    not even support IDNs. (Smart move. Or rather, smart lack of move.)
    Mozilla started to support IDNs, but then turned it off for other
    reasons (the homograph spoof). Opera checks for dangerous
    characters like U+2215.

    The IDN Working Group did discuss limiting IDNs to Letters, Digits and
    Hyphen, just like RFC 952 did for ASCII. But for some reason, they just
    went ahead and let almost all of Unicode in. So now there is some
    discussion about limiting IDNs to the Unicode characters that are
    categorized as Letters, Numbers and Marks (or just nonspacing marks),
    and the Hyphen (or a very small number of Punctuation characters).
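
    As a rough illustration of what such a rule might look like, here is
    a Python sketch that checks one label against the Letter, Number and
    Mark categories plus the hyphen. The function name and the exact
    allowed set are my assumptions, not anything from the IDNA RFCs, and
    the answer depends on whichever UCD version ships with the Python
    build.

```python
import unicodedata

ALLOWED_MAJOR = {"L", "N", "M"}   # Letter, Number, Mark (assumed rule)
ALLOWED_EXTRA = {"-"}             # plus the ASCII hyphen

def disallowed_chars(label):
    """Return the characters in one label that the assumed rule rejects."""
    return [
        ch for ch in label
        if ch not in ALLOWED_EXTRA
        and unicodedata.category(ch)[0] not in ALLOWED_MAJOR
    ]

print(disallowed_chars("example-1"))    # plain LDH label
print(disallowed_chars("b\u00fccher"))  # label with a non-ASCII letter
print(disallowed_chars("a\u2215b"))     # label containing DIVISION SLASH
```

    DIVISION SLASH is a Symbol (Sm), so a rule like this would reject it
    without having to list it by hand.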

    There is some evidence that the IDN Working Group had a concern that the
    General Category Value in the UCD was not very stable, and that it might
    not be a good idea to base an Internet standard on something like that.

    However, I'm now thinking that the real concern is that phishers are
    taking advantage of the way that domain names and URIs are presented to
    the user and of the way that some Unicode or even ASCII characters look:

    As it turns out, Unicode includes U+16C1, a Runic Letter that looks like
    the vertical bar (|). This would argue that IDNs should not just be
    limited to Unicode's Letter, Number and Mark categories. They should
    also disallow certain Unicode blocks, such as the Runic block, *for
    now*. Maybe apps will improve their domain name displays to the point
    that we feel safe enough to include Runic, but I don't feel we're there yet.
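
    A small Python sketch of why the category rule alone is not enough.
    The block range is copied by hand (Runic is U+16A0..U+16FF), and the
    is_allowed function and its rule set are only my assumptions about
    what such a filter could look like.

```python
import unicodedata

# Assumed block list; only Runic (U+16A0..U+16FF) is excluded here.
BLOCKED_RANGES = [(0x16A0, 0x16FF)]

def in_blocked_range(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in BLOCKED_RANGES)

def is_allowed(ch):
    """Hyphen, or a Letter/Number/Mark outside the blocked ranges."""
    if ch == "-":
        return True
    if in_blocked_range(ch):
        return False
    return unicodedata.category(ch)[0] in {"L", "N", "M"}

runic_i = "\u16C1"
print(unicodedata.name(runic_i), unicodedata.category(runic_i))
print(is_allowed(runic_i))   # rejected only because of the block list
print(is_allowed("|"))       # the real vertical bar is a Symbol
```

    U+16C1 is a Letter as far as the General Category goes, so it is the
    block check, not the category check, that keeps it out.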

    This email is getting rather long. Better stop now.

    Thanks again,


    PS For those that are interested, more info at

    Andrew C. West wrote:
    > According to my calculations, the number of characters which changed their
    > General Category from one version of Unicode to the next is :
    > 1.1.5 -> 2.0.14 = 474 (1.384%)
    > 2.0.14 -> 2.1.2 = 1 (0.0025%)
    > 2.1.2 -> 2.1.5 = 16 (0.0410%)
    > 2.1.5 -> 2.1.8 = 18 (0.0462%)
    > 2.1.8 -> 2.1.9 = 3 (0.0077%)
    > 2.1.9 -> 3.0.0 = 85 (0.2182%)
    > 3.0.0 -> 3.0.1 = 0 (0%)
    > 3.0.1 -> 3.1.0 = 3 (0.0061%)
    > 3.1.0 -> 3.2.0 = 7 (0.0074%)
    > 3.2.0 -> 4.0.0 = 16 (0.0168%)
    > 4.0.0 -> 4.0.1 = 1 (0.0010%)
    > 4.0.1 -> 4.1.0 = 12 (0.0124%)
    > I don't know what this tells you about the stability of the UCD data though.
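
    For anyone who wants to try a count like this themselves, here is a
    rough Python sketch. It is not Andrew's method, which I have not
    seen. It assumes the semicolon-separated UnicodeData.txt layout
    (code point in field 0, General Category in field 2), compares only
    code points listed in both files, and does not expand the First/Last
    range entries. The sample lines are made up.

```python
def parse_categories(unicode_data_text):
    """Map code point -> General Category from UnicodeData.txt-style text."""
    cats = {}
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        cats[int(fields[0], 16)] = fields[2]
    return cats

def count_category_changes(old_text, new_text):
    """Return (changed, shared) over code points present in both versions."""
    old = parse_categories(old_text)
    new = parse_categories(new_text)
    shared = old.keys() & new.keys()
    changed = sum(1 for cp in shared if old[cp] != new[cp])
    return changed, len(shared)

# Made-up sample data, not real UCD content.
old_sample = "0041;SAMPLE A;Lu\n0042;SAMPLE B;Lu"
new_sample = "0041;SAMPLE A;Lu\n0042;SAMPLE B;Ll\n0043;SAMPLE C;Lu"

changed, shared = count_category_changes(old_sample, new_sample)
print(changed, shared, "%.4f%%" % (100.0 * changed / shared))
```

    Whether the percentage should be taken over the shared code points
    or over all characters in one of the two versions is a choice; the
    sketch uses the shared ones.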
