Re: Unicode abuse

From: Erik van der Poel (
Date: Sun Mar 06 2005 - 18:32:40 CST

    Doug Ewell wrote:
    > Erik van der Poel <erik at vanderpoel dot org> wrote:
    >>I would have to agree that this is not a huge problem, but it is a
    >>pity that the current version of Nameprep allows domain names to be
    >>stored in other formats (e.g. HTML) with various unnecessary
    >>characters coming from hither and yon in this vast Unicode space.
    > Nameprep is a process by which characters are normalized, case-folded,
    > thrown away, and so forth. What control would it have over whether
    > domain names are stored in HTML or any other format?

    Hi Doug,

    Maybe I should show a piece of HTML with an IDN (Internationalized
    Domain Name):

    <a href="http://www.payp&#1072;

    This snippet of HTML is from:

    As you can see, it is possible to use HTML's numeric character
    references inside domain names, and they work. Likewise, it would be
    *possible* to use &#x2102; (double-struck C) even though that just maps
    to regular small 'c' in Nameprep.
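
    A rough sketch of what happens to such references, using only Python's
    standard html and unicodedata modules. It approximates Nameprep's
    mapping with NFKC followed by lowercasing, which is an assumption made
    for illustration and not the full RFC 3491 profile; the helper name
    decode_and_fold is hypothetical.

```python
import html
import unicodedata


def decode_and_fold(fragment: str) -> str:
    """Decode HTML character references, then approximate Nameprep's mapping.

    Assumption: NFKC plus lower() stands in for Nameprep's case folding and
    normalization. The real profile also has prohibited-character and bidi
    checks that are not modelled here.
    """
    decoded = html.unescape(fragment)  # "&#x2102;" becomes U+2102
    return unicodedata.normalize("NFKC", decoded).lower()


# U+2102 DOUBLE-STRUCK CAPITAL C has a compatibility mapping to "C",
# so it ends up as a plain small "c".
print(decode_and_fold("&#x2102;"))

# &#1072; is U+0430 CYRILLIC SMALL LETTER A. It has no compatibility
# mapping, so it survives unchanged and the label is not ASCII "paypal".
print(decode_and_fold("payp&#1072;l") == "paypal")
```

    The first call shows the double-struck C collapsing to an ordinary 'c';
    the second shows that the Cyrillic letter in the snippet above is left
    alone by this kind of mapping.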

    Nameprep itself does not control whether domain names are stored in
    HTML. But the fact is that domain names *do* appear in HTML, and it is
    *possible* to have unnecessary characters like double-struck C in domain
    names in HTML. It may not be likely, but that's not my point. My point
    is that it shouldn't even be possible. Why do we even want to allow such
    garbage in HTML? And it's not HTML's fault, it's Nameprep's. If Nameprep
    had chosen to filter double-struck C out *before* performing Unicode's
    Normalization Form KC, we wouldn't have this "problem" (which, again, is
    not a huge problem). Just kinda yucky. Highly subjective.
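
    To make the "filter before NFKC" idea concrete, here is one possible
    reading of it, again as a sketch. The rule used (reject any single
    character that NFKC would change) is an assumption chosen for
    illustration, not something Nameprep specifies, and the function name
    strict_prepare is hypothetical.

```python
import unicodedata


def strict_prepare(label: str) -> str:
    """Reject characters that NFKC would rewrite, then normalize and fold.

    Assumption: "unnecessary" is interpreted as "changed by NFKC when taken
    on its own". A real profile would need a carefully reviewed table.
    """
    for ch in label:
        if unicodedata.normalize("NFKC", ch) != ch:
            raise ValueError(f"disallowed character U+{ord(ch):04X} in label")
    # Only characters that are already stable under NFKC reach this point.
    return unicodedata.normalize("NFKC", label).lower()


print(strict_prepare("Example"))  # plain ASCII passes and is lowercased

try:
    strict_prepare("\u2102om")  # double-struck C followed by "om"
except ValueError as err:
    print(err)
```

    With a rule like this, the double-struck C is refused outright instead
    of being silently mapped to 'c'.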

