Re: Unicode abuse

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun Mar 06 2005 - 23:10:36 CST

    HTML doesn't seem that relevant. After all, I can also have:

    <a href="http://badplace.com">http://goodplace.com</a>

    What is visible on the page need have nothing to do with the location the
    link actually leads to. If that link is clicked, some user agent will
    eventually have to show http://badplace.com, and at that point it should be
    normalized in appearance. So double-struck C is not really a problem, and it
    wouldn't be anyway unless it looked like something *other* than a C.
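
    To make the normalization point concrete, here is a minimal sketch using
    only Python's standard library. It assumes that NFKC plus lowercasing is a
    close enough stand-in for Nameprep on these two characters; it is not the
    full RFC 3491 profile, and the helper name approx_nameprep and the label
    "&#x2102;om" are made up for illustration.

```python
import html
import unicodedata


def approx_nameprep(label: str) -> str:
    """Hypothetical helper: NFKC + lowercasing, a rough approximation of Nameprep."""
    return unicodedata.normalize("NFKC", label).lower()


# Numeric character references as they might appear inside an href.
double_struck = html.unescape("&#x2102;om")  # U+2102 DOUBLE-STRUCK CAPITAL C
cyrillic = html.unescape("payp&#1072;l")     # U+0430 CYRILLIC SMALL LETTER A

folded = approx_nameprep(double_struck)
spoof = approx_nameprep(cyrillic)

# Double-struck C folds to a plain ASCII 'c', so the label becomes "com".
print(folded)

# The Cyrillic letter survives normalization, so the label is still not "paypal".
print(spoof == "paypal")

# A user agent could show the Punycode form to make the difference visible.
print(spoof.encode("punycode"))
```

    Under these assumptions the double-struck C disappears in normalization,
    while the Cyrillic look-alike does not, which is why the latter is the
    spoofing concern and the former is not.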

    Mark

    ----- Original Message -----
    From: "Erik van der Poel" <erik@vanderpoel.org>
    To: "Doug Ewell" <dewell@adelphia.net>
    Cc: "Unicode Mailing List" <unicode@unicode.org>; "Mark Davis"
    <mark.davis@jtcsv.com>
    Sent: Sunday, March 06, 2005 16:32
    Subject: Re: Unicode abuse

    > Doug Ewell wrote:
    > > Erik van der Poel <erik at vanderpoel dot org> wrote:
    > >
    > >>I would have to agree that this is not a huge problem, but it is a
    > >>pity that the current version of Nameprep allows domain names to be
    > >>stored in other formats (e.g. HTML) with various unnecessary
    > >>characters coming from hither and yon in this vast Unicode space.
    > >
    > > Nameprep is a process by which characters are normalized, case-folded,
    > > thrown away, and so forth. What control would it have over whether
    > > domain names are stored in HTML or any other format?
    >
    > Hi Doug,
    >
    > Maybe I should show a piece of HTML with an IDN (Internationalized
    > Domain Name):
    >
    > <a href="http://www.payp&#1072;l.com/
    >
    > This snippet of HTML is from:
    >
    > http://secunia.com/multiple_browsers_idn_spoofing_test/
    >
    > As you can see, it is possible to use HTML's numeric character
    > references inside domain names, and they work. Likewise, it would be
    > *possible* to use &#x2102; (double-struck C) even though that just maps
    > to regular small 'c' in Nameprep.
    >
    > Nameprep itself does not control whether domain names are stored in
    > HTML. But the fact is that domain names *do* appear in HTML, and it is
    > *possible* to have unnecessary characters like double-struck C in domain
    > names in HTML. It may not be likely, but that's not my point. My point
    > is that it shouldn't even be possible. Why do we even want to allow such
    > garbage in HTML? And it's not HTML's fault, it's Nameprep's. If Nameprep
    > had chosen to filter double-struck C out *before* performing Unicode's
    > Normalization Form KC, we wouldn't have this "problem" (which, again, is
    > not a huge problem). Just kinda yucky. Highly subjective.
    >
    > Erik
    >



    This archive was generated by hypermail 2.1.5 : Mon Mar 07 2005 - 10:14:52 CST