Re: nameprep, IDN spoofing and the registries

From: George W Gerrity (g.gerrity@gwg-associates.com.au)
Date: Tue Feb 22 2005 - 08:10:49 CST


    On 22 Feb 2005, at 16:18, Erik van der Poel wrote:

    > OK, I think this latest flurry of emails is starting to form a picture
    > of What We Must Do (re: nameprep, IDN spoofing and the registries):
    >
    > Basically, the work to install filters at the registries, and the work
    > to write the next version of the nameprep spec can proceed in
    > parallel, pretty much independently.

    Yep.

    > As George points out, the registries are going to have to start
    > filtering IDN lookalikes, otherwise they will eventually face lawsuits
    > from the "big boys" (as George so delightfully puts it). The ccTLDs
    > will have a relatively easy task, while the gTLDs like .com will have
    > the difficult task of deciding which subset of Unicode to allow.

    I think that I suggested that the ccTLDs would decide their own
    encoding for the TLD tag, but I was wrong to separate that from the
    general problem of what acceptable foldings/encodings should be applied
    to all TLDs. In any case, the more difficult problems occur underneath.

    > They will also have to go through their database, looking for
    > lookalikes, and deleting them or transferring them to new owners,
    > probably paying their previous owners back. The registrars might have
    > to be involved in the money transaction too. What a mess. I don't envy
    > the gTLDs. Maybe the Unicode Consortium could help them out by
    > providing homograph tables.

    Yes, but there shouldn't be too many problems, yet. That's why the
    quicker one gets going, the better.

    > One possible approach for the gTLDs is to halt IDN registration now.
    > Then they can work on their filters, starting with a small subset of
    > Unicode. After reopening IDN registration, they can grow the subset if
    > there is enough demand for characters outside the initial subset.

    I don't think that it needs to be halted now. Just give them a quick
    filter in Perl to sieve out names whose code points don't all come from
    one script. The few that get caught can either be delayed until better
    filters come along, or they can be handled on a once-off basis based on
    guidelines that one can put into place pretty quickly.
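
    A minimal sketch of such a sieve, written in Python rather than Perl.
    It assumes that the first word of a character's Unicode name is a good
    enough stand-in for its script, and that ASCII digits and the hyphen
    are script-neutral; both are simplifications, and the helper names are
    hypothetical.

```python
import unicodedata

# Assumption: ASCII digits and the hyphen count as script-neutral in a label.
NEUTRAL = set("0123456789-")

def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def scripts_in(label):
    """Return the set of approximate scripts used by the label's code points."""
    return {script_of(ch) for ch in label if ch not in NEUTRAL}

def is_single_script(label):
    """True if every non-neutral code point comes from the same script."""
    return len(scripts_in(label)) <= 1

# "ex\u0430mple" contains CYRILLIC SMALL LETTER A in place of the Latin "a".
for name in ["example", "ex\u0430mple", "россия", "xml-россия"]:
    verdict = "pass" if is_single_script(name) else "hold for review"
    print(ascii(name), sorted(scripts_in(name)), verdict)
```

    The name-prefix heuristic treats Hiragana, Katakana and CJK ideographs
    as three different scripts, so ordinary Japanese names would be held
    for review; a production filter would more likely use the script
    assignments in the Unicode Character Database.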

    > If the gTLDs are going to do some serious subsetting, then they will
    > probably also need to provide software to the registrars that will map
    > users' characters into the subset. E.g. converting a user's local
    > charset to the subset of Unicode.

    At this point, I took off two hours to download and read the relevant
    RFCs dealing with IDNA, as I haven't been following the TLD
    standardisation process that carefully until now. Having done my
    homework, I still fail to understand why it is up to IDNA to provide
    mappings between local charsets and Unicode. These mappings are already
    available, but are not one-to-one: some local codes have no equivalent
    in Unicode, and vice-versa.
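
    As an illustration of how such a conversion can fail in either
    direction, here is a small sketch using Python's built-in codecs. The
    function names are hypothetical, and the charsets are only examples.

```python
def to_unicode(raw, charset):
    """Decode bytes from a local charset; None if some byte has no mapping."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        return None

def from_unicode(text, charset):
    """Encode text into a local charset; None if some character has no mapping."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None

# A KOI8-R name converts cleanly in both directions.
koi8_bytes = from_unicode("россия", "koi8-r")
assert to_unicode(koi8_bytes, "koi8-r") == "россия"

# Byte 0x81 is unassigned in Python's cp1252 table, so decoding fails.
assert to_unicode(b"\x81", "cp1252") is None

# Cyrillic letters have no representation in Latin-1, so encoding fails.
assert from_unicode("россия", "latin-1") is None
```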

    > Then again, this might be an area where registrars could compete with
    > each other, to provide the most friendly software to the end-user
    > (registrant).

    Any mapping is at the registration level, including subsetting to
    preclude spoofing. We imagine that mapping is many-to-one to yield a
    “canonical name”, which is the one registered. A typical example of
    mapping is the case-folding already applied to ASCII names, which are
    mapped to all lower-case (the “canonical mapping” result, which is used
    in the DNS and in security certificates). Basically, one submits a
    proposed name for registration, and it is either accepted or refused.
    If refused, two reasons can be given: a) the canonical form of the name
    is already taken; or b) the name is not well-formed according to the
    restrictions applied by subsetting algorithms designed to minimise
    spoofing.
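
    The accept/refuse decision described above might be sketched as
    follows. This is only an illustration: canonical() uses NFKC plus case
    folding as a stand-in for the real mapping, the registry is a plain
    set, and the well-formedness test is passed in as a hypothetical
    predicate.

```python
import unicodedata
from enum import Enum

class Outcome(Enum):
    ACCEPTED = "accepted"
    TAKEN = "canonical form already taken"
    NOT_WELL_FORMED = "refused by the subsetting rules"

def canonical(name):
    # Assumption: NFKC plus case folding stands in for the real many-to-one mapping.
    return unicodedata.normalize("NFKC", name).casefold()

def register(name, registry, is_well_formed):
    """Try to register a name; return (outcome, canonical form)."""
    canon = canonical(name)
    if not is_well_formed(canon):
        return Outcome.NOT_WELL_FORMED, canon   # reason (b)
    if canon in registry:
        return Outcome.TAKEN, canon             # reason (a)
    registry.add(canon)
    return Outcome.ACCEPTED, canon

registry = set()
# str.isascii is used here as a deliberately crude subsetting rule.
print(register("Example", registry, str.isascii))
print(register("EXAMPLE", registry, str.isascii))
print(register("ex\u00e4mple", registry, str.isascii))
```

    The second call is refused because "EXAMPLE" and "Example" share one
    canonical form; the third is refused because it fails the subsetting
    rule.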

    It is not apparent to me where the question of user interface design is
    applicable, as all this happens “under the hood”, so to speak.

    > On the other side, we have the nameprep spec, and the work required to
    > rev it. As John Klensin points out in another email, nameprep will
    > eventually have to be updated to include new Unicode characters.
    > Nameprep specifies Unicode 3.2, but Unicode itself is already at
    > 4.0.1, and may be even further along by the time we finish discussing
    > and drafting nameprep bis (new version). Call it nameprep2.

    Nameprep has nothing to do with the type of filtering we are
    discussing. The registration process proceeds as follows:

      ---------------   1    ---------       -------------
     |<Local_Charset>| -->  |<Unicode>| --> |<nameprep_vx>| -->
      ---------------        ---------       -------------

          --------------------       ----------
         |<Subsetting_Filters>| --> |<Punycode>|
          --------------------       ----------

    Point 1 may fail if there is no mapping: not too likely for names
    people will want to use, but may fail for, say, Japanese formal names
    with character variants that are still not completely encoded.

    Besides mapping to the specified Unicode normalisation form,
    <nameprep_vx> may perform case-folding where appropriate. The
    algorithm to perform this operation really should never fail.

    The <Subsetting_Filters> are exactly those filters designed to reduce
    the possible namespace to a tractable size, in which there will be no
    names that are possible spoofs of other names. Some of them may be the
    same for all TLDs, while others will be specific to a given TLD. The
    ones that are the same for all TLDs can migrate into <nameprep_vx> at a
    later date, if necessary.

    Only those names getting through these filters will automatically be
    considered as candidates for registration. Alternatively, the
    <Subsetting_Filters> may have two outputs: those that get through the
    strict rules, and those that get through a coarse sieve but not the
    finer ones. Those that get through the coarse sieve would need to be
    turned over to an expert in orthography for further study before
    allowing registration.
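
    The whole chain, including a two-output filter stage, could be
    sketched like this. Everything here is an assumption made for
    illustration: NFKC plus case folding stands in for <nameprep_vx>, the
    strict rule is "one script only", the coarse rule is "at most two
    scripts", and scripts are guessed from Unicode character names.

```python
import unicodedata

def scripts_in(label):
    """Approximate scripts in a label, ignoring ASCII digits and the hyphen."""
    return {unicodedata.name(ch, "UNKNOWN").split(" ")[0]
            for ch in label if ch not in "0123456789-"}

def nameprep_stand_in(text):
    # Assumption: NFKC plus case folding approximates <nameprep_vx>.
    return unicodedata.normalize("NFKC", text).casefold()

def triage(label):
    """Hypothetical two-level sieve: strict = one script, coarse = two scripts."""
    count = len(scripts_in(label))
    if count <= 1:
        return "register"
    if count == 2:
        return "refer to expert"
    return "reject"

def to_ace(label):
    """Punycode-encode a label, adding the "xn--" prefix only when needed."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

def process(raw, charset):
    """Run one submitted name through every stage of the diagram."""
    text = raw.decode(charset)   # point 1: raises UnicodeDecodeError if unmappable
    label = nameprep_stand_in(text)
    verdict = triage(label)
    ace = to_ace(label) if verdict == "register" else None
    return verdict, ace

print(process(b"B\xfccher", "latin-1"))
print(process("xml-россия".encode("koi8-r"), "koi8-r"))
```

    Only names that pass the strict rule are encoded; the mixed-script
    name is held back for an expert instead of being refused outright.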

    > Now, one item that is clearly on nameprep2's table is the new version
    > of Unicode. Another item that could be considered is the banning of
    > slash '/' homographs and others.

    <nameprep_vx> is not the place to deal with homographs, except perhaps
    for the obvious, such as mapping full- or half-width numerals,
    Cyrillic, Greek, and Latin characters to the equivalent homographs in
    the normal script areas for them.
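
    The width folding mentioned here is what Unicode compatibility
    normalisation (NFKC) already does. A small sketch, with a hypothetical
    function name:

```python
import unicodedata

def fold_width(text):
    """Map full- and half-width compatibility forms to their ordinary code points."""
    return unicodedata.normalize("NFKC", text)

# Full-width "XML123" (U+FF38 U+FF2D U+FF2C U+FF11 U+FF12 U+FF13) becomes ASCII.
assert fold_width("\uff38\uff2d\uff2c\uff11\uff12\uff13") == "XML123"

# Half-width KATAKANA KA (U+FF76) becomes the ordinary KATAKANA KA (U+30AB).
assert fold_width("\uff76") == "\u30ab"

# A cross-script lookalike such as Cyrillic "a" (U+0430) is left untouched.
assert fold_width("\u0430") == "\u0430"
```

    The last assertion shows why lookalikes across scripts cannot be
    handled by normalisation alone.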

    > This type of spoofing was recently discussed on the IDN list. Certain
    > Unicode blocks, like the math characters,

    The Math characters, yes.

    > might also be banned instead of mapped as they are now.

    Or we could wait and put them in the <Subsetting_Filters> for the
    moment.

    > And I'm sure we would discuss mapping or banning the homographs, such
    > as Cyrillic small 'a'.

    No. For the moment, this sort of homograph needs to be included in the
    <Subsetting_Filters> area. Otherwise, we will be precluding any sort of
    mixed-script names. The problem is not just that of identifying
    homographs, but also of determining what is a homograph of what, and
    where to distinguish. Thus, we want to ban Cyrillic homographs in the
    “XML” portion of the mixed name “XML-россия”, and Latin homographs in
    the Cyrillic part of the name. We also want to ban homographs from
    other scripts, such as Greek or Coptic, from either part. We can't do
    that with an all-encompassing algorithm: it needs to be tailored to the
    particular TLD.
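
    One way a TLD-specific rule of this kind might look, as a rough
    sketch: each hyphen-separated part of the name must stay within a
    single script, and that script must be on the TLD's own allowed list.
    Splitting on hyphens, guessing scripts from character names, and the
    function names are all assumptions made for the example.

```python
import unicodedata

def script_of(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]

def check_mixed_label(label, allowed_scripts):
    """Return a list of (part, problem) pairs; an empty list means the label passes."""
    problems = []
    for part in label.split("-"):
        scripts = {script_of(ch) for ch in part if not ch.isdigit()}
        if len(scripts) > 1:
            problems.append((part, "mixes " + ", ".join(sorted(scripts))))
        elif scripts - allowed_scripts:
            problems.append((part, "script not allowed: " + ", ".join(sorted(scripts))))
    return problems

# Hypothetical policy for a TLD that accepts Latin and Cyrillic parts.
policy = {"LATIN", "CYRILLIC"}

print(check_mixed_label("XML-россия", policy))
# Same name, but with CYRILLIC CAPITAL LETTER EM (U+041C) inside the "XML" part.
print(check_mixed_label("X\u041cL-россия", policy))
```

    A different TLD would pass a different allowed set, which is the
    sense in which the rule is local rather than global.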

    > A lot of this is likely to be controversial, and some people might
    > suggest that we leave the subsetting to the registries, since they
    > have to do it anyway. So, instead of shrinking the character set,
    > nameprep2 might just grow it (for the new version of Unicode). I don't
    > know. We'll see.

    The controversial bits belong in the <Subsetting_Filters> component,
    and probably will be local. Those parts that are going to be global
    belong (ultimately) in <nameprep_vx>.

    > I'm not sure whether we would need a new ACE prefix if we are only
    > adding characters (and not removing any). I'm too tired right now to
    > think about it.

    Why? The problem of backward compatibility won't occur for additions,
    and where new rules subset the name space for all TLDs, the pruning of
    previously legitimate names will have to occur, anyway. BTW, it might
    make sense to add a first registration date to a name, like in
    copyright, so that names that are pruned are those registered after the
    original one of the lookalike set.
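
    The first-registration-date idea could be sketched as below. The
    lookalike table is a tiny hypothetical one (five Cyrillic letters
    mapped to their Latin lookalikes), and the names and dates are made up
    purely to exercise the code.

```python
from datetime import date

# Hypothetical lookalike table: Cyrillic a, e, o, er, es -> Latin a, e, o, p, c.
CONFUSABLE = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c",
})

def skeleton(name):
    """Map a name to a key shared by all members of its lookalike set."""
    return name.translate(CONFUSABLE)

def prune(registrations):
    """registrations: list of (name, first_registered). Returns (kept, pruned)."""
    earliest = {}
    for name, when in registrations:
        key = skeleton(name)
        if key not in earliest or when < earliest[key][1]:
            earliest[key] = (name, when)
    kept = sorted(name for name, _ in earliest.values())
    pruned = sorted(name for name, _ in registrations if name not in kept)
    return kept, pruned

# Illustrative data only: the second name uses Cyrillic "a" (U+0430).
registrations = [
    ("example", date(2001, 1, 1)),
    ("ex\u0430mple", date(2005, 2, 1)),
    ("other", date(2003, 6, 30)),
]
kept, pruned = prune(registrations)
print([ascii(n) for n in kept], [ascii(n) for n in pruned])
```

    Within each lookalike set the earliest registration survives and the
    later ones are listed for pruning.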

    George


