RE: Proposed Update Unicode Technical Standard #46 (Unicode IDNA Compatibility Processing)

From: Colosi, John (jcolosi@verisign.com)
Date: Wed Sep 22 2010 - 14:34:19 CDT


    Hi Mark,

     

     

    Thanks for the response. I appreciate your points. If I can summarize, I think the spec creates special rules for registries, different from rules for other kinds of clients.
     
    “By the time a string enters the IDNA registration process … it MUST be … in Normalization Form C. … [Registries] MUST accept only the exact string for which registration is requested, free of any mappings or local adjustments.”
    -- RFC 5891, section 4.1
     

    I think your point is that UTS #46 has a broader scope than just registries, and so it allows the Unicode 6.0 mapping. In fact, its very purpose seems to be to bridge the gap between IDNA2003 and IDNA2008. So, in a trivial sense, a strict reading of IDNA2008 for registries will cause some issues.

     

     

    I’m still confused about the last example. The Punycode sequence “xn--53h” decodes for me to U+2615.
    The mapping for this character appears to be empty:
    2614..2615 ; valid
    And the tables RFC appears to prohibit the point:
    2460..26CD ; DISALLOWED # CIRCLED DIGIT ONE..DISABLED CAR

    Maybe I’m still missing something, but this just doesn’t look like valid input, even if I apply the mapping. Not sure.
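
    For anyone who wants to reproduce the decoding step, here is a minimal sketch. It assumes Python's built-in "punycode" codec; the helper names decode_ace_label and in_range are made up for illustration, and the range is just the one quoted above from the tables RFC, not a full IDNA2008 derivation.

```python
ACE_PREFIX = "xn--"

def decode_ace_label(label: str) -> str:
    """Strip the ACE prefix (if present) and Punycode-decode the remainder."""
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")

def in_range(ch: str, first: int, last: int) -> bool:
    """True if the single character ch lies in the inclusive code point range."""
    return first <= ord(ch) <= last

decoded = decode_ace_label("xn--53h")
print([f"U+{ord(c):04X}" for c in decoded])   # ['U+2615']

# Range quoted in the thread: 2460..26CD ; DISALLOWED
print(in_range(decoded[0], 0x2460, 0x26CD))   # True
```

    This only confirms the two observations in the message: the label decodes to U+2615, and that code point falls inside the quoted DISALLOWED range.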

     

     

    Thanks again,

    -- John

     

     

    John Colosi | Naming Services | VeriSign, Inc.
    703.948.3211 | 703.967.4062 | 703.421.8233

    This message is intended for the use of the individual or entity to
    which it is addressed, and may contain information that is privileged,
    confidential and exempt from disclosure under applicable law. Any
    unauthorized use, distribution, or disclosure is strictly prohibited. If
    you have received this message in error, please notify sender
    immediately and destroy/delete the original transmission.



    From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis
    Sent: Sunday, September 19, 2010 7:27 PM
    To: Colosi, John
    Cc: unicode@unicode.org; UTC; Markus Scherer
    Subject: Re: Proposed Update Unicode Technical Standard #46 (Unicode IDNA Compatibility Processing)

     

    Thanks for checking the data. I'm sorry for not responding earlier; I was on vacation, and am now working through my backlog of email.

     

    Some of the differences are because UTS#46 provides a compatibility 'bridge' between IDNA2003 and IDNA2008. For details of these particular cases, see below.

     

    Note that the current tests do not attempt to be exhaustive, e.g., by including a line for every character with its status (valid or not). Such a test can be written using the main data file at http://unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt.
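
    As a rough sketch of how such a test could start, the snippet below parses lines in the "code point(s) ; status ; mapping # comment" layout and looks up a character. The two sample lines are the ones quoted in this thread; Entry, parse_line and lookup are hypothetical names, and the real data file may carry statuses or fields that this sketch ignores.

```python
from typing import List, NamedTuple, Optional

class Entry(NamedTuple):
    first: int      # first code point of the range
    last: int       # last code point of the range (inclusive)
    status: str     # e.g. "valid", "mapped"
    mapping: str    # mapped string, or "" when no mapping field is present

def parse_line(line: str) -> Optional[Entry]:
    """Parse one data line; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    lo, _, hi = fields[0].partition("..")
    first = int(lo, 16)
    last = int(hi, 16) if hi else first
    mapping = ""
    if len(fields) > 2 and fields[2]:
        mapping = "".join(chr(int(cp, 16)) for cp in fields[2].split())
    return Entry(first, last, fields[1], mapping)

def lookup(entries: List[Entry], ch: str) -> Optional[Entry]:
    """Linear search for the entry covering ch (fine for a small sketch)."""
    cp = ord(ch)
    for entry in entries:
        if entry.first <= cp <= entry.last:
            return entry
    return None

# Sample lines copied from this thread, not the full data file.
SAMPLE = """\
2461 ; mapped ; 0032 # 1.1 CIRCLED DIGIT TWO
2614..2615 ; valid
"""
entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e]
print(lookup(entries, "\u2461").mapping)   # 2
print(lookup(entries, "\u2615").status)    # valid
```

    An exhaustive test would load the whole file instead of SAMPLE and iterate over every code point, ideally with a sorted table and binary search rather than the linear scan used here.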

     

    Other test cases can be added for the future; if you (or others) have suggestions for good test lines, please let us know.


    Mark

    — Il meglio è l’inimico del bene —



    On Thu, Sep 16, 2010 at 14:59, Colosi, John <jcolosi@verisign.com> wrote:

    Hello all,

     

    I represent the VeriSign Domain Name Registry as an implementer of the latest IDNA specifications. The following four (4) questions arose during our implementation of the conformance test.

     

     

    Question 1 of 4

    Line 204

    Input \u0646 \u0627 \u0645 \u0647 \u200C \u0627 \u06CC

    Reference Appendix A.1 of RFC 5892 (Tables) <https://trac.tools.ietf.org/html/rfc5892>

    Issue Per the reference, the ZWNJ (\u200C) must meet one of two qualifications: either it is preceded by a character with the Virama combining class, or the characters in the label follow a certain pattern of joining types. This input does not meet either of these criteria, and appears to be an invalid IDN label with respect to the IDNA 2008 standards. There are ten (10) such lines in the input file.

     

    This is by design. UTS#46 does not have the contextual checks for ZWJ and ZWNJ.

     

    Background: While those are excellent checks to have, and are recommended, they only prevent a small fraction of the homoglyph exploits, so they are not required by UTS#46 and are not tested for in the file. (If you disagree with that approach, you should bring that up to the UTC for the next version of UTS#46.) UTS#46 does allow for implementations to be stricter if desired, so any implementation can apply those IDNA2008 checks.
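
    For implementations that do want to be stricter, a sketch of the ZWNJ contextual rule from RFC 5892 Appendix A.1 might look like the following. One loud assumption: Python's standard library does not expose the Joining_Type property, so the caller has to supply a joining_type function (hypothetical here, typically built from ArabicShaping.txt). The name zwnj_ok is also made up for illustration.

```python
import unicodedata
from typing import Callable

ZWNJ = "\u200C"
VIRAMA_CCC = 9  # numeric Canonical_Combining_Class value for Virama

def zwnj_ok(label: str, joining_type: Callable[[str], str]) -> bool:
    """Check every ZWNJ in label against the two RFC 5892 A.1 conditions.

    joining_type(ch) must return the Joining_Type short name of ch
    ("L", "R", "D", "T", "U", "C"); it is supplied by the caller.
    """
    for i, ch in enumerate(label):
        if ch != ZWNJ:
            continue
        # Condition 1: the preceding character has combining class Virama.
        if i > 0 and unicodedata.combining(label[i - 1]) == VIRAMA_CCC:
            continue
        # Condition 2: (L|D) T* ZWNJ T* (R|D)
        j = i - 1
        while j >= 0 and joining_type(label[j]) == "T":
            j -= 1
        left_ok = j >= 0 and joining_type(label[j]) in ("L", "D")
        k = i + 1
        while k < len(label) and joining_type(label[k]) == "T":
            k += 1
        right_ok = k < len(label) and joining_type(label[k]) in ("R", "D")
        if not (left_ok and right_ok):
            return False
    return True

# Illustration with a stand-in property function that treats every
# character as non-joining, so only the Virama branch can succeed.
non_joining = lambda ch: "U"
print(zwnj_ok("\u0915\u094D\u200C\u0937", non_joining))  # True (ZWNJ follows a virama)
print(zwnj_ok("a\u200Cb", non_joining))                  # False
```

    The result for Arabic-script labels depends entirely on the joining-type data passed in, which is why it is left as a parameter rather than hard-coded.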

     

    Note that we could add a field in the test file that indicated whether the input (or mapped input [see below]) was valid under IDNA2008. Do people think that would be helpful?

     

             

             

            Question 2 of 4

            Line 319

            Input …1234567890123456789012345678901234567890123456789012345678901234…

            Reference Sections 3.1 and 3.5 of RFC 1034 <http://www.ietf.org/rfc/rfc1034.txt>

            Issue Per the reference, DNS labels cannot contain more than 63 octets. It appears that this is a purposeful test, since the first label is exactly 63 octets, and the second label is 64 octets. This does not apply to other applications, but these lines of input are not valid for DNS. There are three (3) such lines in the input file.
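
    The length rule is easy to check mechanically. A minimal sketch, assuming the name is already in its ASCII (A-label) form; overlong_labels is a hypothetical helper name, and the labels below are synthetic rather than the elided test input:

```python
from typing import List

MAX_LABEL_OCTETS = 63  # per-label limit from RFC 1034

def overlong_labels(domain: str) -> List[str]:
    """Return the labels of an ASCII domain name that exceed 63 octets.

    Raises UnicodeEncodeError if the name is not pure ASCII, since the
    limit applies to the encoded form that goes on the wire.
    """
    return [label for label in domain.split(".")
            if len(label.encode("ascii")) > MAX_LABEL_OCTETS]

ok_label = "a" * 63    # exactly at the limit
bad_label = "b" * 64   # one octet too long
print(overlong_labels(f"{ok_label}.{bad_label}.com") == [bad_label])  # True
```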

     

    This appears to be a mistake in the conformance file generation. I'll look at it to see what is happening.

     

             

             

            Question 3 of 4

            Line 319

            Input U \u0308 . xn--tda

            Reference Section 4.1 of RFC 5891 (Protocol) <https://trac.tools.ietf.org/html/rfc5891>

            Issue Per the reference, input into the IDNA Registration process “MUST be… in Normalization Form C”. This input does not meet these standards. The first label is not properly normalized. Implementations of IDNA 2008 for registration should expect an exception. There are four (4) such lines in the input file.

     

    Here is the situation:

    * IDNA2003 allows denormalized text as input; it requires that the text be normalized (and case-folded) in the process of generating the Punycode.
    * IDNA2008 disallows denormalized text per se; however, it allows a mapping phase for the input, which can perform normalization and case folding for consistency with IDNA2003.

    UTS#46 provides for a mapping that is consistent with IDNA2003 and allowed by IDNA2008. That mapping normalizes U\u0308 to a lowercase U-umlaut, which is valid.
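
    The effect on this particular input can be sketched in a few lines. This is only an approximation, assuming that lowercasing plus NFC stands in for the UTS#46 mapping table (the real mapping is table-driven and covers far more than case and normalization); simple_map and to_ace are hypothetical helper names.

```python
import unicodedata

def simple_map(label: str) -> str:
    """Approximate the mapping step: lowercase, then normalize to NFC."""
    return unicodedata.normalize("NFC", label.lower())

def to_ace(label: str) -> str:
    """Punycode-encode a non-ASCII label and add the ACE prefix."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

raw = "U\u0308"                      # U followed by COMBINING DIAERESIS
mapped = simple_map(raw)
print(unicodedata.is_normalized("NFC", raw))   # False: the raw input is not NFC
print(mapped == "\u00FC")                      # True: lowercase u-umlaut
print(to_ace(mapped))                          # xn--tda
```

    The unmapped input fails a strict NFC check, while the mapped form yields the same A-label as the second label of the test line.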

     

             

             

            Question 4 of 4

            Line 276

            Input xn--53h

            Reference Appendix B.1 of RFC 5892 (Tables) <https://trac.tools.ietf.org/html/rfc5892>

            Issue Per the reference, the character \u2615 is disallowed.

            2460..26CD ; DISALLOWED # CIRCLED DIGIT ONE..DISABLED CAR

            Implementations should expect an exception. There are twenty (20) such lines in the input file.

              

     

    This is another instance where UTS#46 is mapping. See the following line from http://unicode.org/Public/idna/6.0.0/IdnaMappingTable.txt. Such a mapping is permitted by IDNA2008.

    2461 ; mapped ; 0032 # 1.1 CIRCLED DIGIT TWO

             

            Any input is appreciated,

            -- John

             

             

            John Colosi | Naming Services | VeriSign, Inc.
            703.948.3211 | 703.967.4062 | 703.421.8233
            

     


