Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Simon Josefsson (jas@extundo.com)
Date: Thu Jan 27 2005 - 15:02:19 CST


    "Shawn Steele" <shawnste@winse.microsoft.com> writes:

    >> If you use the old NFC language, for u = U+1100 U+0300 U+1161, you
    >> will get a = xn--ksa1467f.
    >
    > The xn-- is confusing about the Unicode code points represented, so
    > I'm going to describe them in a different way. (also 'cause I don't
    > have a 'broken' implementation to generate the bad code points
    > quickly ;-))

    There is an online interface to one such implementation at
    <http://josefsson.org/idn.php>, although I would argue that it is
    correct, and not broken, at least until StringPrep/IDN is updated to
    handle this issue.

    > RFC 3490 4.2 clearly states that ToAscii(ToUnicode(a)), where a is an xn-- format string, must round trip; otherwise it's not a valid punycode string, and the API is supposed to return the original input, not the badly decoded input.
    >
    > Using the OLD strict interpretation of D2 that ignores the rest of the document:
    >
    > A) ToAscii(U+1100 U+0300 U+1161 U+0323) becomes the punycode representation of xn--(punycode of)U+AC00 U+0300 U+0323
    >
    > B) ToUnicode(xn--(punycode of)U+AC00 U+0300 U+0323) becomes U+AC00 U+0300 U+0323
    >
    > C) ToAscii(U+AC00 U+0300 U+0323) becomes the punycode representation of xn--(punycode of)U+AC00 U+0323 U+0300
    >
    > RFC 3490 4.2 specifies that ToAscii(ToUnicode(x)) == x, but in this case it doesn't hold, so the B xn-- string is invalid and ToUnicode would be required to return the xn-- string, NOT the Unicode version.
    >
    > This should fairly clearly prove that IDN is broken anyway with the old D2, so fixing this shouldn't be an issue for IDN. This change solves IDN problems, it doesn't introduce them.

    I understand what you mean now.

    Your argument works well for the subset of problem sequences whose
    old-language output is unstable under NFKC.

    However, the argument does not work for all problem sequences, and in
    particular it does not work for the example in PR29 I quoted.

    As far as I can tell, the claim that all the PR29 problem sequences
    are invalid IDN strings is false.
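
    The difference between the two cases can be checked with a small
    sketch. It assumes that Python's unicodedata module implements the
    corrected (PR29) composition rule, and that the old-language outputs
    are U+AC00 U+0300 for my example and U+AC00 U+0300 U+0323 for step A
    above; the helper name nfkc_stable is made up for illustration.

```python
import unicodedata


def nfkc_stable(s: str) -> bool:
    """Return True if s is left unchanged by NFKC in this Python build."""
    return unicodedata.normalize("NFKC", s) == s


# Assumed old-language NFC output for U+1100 U+0300 U+1161 (the PR29 example).
pr29_old = "\uac00\u0300"

# Assumed old-language NFC output for U+1100 U+0300 U+1161 U+0323 (step A).
step_a_old = "\uac00\u0300\u0323"

# The PR29 example output survives a second pass, so the round trip holds.
print(nfkc_stable(pr29_old))

# The step A output gets its combining marks reordered, so the round trip fails.
print(nfkc_stable(step_a_old))
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", step_a_old)])
```

    Only the second string is changed by a second normalization pass,
    which is why the round-trip argument catches it but not the first.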

    It would be interesting to find out what percentage of the problem
    sequences are unstable under NFKC.

    Thanks.


