Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Simon Josefsson (jas@extundo.com)
Date: Thu Jan 27 2005 - 12:55:09 CST


    "Shawn Steele" <shawnste@winse.microsoft.com> writes:

    >> "Simon" said:
    >
    >> By referencing Unicode 3.2, StringPrep uses the old interpretation.
    >> Clarifying this would be good.
    >
    > So StringPrep has the identical ambiguity here because it doesn't
    > reference a particular interpretation of the standard, but rather it
    > references a document that is, obviously :-), a bit ambiguous.

    We seem to disagree on this.

    I believe the old document is unambiguous. It is possible to follow
    the old normative text and end up with an implementation that works
    fine for all practically occurring strings. People have done exactly
    this, and deployed the code.

    It is unfortunate that such an implementation would behave badly for a
    select few corner cases, but it is not the end of the world.

    >> Let's say x = U+1100 U+0300 U+1161. ToUnicode(x) = x by definition
    >> (see 4.2 of RFC 3490). ToAscii(ToUnicode(x)) =
    >> xn--ksa1467f, with the fix (i.e., how IDN is specified to work). You
    >> then get ToUnicode(ToAscii(ToUnicode(x))) = U+AC00 U+0300, which
    >> according to PR29 would be "wrong". With the proposed fix you would
    >> get U+1100 U+0300 U+1161 instead. There is nothing invalid about
    >> these IDN strings, although they supposedly do not occur naturally.
    >
    > I think your example's mixed up. U+1161 is blocked from combining with
    > U+1100 by the U+0300 in either form. Since U+1100 and U+1161 are start
    > characters the change shouldn't impact this case, both interpretations
    > should consistently normalize U+1100 U+0300 U+1161 to U+1100 U+0300
    > U+1161 (unchanged) and U+AC00 U+0300 to U+AC00 U+0300 (also unchanged).

    I disagree.

    According to <http://www.unicode.org/review/pr-29.html>, with the old
    wording, U+1100 U+0300 U+1161 would normalize into U+AC00 U+0300.
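
    To make the disagreement concrete, here is a minimal sketch of the
    "blocked" test under both wordings. It is a paraphrase, not normative
    text: the function name is_blocked and its same_or_higher flag are
    hypothetical, and the combining classes come from Python's
    unicodedata module rather than from the Unicode 3.2 data files.

```python
import unicodedata


def is_blocked(seq, c_index, same_or_higher):
    """Return True if seq[c_index] is blocked from the starter seq[0].

    same_or_higher=False models the old wording (B has the same
    combining class as C); True models the PR-29 wording (same or higher).
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            return True  # an intervening starter always blocks
        if ccc_b == ccc_c or (same_or_higher and ccc_b > ccc_c):
            return True
    return False


x = "\u1100\u0300\u1161"  # the example string from this thread
old_blocked = is_blocked(x, 2, same_or_higher=False)
new_blocked = is_blocked(x, 2, same_or_higher=True)
print(old_blocked, new_blocked)
```

    Under the old wording U+1161 is not blocked, so it can compose with
    U+1100 into U+AC00; under the PR-29 wording U+0300 blocks it.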

    > The problem would involve mixed up combining classes. If I have one of
    > these messed up strings u (for Unicode), and do ToAscii(u) on it, then
    > I'll get an ascii form a:
    > a = ToAscii(u)

    I'm with you so far.

    However, keep in mind that what 'a' looks like depends on how NFKC
    is implemented in the IDN implementation.

    If you use the old NFC language, for u = U+1100 U+0300 U+1161, you
    will get a = xn--ksa1467f. You will get a different output value if
    your NFKC implementation, contrary to the StringPrep specification,
    implements NFKC with the proposed modification.
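
    As a rough check of the second case, the sketch below runs the same
    string through a present-day normalizer. It assumes CPython's
    unicodedata module, which, as far as I can tell, follows the
    corrected definition; it says nothing about what a Unicode 3.2 era
    implementation would return. The variable names are illustrative.

```python
import unicodedata

u = "\u1100\u0300\u1161"         # the example string from this thread
old_reading = "\uac00\u0300"     # result under the old wording, per PR-29

# With the corrected definition, U+0300 blocks U+1161, so u is unchanged.
nfkc_u = unicodedata.normalize("NFKC", u)
unchanged = nfkc_u == u

# The two candidate results are different NFKC strings, so the ASCII
# form computed from them cannot be the same either.
differs = nfkc_u != unicodedata.normalize("NFKC", old_reading)
print(unchanged, differs)
```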

    > Then if we do
    > u2 = ToUnicode(a)
    > a2 = ToAscii(u2)
    >
    > Now, using the "fixed" normalization, a == a2 && u == u2, however if we
    > used the alternate interpretation of the old UAX doc, then u != u2 and
    > NFKC(u) != NFKC(u2) so therefore a != a2. This is because u was
    > eventually normalized twice by the time we get to a2.
    >
    > The IDN/StringPrep RFCs require that ToAscii(ToUnicode(a)) == a. In
    > this example it would not, which is why I'm saying that this string
    > would pretty much be illegal according to IDN.

    I don't follow this part. Presumably you meant something else, much
    like I mixed up the symbol language earlier. Specifically:

    The invariant 'ToAscii(ToUnicode(a)) == a' doesn't hold. Consider the
    string a = ß. ToUnicode(ß) = ß. ToAscii(ß) = ss. ß != ss.
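
    The ToAscii half of this counterexample can be reproduced with a
    short sketch. It assumes CPython's built-in "idna" codec as the
    RFC 3490 implementation; the ToUnicode step is taken from the text
    above (ToUnicode(ß) = ß) rather than computed.

```python
# "ß" (U+00DF) is the example label from the message above.
a = "\u00df"

# ToAscii via CPython's built-in "idna" codec; Nameprep maps ß to "ss".
ascii_form = a.encode("idna")
print(ascii_form)

# Taking ToUnicode(a) == a as stated above, ToAscii(ToUnicode(a)) != a.
round_trip_equal = ascii_form.decode("ascii") == a
print(round_trip_equal)
```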

    > IDN needs this fix as badly as UAX 15 does.

    I would agree that IDN and UAX15 need _a_ fix, but not necessarily the
    proposed one.

    Regards,
    Simon


