RE: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Shawn Steele (shawnste@winse.microsoft.com)
Date: Thu Jan 27 2005 - 11:57:02 CST


    > "Simon" said:

    > By referencing Unicode 3.2, StringPrep uses the old interpretation.
    > Clarifying this would be good.

    So StringPrep has the identical ambiguity here: it doesn't reference a
    particular interpretation of the standard, only a document that is,
    obviously :-), a bit ambiguous.

    > Let's say x = U+1100 U+0300 U+1161. ToUnicode(x) = x by definition
    > (see 4.2 of RFC 3490). ToAscii(ToUnicode(x)) =
    > xn--ksa1467f, with the fix (i.e., how IDN is specified to work). You
    > then get ToUnicode(ToAscii(ToUnicode(x))) = U+AC00 U+0300, which
    > according to PR29 would be "wrong". With the proposed fix you would
    > get U+1100 U+0300 U+1161 instead. There is nothing invalid about
    > these IDN strings, although they supposedly do not occur naturally.

    I think your example's mixed up. U+1161 is blocked from combining with
    U+1100 by the U+0300 in either form. Since U+1100 and U+1161 are both
    starter characters, the change shouldn't affect this case: both
    interpretations should consistently normalize U+1100 U+0300 U+1161 to
    U+1100 U+0300 U+1161 (unchanged) and U+AC00 U+0300 to U+AC00 U+0300
    (also unchanged).
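
    The two cases above can be checked with a minimal sketch, assuming a
    Python 3 build whose unicodedata module follows the corrected
    ("fixed") definition. The helper name code_points is made up for
    illustration, and the script shows only one implementation's
    behavior, not the other interpretation of the old text.

```python
import unicodedata


def code_points(s):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


blocked = "\u1100\u0300\u1161"  # leading jamo, combining grave, vowel jamo
precomposed = "\uac00\u0300"    # precomposed Hangul syllable, combining grave

# U+0300 has a non-zero combining class; the two jamo are starters (class 0).
print(unicodedata.combining("\u0300"), unicodedata.combining("\u1161"))

# Normalize both strings under NFC and NFKC and show input and output.
for s in (blocked, precomposed):
    for form in ("NFC", "NFKC"):
        result = unicodedata.normalize(form, s)
        print(form, code_points(s), "->", code_points(result))
```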

    The problem would involve mixed-up combining classes. If I have one of
    these messed-up strings u (for Unicode) and do ToAscii(u) on it, then
    I'll get an ASCII form a:
            a = ToAscii(u)

    Then if we do
            u2 = ToUnicode(a)
            a2 = ToAscii(u2)

    Now, using the "fixed" normalization, a == a2 && u == u2. However, if
    we used the alternate interpretation of the old UAX doc, then u != u2
    and NFKC(u) != NFKC(u2), and therefore a != a2. This is because u has
    effectively been normalized twice by the time we get to a2.
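
    The "normalized twice" point amounts to asking whether NFKC is
    idempotent. A rough sketch of that check, again assuming Python's
    unicodedata implements the corrected definition; nfkc, is_stable and
    the sample strings are illustrative choices, not taken from the
    thread's problem cases:

```python
import unicodedata


def nfkc(s):
    """Apply NFKC once."""
    return unicodedata.normalize("NFKC", s)


def is_stable(s):
    """True when normalizing a second time changes nothing."""
    once = nfkc(s)
    return nfkc(once) == once


samples = [
    "\u1100\u0300\u1161",  # jamo separated by a combining grave
    "\uac00\u0300",        # precomposed syllable plus combining grave
    "\u1100\u1161",        # adjacent jamo that compose to U+AC00
    "a\u0301\u0323",       # marks given out of canonical order
]

for s in samples:
    print([f"U+{ord(c):04X}" for c in s], is_stable(s))
```

    If any sample reported False, a second ToAscii pass could see a
    different normalized string than the first, which is the a != a2
    situation described above.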

    The IDN/StringPrep RFCs require that ToAscii(ToUnicode(a)) == a. In
    this example it would not, which is why I'm saying that this string
    would pretty much be illegal according to IDN. IDN needs this fix as
    badly as UAX 15 does.
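
    The round-trip requirement can be sketched with Python's built-in
    "idna" codec, which implements RFC 3490 on top of its own
    nameprep tables. The wrappers to_ascii and to_unicode and the sample
    label are assumptions for illustration; this exercises an ordinary
    label, not one of the problematic strings.

```python
def to_ascii(label: str) -> bytes:
    """Hypothetical wrapper: ToAscii via Python's 'idna' codec."""
    return label.encode("idna")


def to_unicode(label: bytes) -> str:
    """Hypothetical wrapper: ToUnicode via Python's 'idna' codec."""
    return label.decode("idna")


u = "b\u00fccher"   # sample label containing one non-ASCII character
a = to_ascii(u)     # a  = ToAscii(u)
u2 = to_unicode(a)  # u2 = ToUnicode(a)
a2 = to_ascii(u2)   # a2 = ToAscii(u2)

# The requirement discussed above: ToAscii(ToUnicode(a)) == a.
print(a, a == a2, u == u2)
```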

    - Shawn

    Shawn Steele
    Software Design Engineer
    Windows/.Net Globalization
    (Normalization & IDN APIs)


