RE: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Shawn Steele (shawnste@winse.microsoft.com)
Date: Thu Jan 27 2005 - 13:41:37 CST


    > According to <http://www.unicode.org/review/pr-29.html>, with the old
    > wording, U+1100 U+0300 U+1161 would normalize into U+AC00 U+0300.

    Sorry, I hadn't looked at the document in a while and hadn't seen that example. I was thinking more along the lines of the Instability Example.

    > I'm with you so far.

    > However, keep in mind that what 'a' looks like depends on how NFKC
    > was implemented in the IDN implementation.

    So with the instability example:

    Instability Example
    An example where the existing definition D2 causes failure of stability for repeated normalization is shown in sequence D:

            U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK +
            U+0300 (◌̀) COMBINING GRAVE ACCENT +
            U+1161 (ᅡ) HANGUL JUNGSEONG A +
            U+0323 (◌̣) COMBINING DOT BELOW

    The first NFC normalization produces sequence E:

            U+AC00 (가) HANGUL SYLLABLE GA +
            U+0300 (◌̀) COMBINING GRAVE ACCENT +
            U+0323 (◌̣) COMBINING DOT BELOW

    A subsequent NFC normalization reverses the order of the accents, producing sequence F:

            U+AC00 (가) HANGUL SYLLABLE GA +
            U+0323 (◌̣) COMBINING DOT BELOW +
            U+0300 (◌̀) COMBINING GRAVE ACCENT
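
    The sequences above can be checked against a present-day normalizer. The following is a minimal sketch, assuming Python's standard unicodedata module follows the corrected definition (it should leave sequence D alone rather than produce E); the names seq_d, seq_e, seq_f and code_points are introduced here for illustration only.

```python
import unicodedata

# Sequences D, E and F from the Instability Example.
seq_d = "\u1100\u0300\u1161\u0323"
seq_e = "\uAC00\u0300\u0323"
seq_f = "\uAC00\u0323\u0300"


def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Under the corrected definition the grave accent blocks L + V composition,
# so D is expected to come back unchanged.
nfc_d = unicodedata.normalize("NFC", seq_d)

# E is not in canonical order (dot below sorts before grave), so NFC reorders it.
nfc_e = unicodedata.normalize("NFC", seq_e)

print("NFC(D) =", code_points(nfc_d))
print("NFC(E) =", code_points(nfc_e))
print("NFC(D) stable:", unicodedata.normalize("NFC", nfc_d) == nfc_d)
```

    If NFC(D) ever came back as E, a second pass would turn it into F, which is exactly the instability described above.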

    > If you use the old NFC language, for u = U+1100 U+0300 U+1161, you
    > will get a = xn--ksa1467f.

    The xn-- is confusing about the Unicode code points represented, so I'm going to describe them in a different way. (also 'cause I don't have a 'broken' implementation to generate the bad code points quickly ;-))

    > The invariant 'ToAscii(ToUnicode(a)) == a' doesn't hold. Consider the
    > string a = ß. ToUnicode(ß) = ß. ToAscii(ß) = ss. ß != ss.

    I'm disregarding the ToUnicode(Unicode) case because the input is already Unicode. Your example is strange: ToUnicode(ß), although permitted by RFC 3490, isn't very interesting, since to resolve that name I'd have to do a ToAscii() on it, and it'd be stuck in ss form forever.

    RFC 3490 4.2 clearly states that ToAscii(ToUnicode(a)), where a is an xn-- format string, must round trip. Otherwise it's not a valid punycode string, and the API is supposed to return the original input, not the badly decoded input.

    Using the OLD strict interpretation of D2 that ignores the rest of the document:

    A) ToAscii(U+1100 U+0300 U+1161 U+0323) becomes xn--(punycode of)U+AC00 U+0300 U+0323

    B) ToUnicode(xn--(punycode of)U+AC00 U+0300 U+0323) becomes U+AC00 U+0300 U+0323

    C) ToAscii(U+AC00 U+0300 U+0323) becomes xn--(punycode of)U+AC00 U+0323 U+0300

    RFC 3490 4.2 specifies that ToAscii(ToUnicode(x)) == x, but in this case that doesn't hold: step C produces a different xn-- string from the one that step B decoded. So the B xn-- string is invalid, and ToUnicode would be required to return the xn-- string, NOT the Unicode version.
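
    Steps A through C can be sketched in code. This is a toy model under stated assumptions, not an IDN implementation: to_ascii and to_unicode are simplified stand-ins that only normalize and Punycode-encode (real ToASCII runs full nameprep with NFKC), and old_d2_nfc is a hypothetical helper that imitates the strict old-D2 reading by letting a leading consonant compose with a vowel across intervening combining marks.

```python
import unicodedata

ACE_PREFIX = "xn--"
L_BASE, V_BASE, S_BASE = 0x1100, 0x1161, 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def corrected_nfc(text):
    """NFC as computed by the standard library."""
    return unicodedata.normalize("NFC", text)


def old_d2_nfc(text):
    """Hypothetical model of the old D2: L + V compose across non-starters."""
    s = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(s):
        lead = ord(s[i]) - L_BASE
        j = i + 1
        # Skip the combining marks that follow the candidate leading consonant.
        while j < len(s) and unicodedata.combining(s[j]) != 0:
            j += 1
        vowel = ord(s[j]) - V_BASE if j < len(s) else -1
        if 0 <= lead < L_COUNT and j > i + 1 and 0 <= vowel < V_COUNT:
            # Compose the LV syllable and keep the marks in their original order.
            out.append(chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT))
            out.extend(s[i + 1:j])
            i = j + 1
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def to_ascii(label, normalize):
    """Simplified ToAscii: normalize, then Punycode-encode behind the ACE prefix."""
    return ACE_PREFIX + normalize(label).encode("punycode").decode("ascii")


def to_unicode(ace, normalize):
    """Simplified ToUnicode with the round-trip check from RFC 3490 4.2."""
    decoded = ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    if to_ascii(decoded, normalize).lower() != ace.lower():
        return ace  # round trip failed: return the original input
    return decoded


seq_d = "\u1100\u0300\u1161\u0323"

ace_old = to_ascii(seq_d, old_d2_nfc)        # step A
back_old = to_unicode(ace_old, old_d2_nfc)   # steps B and C

ace_new = to_ascii(seq_d, corrected_nfc)
back_new = to_unicode(ace_new, corrected_nfc)

print("old D2 returns the xn-- string:", back_old == ace_old)
print("corrected NFC round trips:", back_new == seq_d)
```

    In this model the old reading yields an xn-- string that ToUnicode has to hand back undecoded, while the corrected normalization round trips.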

    This should fairly clearly prove that IDN is broken anyway with the old D2, so fixing this shouldn't be an issue for IDN. This change solves IDN problems; it doesn't introduce them.

    - Shawn

    Shawn Steele
    Software Design Engineer
    Windows/.Net Globalization
    (Normalization & IDN APIs)
    Microsoft


