Re: Open Issue #61: Proposed Update UAX #15 Unicode Normalization Forms

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Jan 27 2005 - 13:21:07 CST

    Shawn Steele wrote:
    > I think your example's mixed up. U+1161 is blocked from combining with
    > U+1100 by the U+0300 in either form. Since U+1100 and U+1161 are start
    > characters the change shouldn't impact this case, both interpretations
    > should consistently normalize U+1100 U+0300 U+1161 to U+1100 U+0300
    > U+1161 (unchanged) and U+AC00 U+0300 to U+AC00 U+0300 (also unchanged).

    No, that is exactly the problem; see PRI 29. According to the letter of the old UAX version
    (its definitions, not its intent or its samples), you would indeed get AC00 0300 for
    1100 0300 1161. The problem occurs exactly when a combining mark (ccc>0) sits between two
    starters that combine.
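
A minimal sketch of the case above, assuming a Python whose standard `unicodedata` module follows the corrected composition rule (U+0300 blocks U+1161 from U+1100). The helper `canonically_equivalent` is a hypothetical name introduced here; it compares NFD forms, which is one common way to test canonical equivalence.

```python
import unicodedata

# The sequence under discussion: Hangul L, combining grave accent, Hangul V.
s = "\u1100\u0300\u1161"

# Canonical combining classes: the two jamo are starters (ccc == 0),
# U+0300 is a combining mark (ccc == 230) sitting between them.
ccc_values = [unicodedata.combining(c) for c in s]

# Under the corrected rule the mark blocks composition, so NFC leaves s alone.
nfc_result = unicodedata.normalize("NFC", s)


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The output of the old literal reading, AC00 0300, decomposes to
# 1100 1161 0300, which is a different string from 1100 0300 1161.
old_reading_output = "\uAC00\u0300"

if __name__ == "__main__":
    print(ccc_values)
    print([f"{ord(c):04X}" for c in nfc_result])
    print(canonically_equivalent(s, old_reading_output))
```

The last comparison is the point of item 1 below: the old literal reading maps the input to text that is not canonically equivalent to it.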

    Important:

    1. Aside from broken idempotency, this interpretation of the old UAX version "normalizes" such text
    to something that is *not canonically equivalent* to the input - it changes some text to some
    completely different text.

    2. There also exist strings (see PRI 29) where the application of NFC[old UAX] or NFKC[old UAX]
    produces output that is not only different text (not canonically equivalent) but also *not in
    canonical order*. As a result, something you got from normalization may not even pass the
    normalization quick check: NFC_quick_check(NFC(string))=NO.
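
The properties named in the two points above can be checked mechanically. The following is a rough sketch, not taken from the UAX: `is_canonically_ordered` and `check_nfc_invariants` are hypothetical helper names, and the canonical-order test stands in for the part of the quick check that rejects misordered combining marks.

```python
import unicodedata


def is_canonically_ordered(text: str) -> bool:
    """True if no combining mark follows a mark with a higher ccc."""
    prev = 0
    for ch in text:
        ccc = unicodedata.combining(ch)
        # A violation is a non-starter preceded by a larger combining class.
        if ccc != 0 and prev > ccc:
            return False
        prev = ccc
    return True


def check_nfc_invariants(text: str) -> dict:
    """Check properties that any correct NFC implementation should satisfy."""
    nfc = unicodedata.normalize("NFC", text)
    return {
        # Output must be canonically equivalent to the input (same NFD form).
        "canonically_equivalent": unicodedata.normalize("NFD", nfc)
        == unicodedata.normalize("NFD", text),
        # Normalizing already-normalized text must change nothing.
        "idempotent": unicodedata.normalize("NFC", nfc) == nfc,
        # Output must be in canonical order.
        "canonical_order": is_canonically_ordered(nfc),
    }


if __name__ == "__main__":
    print(check_nfc_invariants("\u1100\u0300\u1161"))
```

An implementation following the old literal reading would be expected to fail at least one of these checks on the PRI 29 strings.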

    That just had to be fixed, and was.

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

