NFC/NFKC Normalization Edge Case

From: Jeff Senn (senn@maya.com)
Date: Tue Sep 22 2009 - 14:36:45 CDT

    Can someone sort out an ambiguity for me in composition during
    normalization? Either I misunderstand, or a couple of widely
    deployed implementations have bugs, and/or the standards docs
    imply an inconsistency.

    Here are 2 test cases, and the question is: can the characters be
    *canonically* combined during normalization?

    case 1: 1B11, 1B35
    case 2: 0CCA, 0CD5

    There are (non-compatibility) decompositions for both of these
    sequences:

      1B12 --> 1B11, 1B35
      0CCB --> 0CCA, 0CD5

    All of these characters have combining class 0. Can they be canonically
    combined? Even though the 2nd characters are NOT "combining"?
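
    The properties claimed above can be checked against a local copy of
    the Unicode Character Database. This is only a minimal sketch,
    assuming Python 3 and its standard unicodedata module; the helper
    name describe() is made up for illustration, and the values reported
    depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Precomposed characters from the message, mapped to the pairs in question.
CASES = {
    "\u1b12": ("\u1b11", "\u1b35"),  # case 1
    "\u0ccb": ("\u0cca", "\u0cd5"),  # case 2
}

def describe(composite, parts):
    """Return (decomposition mapping, combining classes of the parts).

    A canonical mapping has no <tag> prefix in the returned string;
    a compatibility mapping would start with something like '<compat>'.
    """
    mapping = unicodedata.decomposition(composite)
    classes = [unicodedata.combining(ch) for ch in parts]
    return mapping, classes

for composite, parts in CASES.items():
    mapping, classes = describe(composite, parts)
    print(f"U+{ord(composite):04X} -> {mapping!r}, ccc of parts = {classes}")
```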

    My reading of the current UAX15 implies "yes".

    UAX15 currently says: "D2. In any character sequence beginning with a
    starter S, a character C is blocked from S if and only if there is
    some character B between S and C, and either B is a starter or it has
    the same or higher combining class as C."

    Since there is no character between S and C, I assume C is not
    "blocked".
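
    The D2 wording quoted above can be turned into a small predicate.
    This is a sketch of the definition exactly as quoted, not of any
    particular implementation; is_blocked() is a hypothetical helper,
    and it assumes Python 3 with the standard unicodedata module.

```python
import unicodedata

def is_blocked(seq, c_index):
    """Apply D2 as quoted: seq[0] is the starter S, seq[c_index] is C.

    C is blocked from S if some character B strictly between them is a
    starter (ccc 0) or has the same or higher combining class as C.
    """
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    # No intervening characters, or none that block: C is not blocked.
    return False

# With nothing between S and C, the loop body never runs.
print(is_blocked("\u1b11\u1b35", 1))  # case 1
print(is_blocked("\u0cca\u0cd5", 1))  # case 2
```

    Under this reading, an adjacent pair can never be blocked, whatever
    the combining class of the second character.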

    However, a previous draft of UAX15 uses the phrase "A combining
    character C can be canonically combined with a base character B...",
    which implies "No".

    http://unicode.org/reports/tr15/
    http://unicode.org/reports/tr15/pdtr15.html

    At least 2 implementations do the combination in case 2: Python and
    the ICU library (e.g. http://minaret.info/test/normalize.msp ).
    However, ICU seems inconsistent in that it does NOT combine case 1!

    So, if the answer is indeed "YES", one might add

    case 3: 0CCA, 0300, 0CD5 (admittedly unusual)

    which clearly should not compose since ccc(0300) >= ccc(0CD5)
    (http://www.unicode.org/review/pr-29.html)

    Python, however, (incorrectly) yields: 0CCB, 0300
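
    To reproduce the three cases on a given installation, something
    like the following could be used. It is a sketch assuming Python 3
    and the standard unicodedata module; codepoints() and normalized()
    are illustrative helpers. What it prints depends on the interpreter
    version and its bundled Unicode data, so it may not match the
    behaviour reported above.

```python
import unicodedata

# The three input sequences discussed in the message.
CASES = {
    "case 1": "\u1b11\u1b35",
    "case 2": "\u0cca\u0cd5",
    "case 3": "\u0cca\u0300\u0cd5",
}

def codepoints(text):
    """Render a string as a list of 4-digit hex code points."""
    return [f"{ord(ch):04X}" for ch in text]

def normalized(text, form="NFC"):
    """Normalize text to the given form and return its code points."""
    return codepoints(unicodedata.normalize(form, text))

for name, text in CASES.items():
    print(name, codepoints(text), "->", normalized(text))
```

    Passing form="NFD" to normalized() shows the fully decomposed
    sequence that the composition step starts from.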

    Help!


