Re: NFC/NFKC Normalization Edge Case

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Sep 22 2009 - 16:44:32 CDT


    Jeff Senn asked:

    > Can someone sort out an ambiguity for me in composition during
    > normalization? Either I misunderstand, or a couple of widely
    > deployed implementations have bugs, and/or the standards docs
    > imply an inconsistency.
    >
    > Here are 2 test cases and the question is, can the characters be
    > *canonically* combined during normalization?
    >
    > case 1: 1B11, 1B35
    > case 2: 0CCA, 0CD5
    >
    > There are (non-compatibility) decompositions for both of these
    > sequences:
    >
    > 1B12 --> 1B11, 1B35
    > 0CCB --> 0CCA, 0CD5
    >
    > All of these characters have combining class 0. Can they be canonically
    > combined? Even though the 2nd characters are NOT "combining"?

    There's the first mistake. Both of the 2nd characters in
    these sequences *ARE* combining:

    1B35;BALINESE VOWEL SIGN TEDUNG;Mc;0;L;;;;;N;;;;;
    0CD5;KANNADA LENGTH MARK;Mc;0;L;;;;;N;;;;;

    gc=Mc indicates a combining mark -- they are just not *non-spacing*
    combining marks.
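
    A quick way to confirm those two properties, sketched here with
    Python's standard unicodedata module (the helper name describe() is
    made up for illustration, and the values come from whatever Unicode
    version the Python build ships with):

```python
import unicodedata


def describe(cp):
    """Return (General_Category, Canonical_Combining_Class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)


for cp in (0x1B35, 0x0CD5):
    gc, ccc = describe(cp)
    # Mc = spacing combining mark; ccc 0 means the character is a starter.
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc={gc} ccc={ccc}")
```

    Both marks should report gc=Mc with ccc=0, i.e. they are combining
    marks that also happen to be starters.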

    > My read of the current UAX15 implies "yes".

    Correct. The entries from NormalizationTest.txt spell out
    what the various normalized forms are:

    source;NFC;NFD;NFKC;NFKD
    1B12;1B12;1B11 1B35;1B12;1B11 1B35; # ... BALINESE LETTER OKARA TEDUNG

    source;NFC;NFD;NFKC;NFKD
    0CCB;0CCB;0CC6 0CC2 0CD5;0CCB;0CC6 0CC2 0CD5; # ... KANNADA VOWEL SIGN OO
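
    The NormalizationTest.txt entries above can be reproduced with any
    conformant implementation. A minimal sketch using Python's
    unicodedata.normalize (hexseq() and forms() are illustrative helper
    names, and the result depends on the Unicode data bundled with the
    interpreter):

```python
import unicodedata


def hexseq(s):
    """Render a string as space-separated uppercase hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


def forms(s):
    """Return all four normalization forms of s as hex strings."""
    return {
        form: hexseq(unicodedata.normalize(form, s))
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


for source in ("\u1B12", "\u0CCB", "\u1B11\u1B35", "\u0CCA\u0CD5"):
    print(hexseq(source), "->", forms(source))
```

    The last two inputs are case 1 and case 2 from the question; under
    the test data quoted above, their NFC forms should be 1B12 and 0CCB
    respectively.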

    > UAX15 currently says: "D2. In any character sequence beginning with a
    > starter S, a character C is blocked from S if and only if there is
    > some character B between S and C, and either B is a starter or it has
    > the same or higher combining class as C."
    >
    > Since there is no character between S and C, I assume C is not
    > "blocked".

    Correct.
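
    For illustration only, the D2 definition quoted above can be
    transcribed almost literally. This is a sketch of the blocking test
    alone, not a full composition algorithm; is_blocked() is a
    hypothetical helper, and it assumes the sequence has already been
    canonically decomposed and ordered:

```python
import unicodedata


def is_blocked(seq, c_index):
    """D2 check: seq[0] is the starter S, seq[c_index] is the character C.

    C is blocked from S iff some character B strictly between them is a
    starter or has a combining class >= that of C.
    """
    if unicodedata.combining(seq[0]) != 0:
        raise ValueError("seq must begin with a starter (ccc=0)")
    if not 0 < c_index < len(seq):
        raise ValueError("c_index must point at a character after S")
    ccc_c = unicodedata.combining(seq[c_index])
    for b in seq[1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0 or ccc_b >= ccc_c:
            return True
    return False


print(is_blocked("\u0CCA\u0CD5", 1))        # nothing between S and C
print(is_blocked("\u0CCA\u0300\u0CD5", 2))  # U+0300 sits between S and C
```

    With nothing between S and C the loop never runs, so an adjacent C
    is never blocked, whatever its combining class.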

    >
    > However, a previous draft of UAX15 uses the phrase "A combining
    > character C can be canonically combined with a base character
    > B..." which implies "No".
    >
    > http://unicode.org/reports/tr15/
    > http://unicode.org/reports/tr15/pdtr15.html

    Proposed Draft UTR #15 (from 1998!) has no status. It is not
    an approved document, nor was it ever. It should not be referred
    to for anything other than historical interest in the development
    of the standard.

    Please refer only to the *approved* version of UAX #15 for such
    a discussion.

    > At least 2 implementations do the combination in case 2: Python
    > and the ICU library (e.g. http://minaret.info/test/normalize.msp ).
    > However, ICU seems inconsistent in that it does NOT combine
    > case 1!

    Which *version* of ICU? Balinese was added in Unicode 5.0.
    If you are using an older version of ICU that had not yet
    been updated to Unicode 5.0, then it would not have knowledge
    of Balinese, and the sequence <1B11, 1B35> (for that version)
    would be unchanged by normalization.
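
    The same version dependence applies to any implementation. As a
    rough analogue in Python (not ICU, whose version API is not shown
    here), one can check which Unicode version the bundled data
    implements before expecting case 1 to compose. unicode_version()
    and knows_balinese() are illustrative names:

```python
import unicodedata


def unicode_version():
    """Unicode version of the bundled character database, as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def knows_balinese():
    """True if the data is at least Unicode 5.0 and U+1B11 is assigned a name."""
    return unicode_version() >= (5, 0) and unicodedata.name("\u1B11", None) is not None


print("Unicode data version:", unicodedata.unidata_version)
if knows_balinese():
    composed = unicodedata.normalize("NFC", "\u1B11\u1B35")
    print("NFC of <1B11, 1B35>:", " ".join(f"{ord(c):04X}" for c in composed))
else:
    print("Data predates Balinese; <1B11, 1B35> would pass through unchanged")
```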

    >
    > So, if the answer is indeed "YES", one might add
    >
    > case 3: 0CCA, 0300, 0CD5 (admittedly unusual)
    >
    > which clearly should not compose since ccc(0300) >= ccc(0CD5)
    > (http://www.unicode.org/review/pr-29.html)
    >
    > Python, however, (incorrectly) yields: 0CCB, 0300

    That looks like a bug in Python.
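
    A small regression check for case 3, again sketched with Python's
    unicodedata module. It assumes an interpreter whose normalizer
    applies the blocking rule as discussed in this thread, in which
    case U+0300 blocks U+0CD5 and the sequence stays uncomposed;
    nfc_hex() is an illustrative helper:

```python
import unicodedata


def nfc_hex(s):
    """NFC-normalize s and render it as space-separated hex code points."""
    return " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFC", s))


case3 = "\u0CCA\u0300\u0CD5"
result = nfc_hex(case3)
print("NFC of <0CCA, 0300, 0CD5>:", result)
if result == "0CCB 0300":
    # This is the output reported in the question as incorrect.
    print("U+0CD5 was composed across the intervening U+0300")
```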

    --Ken


