Re: NFC/NFKC Normalization Edge Case

From: Bjoern Hoehrmann
Date: Tue Sep 22 2009 - 16:11:55 CDT

    * Jeff Senn wrote:
    >case 1: 1B11, 1B35
    >case 2: 0CCA, 0CD5
    >There are (non-compatibility) decompositions for both of these
    > 1B12 --> 1B11, 1B35
    > 0CCB --> 0CCA, 0CD5
    >All of these characters have combining class 0. Can they be canonically
    >combined? Even though the 2nd characters are NOT "combining"?

    My reading is that they most certainly can be, and that is what the
    latest implementation I've written does (it is a very literal,
    data-driven implementation). See the implementation and test script at

    And a Perl script that turns the Unicode XML database into a SQLite one

    which is needed. The NormalizationTest.txt file has the cases

      1B12;1B12;1B11 1B35;1B12;1B11 1B35;
      0CCB;0CCB;0CC6 0CC2 0CD5;0CCB;0CC6 0CC2 0CD5;
      0CCA;0CCA;0CC6 0CC2;0CCA;0CC6 0CC2;

    Note the invariants these columns (c1 to c5) must satisfy:

      # NFC
      # c2 == NFC(c1) == NFC(c2) == NFC(c3)
      # c4 == NFC(c4) == NFC(c5)
      # NFD
      # c3 == NFD(c1) == NFD(c2) == NFD(c3)
      # c5 == NFD(c4) == NFD(c5)
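
    As a rough cross-check of those invariants against the three lines
    quoted above, here is a minimal sketch that uses Python's standard
    unicodedata module. This is an assumption made for illustration: it is
    not the implementation referred to earlier, and the helper names
    (parse_columns, check_invariants) are hypothetical.

```python
import unicodedata

# The three NormalizationTest.txt lines quoted above (columns c1..c5).
TEST_LINES = [
    "1B12;1B12;1B11 1B35;1B12;1B11 1B35;",
    "0CCB;0CCB;0CC6 0CC2 0CD5;0CCB;0CC6 0CC2 0CD5;",
    "0CCA;0CCA;0CC6 0CC2;0CCA;0CC6 0CC2;",
]


def parse_columns(line):
    """Turn 'XXXX YYYY;...;' into a list of five strings (c1..c5)."""
    fields = line.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]


def nfc(s):
    return unicodedata.normalize("NFC", s)


def nfd(s):
    return unicodedata.normalize("NFD", s)


def check_invariants(line):
    """Return True if the NFC and NFD invariants hold for one test line."""
    c1, c2, c3, c4, c5 = parse_columns(line)
    return (
        c2 == nfc(c1) == nfc(c2) == nfc(c3)
        and c4 == nfc(c4) == nfc(c5)
        and c3 == nfd(c1) == nfd(c2) == nfd(c3)
        and c5 == nfd(c4) == nfd(c5)
    )


# One boolean per quoted test line.
results = [check_invariants(line) for line in TEST_LINES]
print(results)
```

    If the library's Unicode data covers these characters, every entry in
    results should be True, which includes 1B11 1B35 composing to 1B12
    under NFC.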

    Also note that the decomposition you give for U+0CCB is not the full
    one: U+0CCB fully decomposes into the sequence U+0CC6 U+0CC2 U+0CD5,
    because U+0CCA in turn decomposes into U+0CC6 U+0CC2. Perhaps that is
    a source of confusion?
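
    To make the difference between the two views of the decomposition
    concrete, the sketch below compares the single-step mapping recorded
    in the character database with the full canonical decomposition. It
    assumes Python's standard unicodedata module; the helper names
    (one_step, full_nfd) are hypothetical.

```python
import unicodedata


def one_step(ch):
    """Decomposition mapping exactly as recorded for a single character."""
    return unicodedata.decomposition(ch)


def full_nfd(ch):
    """Full canonical decomposition (NFD), as space-separated hex."""
    return " ".join(f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch))


mapping_0ccb = one_step("\u0ccb")   # single-step mapping of U+0CCB
mapping_0cca = one_step("\u0cca")   # single-step mapping of U+0CCA
full_0ccb = full_nfd("\u0ccb")      # mapping applied recursively

print(mapping_0ccb, "|", mapping_0cca, "|", full_0ccb)
```

    The single-step mapping of U+0CCB is the pair given in the question,
    while applying the mappings recursively yields the three-character
    sequence that appears in the test file.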

    >So, if the answer is indeed "YES", one might add
    >case 3: 0CCA, 0300, 0CD5 (admittedly unusual)
    >which clearly should not compose since ccc(0300) >= ccc(0CD5)

    I agree with this as well.
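
    The three cases can be compared side by side with a short sketch,
    again assuming Python's standard unicodedata module rather than the
    implementation mentioned earlier. The helper name hexes is
    hypothetical.

```python
import unicodedata


def hexes(s):
    """Render a string as space-separated uppercase hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


case1 = "\u1b11\u1b35"          # both characters have combining class 0
case2 = "\u0cca\u0cd5"          # both characters have combining class 0
case3 = "\u0cca\u0300\u0cd5"    # U+0300 sits between U+0CCA and U+0CD5

nfc_case1 = hexes(unicodedata.normalize("NFC", case1))
nfc_case2 = hexes(unicodedata.normalize("NFC", case2))
nfc_case3 = hexes(unicodedata.normalize("NFC", case3))

# Canonical combining classes of the characters in case 3.
ccc_case3 = {hexes(c): unicodedata.combining(c) for c in case3}

print(nfc_case1, "|", nfc_case2, "|", nfc_case3, "|", ccc_case3)
```

    Cases 1 and 2 compose to a single character, while in case 3 the
    intervening U+0300 (combining class 230) blocks U+0CD5 (combining
    class 0) from U+0CCA, so the sequence comes back unchanged.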

    Björn Höhrmann
    Am Badedeich 7 · Telefon: +49(0)160/4415681
    25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78
