Implementing NFC

From: Daniel Ehrenberg (
Date: Thu Mar 15 2007 - 20:49:16 CST


    I'm working on adding Unicode support (possibly eventually conformance)
    to an obscure programming language called Factor, which is sort of a
    cross between Forth and Lisp (see for more
    information). One thing that I'm doing is that all strings will always
    be kept in Normalization Form D (as defined in UAX #15: Normalization
    Forms) for processing. That way all canonically equivalent strings
    return true when tested for equality. It wasn't difficult to implement
    NFD (or NFKD); I just needed to read the decomposition mappings from
    UnicodeData.txt and apply them recursively to get a hash table mapping
    characters to canonical/compatibility-decomposed strings. But for most
    I/O purposes, I need to use NFC, re-composing all decomposed
    characters. I have no idea how to do this efficiently. In many cases,
    it's more complicated than just turning two adjacent characters into
    one character.
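
    To make the decomposition step concrete, here is a minimal sketch in
    Python rather than Factor. It assumes Python's standard unicodedata
    module as a stand-in for parsing UnicodeData.txt; the names
    decompose_char, canonical_order, and nfd are made up for illustration.
    Hangul syllables are decomposed algorithmically in the standard and
    are not handled here.

```python
import unicodedata

def decompose_char(ch, compat=False):
    """Recursively expand one character's decomposition mapping."""
    decomp = unicodedata.decomposition(ch)  # e.g. "0041 030A" or "<compat> 0066 0069"
    if not decomp:
        return ch
    parts = decomp.split()
    if parts[0].startswith("<"):
        # A tag such as <compat> marks a compatibility mapping.
        if not compat:
            return ch
        parts = parts[1:]
    return "".join(decompose_char(chr(int(p, 16)), compat) for p in parts)

def canonical_order(s):
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch) == 0:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def nfd(s, compat=False):
    """Sketch of NFD (or NFKD when compat=True); Hangul is not handled."""
    return canonical_order("".join(decompose_char(c, compat) for c in s))

# U+1E9B U+0323 decomposes and reorders to U+017F U+0323 U+0307.
print([hex(ord(c)) for c in nfd("\u1e9b\u0323")])  # ['0x17f', '0x323', '0x307']
```

    The sort has to be stable so that marks with the same combining class
    keep their relative order.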

    I looked at both the GLib source (which defines basic Unicode
    operations) and the Normalizer demo that UAX #15 links to (which, btw,
    only works properly for the BMP, which is bad). They both appear to
    use generally the same strategy: perform as many pairwise compositions
    on adjacent characters as possible. I wonder if I'm reading it wrong,
    because if that's how it operates, then one of the examples in the UAX
    wouldn't work properly: NFC(U+017F U+0323 U+0307) = U+1E9B U+0323.
    This composes two non-adjacent characters. Is there any efficient way
    to do this composition without messing up canonical ordering while
    making sure to compose non-adjacent characters like this? It's an edge
    case, I know, but I want my implementation to be correct.
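
    For reference, one way to handle the non-adjacent case is the
    "blocked" rule from the composition step in UAX #15: a combining
    character may compose with the last starter as long as no character
    between them is a starter or has an equal or higher combining class.
    Below is a rough Python sketch of that rule, not a tested
    implementation. It assumes its input is already in NFD, builds the
    pair table from the standard unicodedata module, and uses the
    library's own NFC as a shortcut to detect composition exclusions (a
    real implementation would read CompositionExclusions.txt). The names
    build_composition_table and compose are made up, and Hangul is not
    handled.

```python
import unicodedata

def build_composition_table():
    """Map (first, second) pairs to their primary composite."""
    table = {}
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no mapping, or a compatibility mapping
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        # Shortcut (assumption): if library NFC does not keep ch,
        # treat ch as excluded from composition.
        if unicodedata.normalize("NFC", ch) != ch:
            continue
        first, second = (chr(int(p, 16)) for p in parts)
        table[(first, second)] = ch
    return table

COMPOSITION_TABLE = build_composition_table()

def compose(nfd_text, table=COMPOSITION_TABLE):
    """Compose an NFD string in one left-to-right pass."""
    out = []
    starter_idx = -1  # position in out of the last starter, if any
    last_ccc = 0      # combining class of the last character kept
    for ch in nfd_text:
        ccc = unicodedata.combining(ch)
        if starter_idx >= 0:
            adjacent = len(out) - 1 == starter_idx
            # Input is canonically ordered, so the last kept mark has the
            # highest class of everything between the starter and ch.
            unblocked = adjacent or last_ccc < ccc
            composite = table.get((out[starter_idx], ch))
            if unblocked and composite is not None:
                out[starter_idx] = composite  # merge into the starter
                continue
        if ccc == 0:
            starter_idx = len(out)
        out.append(ch)
        last_ccc = ccc
    return "".join(out)

result = compose("\u017f\u0323\u0307")
print([hex(ord(c)) for c in result])  # ['0x1e9b', '0x323']
```

    Because only the starter is replaced and skipped marks stay where
    they were, the remaining combining characters keep their canonical
    order.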

    In many places, the Unicode standard provides clues for
    implementation, but I see none for NFC (or NFKC) and how to compose
    characters. Can anyone help me?

    Daniel Ehrenberg
