Normalization Implementation Tricks

From: Michael D. Adams (mdmkolbe@gmail.com)
Date: Wed Feb 11 2009 - 20:33:20 CST

    How do people efficiently implement the (re-)composition table used by
    the normalization algorithm for NFC and NFKC? (I am writing a
    library for a project.)

    The most naive implementation would be a table indexed by a starter
    character and a combining character. Of course that is completely
    unreasonable as it would require 0x110000 * 0x110000 entries (a few
    terabytes).
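    A back-of-the-envelope check of that claim (assuming one 4-byte
    result code point per table entry):

```python
# Size of the naive two-dimensional composition table:
# one entry per (starter, combiner) pair over the full code space.
CODE_SPACE = 0x110000              # number of Unicode code points
entries = CODE_SPACE * CODE_SPACE  # ~1.24e12 pairs
size_bytes = entries * 4           # 4 bytes per composed result
size_tib = size_bytes / 2**40
print(round(size_tib, 1))          # roughly 4.5 TiB
```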

    If I understand right, the ICU library uses shared tries (as the
    Unicode spec suggests) indexed by the starter character, pointing to
    lists of (combining character, result) pairs (an association list in
    Lisp/Scheme terminology). This should reduce the size requirements,
    but now there is a list we have to scan through, which could increase
    the run-time access cost.
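    A toy sketch of that pair-list idea (the two entries below are real
    canonical compositions, but the table itself is hypothetical; a real
    one would be built from UnicodeData.txt):

```python
# Map each starter code point to an association list of
# (combining code point, composed code point) pairs.
compose_table = {
    0x0041: [(0x0300, 0x00C0), (0x0301, 0x00C1)],  # A + grave -> À, A + acute -> Á
    0x004F: [(0x0302, 0x00D4)],                    # O + circumflex -> Ô
}

def compose(starter, combiner):
    # Linear scan of the starter's pair list -- the run-time cost
    # in question.  Returns None if the pair does not compose.
    for c, result in compose_table.get(starter, ()):
        if c == combiner:
            return result
    return None

print(hex(compose(0x0041, 0x0300)))  # 0xc0, i.e. 'À'
```

    Sorting each pair list by combining code point and binary-searching
    it would bound the scan; in practice the lists are short.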

    Are there any other implementation methods that have a small memory
    footprint (~10-20 kB) and quick access (~10-20 instructions)? Any
    guidance in this regard would be appreciated.

    Michael D. Adams
    mdmkolbe@gmail.com



    This archive was generated by hypermail 2.1.5 : Wed Feb 11 2009 - 20:37:42 CST