Re: Normalization Implementation Tricks

From: Martin v. Löwis (martin@v.loewis.de)
Date: Mon Feb 16 2009 - 03:04:46 CST


    > The most naive implementation would be a table indexed by a starter
    > character and a combining character. Of course that is completely
    > unreasonable as it would require 0x110000 * 0x110000 entries (a few
    > terabytes).
    >
    [...]
    > Are there any other implementation methods that have a small memory
    > footprint (~10-20kb) and quick access (~ 10-20 instructions)? Any
    > guidance in this regard would be appreciated.

    In Python, we observe that there are 366 starter characters and
    52 combining characters. So if a code point is not a starter, or
    the character following it is not a combiner, you can proceed
    without further work. Otherwise, you have to consider the 366*52
    possible combinations. We put them into a trie, with 4956 bytes
    for the index and 8884 bytes for the data.
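
    Here is a minimal sketch of that scheme in Python. The
    three-starter, three-combiner subset and the names STARTERS,
    COMBINERS, and COMPOSE are mine for illustration; the real tables
    are much larger and trie-compressed:

    # A toy subset standing in for the real data; the real lists have
    # 366 starters and 52 combiners.
    STARTERS = [0x0041, 0x0045, 0x004F]    # A, E, O (sorted)
    COMBINERS = [0x0300, 0x0301, 0x0308]   # grave, acute, diaeresis (sorted)

    # Flat 3*3 table addressed by (starter index, combiner index);
    # 0 means "this pair does not compose".
    COMPOSE = [0] * (len(STARTERS) * len(COMBINERS))
    COMPOSE[0 * 3 + 0] = 0x00C0   # A + grave     -> À
    COMPOSE[0 * 3 + 1] = 0x00C1   # A + acute     -> Á
    COMPOSE[2 * 3 + 2] = 0x00D6   # O + diaeresis -> Ö

    def compose(starter, combiner):
        """Return the composed code point, or None if the pair
        does not compose."""
        # Fast path: almost every pair of code points exits here.
        if starter not in STARTERS or combiner not in COMBINERS:
            return None
        s = STARTERS.index(starter)     # linear search, as in the post
        c = COMBINERS.index(combiner)
        return COMPOSE[s * len(COMBINERS) + c] or None

    assert compose(0x0041, 0x0300) == 0x00C0   # A + grave -> À
    assert compose(0x0045, 0x0308) is None     # no entry in the toy table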

    To do a lookup, you need to compute the index of the starter and
    the index of the combiner, which we do with a linear search
    (although a binary search would probably be faster; a trie would
    be faster still).
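
    That binary search is a few lines with the standard bisect
    module; a sketch against the same toy starter list as above
    (starter_index is an illustrative helper, not actual CPython
    code):

    import bisect

    STARTERS = [0x0041, 0x0045, 0x004F]   # sorted toy subset, as above

    def starter_index(cp):
        """Binary-search the sorted starter list; return the index
        of cp, or -1 if cp is not a starter."""
        i = bisect.bisect_left(STARTERS, cp)
        return i if i < len(STARTERS) and STARTERS[i] == cp else -1

    assert starter_index(0x0045) == 1    # E is the second starter
    assert starter_index(0x0042) == -1   # B is not a starter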

    Regards,
    Martin


