Re: Normalization Implementation Tricks

From: verdy_p (verdy_p@wanadoo.fr)
Date: Fri Feb 13 2009 - 12:36:43 CST


    Despite what you are arguing there, you seem not to realise that the normalization tables need not be modified in
    practice. The time spent building the table with multiple complex hash functions does not really matter, but the
    time needed to compute several hashes just to fill the hash table more densely means that performance is hurt in
    the most frequent case: read-only lookups.
    Hash tables are still the best option for scalable tables that are updated and looked up roughly as often as each
    other. But for normalization purposes these tables will typically never need to change, so a linear scan of the
    chain on collision, within a moderately filled table, is still the best option: the cardinality of such a table
    stays constant during read-only lookups, so the frequency of collisions remains low.
    Cuckoo hash tables are only interesting for tables whose average cardinality stays stable but which are updated
    frequently: they are best suited, for example, to symbol lookup tables in dynamic languages. I see no good
    rationale for using them for normalization lookup tables, for which you should need ONE and only ONE quite fast
    hash function, not necessarily a secure one, just something as simple as a congruent multiplication before adding
    each successive term (a sketch of such a read-only chained table follows below).
    Where would Cuckoo hashing be interesting? Certainly not for large lookup tables, and not for very small lookup
    tables either, because the cost of the chained approach stays small and can be kept small by not using too high a
    fill rate for the table (it is acceptable in this case to leave 50% of the buckets empty, when the cardinality is
    small).
    For Unicode normalization, most lookup tables will have a moderate size. The approach based on tries is definitely
    the best, even if the trie compression is quite complex to compute: before generating the compressed trie, it is
    simple to generate a large table with a low fill factor and then compress it independently with a single index
    table that offsets common parts of the table into a compact vector, where the number of chained collisions remains
    the same as before the compression.
    Do you have an application that really demonstrates the need for a Cuckoo approach? I don't think so: the
    normalization tables are not supposed to change dynamically; they are precomputed and loaded once, and change only
    after the release of a new Unicode version, i.e. once or twice a year at most, and most often this update will be
    needed less than once a year!
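
    To illustrate the kind of read-only chained table with a single cheap hash that I describe above, here is a
    minimal sketch in C. The names, the multiplicative constant and the "decomposition index" payload are illustrative
    assumptions, not a specific implementation:

    #include <stdint.h>

    typedef struct Entry {
        uint32_t cp;             /* code point (the key)                         */
        uint32_t decomp_index;   /* hypothetical index into a decomposition pool */
        int32_t  next;           /* next entry in the same bucket, -1 = end      */
    } Entry;

    typedef struct Table {
        const int32_t *buckets;  /* bucket -> first entry index, -1 = empty      */
        const Entry   *entries;  /* all entries, laid out once at build time     */
        uint32_t       nbuckets; /* kept a power of two so masking replaces %    */
    } Table;

    /* ONE simple multiplicative (congruential) hash: multiply, then mask. */
    static inline uint32_t hash_cp(uint32_t cp, uint32_t mask)
    {
        return (cp * 2654435761u) & mask;
    }

    /* Read-only lookup: a single hash, then a short linear walk of the chain. */
    static int32_t lookup(const Table *t, uint32_t cp)
    {
        int32_t i = t->buckets[hash_cp(cp, t->nbuckets - 1)];
        while (i >= 0) {
            if (t->entries[i].cp == cp)
                return (int32_t)t->entries[i].decomp_index;
            i = t->entries[i].next;
        }
        return -1;               /* no mapping for this code point */
    }

    Because the table is built once and never updated, a fill rate well below 100% keeps the chains short without any
    of the extra hash evaluations a Cuckoo scheme would require.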

    Consider this and you will immediately realize that the hash structure should be optimized ONLY for the case of
    read-only lookups. In this application, Cuckoo hash tables are useless: the space you save by getting more compact
    tables is lost by trying to avoid collisions with overly complex hash functions that must be computed for every
    character you look up. Consider instead the simpler approach of plain chaining within a table with a low fill
    rate, and then compress this "large" table using a trie (the only cost is a single indirection, with no additional
    complex hash function to compute). Note also that the trie documented for Unicode does not even use any hash
    function: it can use the code point itself directly, because it is structured in pages/rows and because most
    elements that need to be looked up are grouped within the same reduced set of rows (notably for the Latin script).
    A sketch of that two-stage page/row lookup follows.
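
    For illustration only, such a two-stage page/row lookup might look like the C sketch below; the 256-entry page
    size and the table names are assumptions for the example, not the exact layout of any published Unicode data:

    #include <stdint.h>

    #define PAGE_SHIFT 8u                        /* 256 code points per page/row */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

    /* page_index holds, for each 256-code-point block, the offset of that
     * block's (possibly shared) row inside norm_data.  No hash is computed:
     * the code point's high bits pick the row, its low bits pick the cell. */
    static inline uint16_t norm_lookup(const uint32_t *page_index,
                                       const uint16_t *norm_data,
                                       uint32_t cp)
    {
        uint32_t row = page_index[cp >> PAGE_SHIFT]; /* stage 1: one indirection */
        return norm_data[row + (cp & PAGE_MASK)];    /* stage 2: direct offset   */
    }

    Two array reads per code point, no hash at all, and identical rows (for example the many blocks that carry no
    normalization data) are stored only once.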

    So all I can suggest is that you experiment with your code and compare the lookup time against a chained table
    compressed with a trie: I really doubt that you will get better performance, and I even doubt that your tables
    will be significantly smaller (enough to benefit data locality in the processor's and the bus's data caches).
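
    For completeness, the offline compression step described above (generate one large sparse table, then fold
    duplicate pages together behind a single index table) could be sketched as follows; the array sizes and names are
    illustrative, and only exact duplicate pages are shared here, whereas real builders also overlap page heads and
    tails:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SHIFT 8u
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define MAX_CP     0x110000u                 /* U+0000..U+10FFFF */
    #define NPAGES     (MAX_CP >> PAGE_SHIFT)

    /* Compress the sparse per-code-point array 'full' (MAX_CP entries) into
     * 'data', writing each page's offset into 'index' (NPAGES entries).
     * Returns the number of uint16_t slots actually used in 'data'. */
    static uint32_t compress_pages(const uint16_t *full,
                                   uint32_t *index, uint16_t *data)
    {
        uint32_t used = 0;
        for (uint32_t p = 0; p < NPAGES; p++) {
            const uint16_t *page = full + (p << PAGE_SHIFT);
            uint32_t found = used;
            /* Look for an identical page already emitted. */
            for (uint32_t q = 0; q + PAGE_SIZE <= used; q += PAGE_SIZE) {
                if (memcmp(data + q, page, PAGE_SIZE * sizeof *data) == 0) {
                    found = q;
                    break;
                }
            }
            if (found == used) {                 /* new page: append it */
                memcpy(data + used, page, PAGE_SIZE * sizeof *data);
                used += PAGE_SIZE;
            }
            index[p] = found;
        }
        return used;
    }

    The lookup stays exactly as in the previous sketch, and the number of probes per lookup is unchanged by the
    compression, which is the point made above.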

    > Message of 13/02/09 11:07
    > From: "Jeroen Ruigrok van der Werven"
    > To: unicode@unicode.org
    > Cc:
    > Subject: Re: Normalization Implementation Tricks
    >
    >
    > -On [20090212 13:33], Henrik Theiling (theiling@absint.com) wrote:
    > >I am using a tool for generating a static cookoo hash table. It
    > >produces a compact representation and has really fast access.
    >
    > For people who try to find cookoo, it's actually cuckoo:
    >
    > http://en.wikipedia.org/wiki/Cuckoo_hashing
    >
    > --
    > Jeroen Ruigrok van der Werven / asmodai
    > イェルーン ラウフロック ヴァン デル ウェルヴェン
    > http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
    > Earth to earth, ashes to ashes, dust to dust...
    >
    >
    >


