Re: Normalization Implementation Tricks

From: verdy_p (verdy_p@wanadoo.fr)
Date: Fri Feb 13 2009 - 12:36:43 CST


    Despite what you are arguing there, you seem not to realise that the normalization tables need not be modified in
    practice. The time spent building the table with multiple complex hash functions does not really matter, but the
    time needed to compute several hashes just to fill the hash table more densely means that performance is hurt in
    the most frequent case: read-only lookups.
    Hash tables are still the best option for scalable tables that are updated and looked up roughly as often as each
    other. But for normalization purposes these tables will typically never need to change, so a linear scan of the
    chain on collision, within a moderately filled table, is still the best option: the cardinality of such a table
    stays constant during read-only lookups, so the frequency of collisions remains low.
    Cuckoo hash tables are only interesting for tables whose average cardinality stays stable but which are updated
    frequently: they are best suited, for example, to symbol lookup tables in dynamic languages. I see no good
    rationale for using them for normalization lookup tables, for which you should need ONE and only ONE quite fast
    hash function, not necessarily a secure one, just something as simple as a congruent multiplication before adding
    each successive term (a sketch of such a read-only chained table follows below).
    Where would Cuckoo hashing be interesting? Certainly not for large lookup tables, and not for very small lookup
    tables either, because the cost of the chained approach stays small and can be kept small by not using too high a
    fill rate for the table (it is acceptable in this case to leave 50% of the buckets empty, when the cardinality is
    small).
    For Unicode normalization, most lookup tables will have a moderate size. The approach based on tries is definitely
    the best, even if the trie compression is quite complex to compute: before generating the compressed trie, it is
    simple to generate a large table with a low fill factor and then compress it independently with a single index
    table that offsets common parts of the table into a compact vector, where the number of chained collisions remains
    the same as before the compression.
    Do you have an application that really demonstrates the need for a Cuckoo approach? I don't think so: the
    normalization tables are not supposed to change dynamically; they are precomputed and loaded once, and change only
    after the release of a new Unicode version, i.e. once or twice a year at most, and most often this update will be
    needed less than once a year!
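
    To illustrate the kind of read-only chained table with a single cheap hash that I describe above, here is a
    minimal sketch in C. The names, the multiplicative constant and the "decomposition index" payload are illustrative
    assumptions, not a specific implementation:

    #include <stdint.h>

    typedef struct Entry {
        uint32_t cp;             /* code point (the key)                         */
        uint32_t decomp_index;   /* hypothetical index into a decomposition pool */
        int32_t  next;           /* next entry in the same bucket, -1 = end      */
    } Entry;

    typedef struct Table {
        const int32_t *buckets;  /* bucket -> first entry index, -1 = empty      */
        const Entry   *entries;  /* all entries, laid out once at build time     */
        uint32_t       nbuckets; /* kept a power of two so masking replaces %    */
    } Table;

    /* ONE simple multiplicative (congruential) hash: multiply, then mask. */
    static inline uint32_t hash_cp(uint32_t cp, uint32_t mask)
    {
        return (cp * 2654435761u) & mask;
    }

    /* Read-only lookup: a single hash, then a short linear walk of the chain. */
    static int32_t lookup(const Table *t, uint32_t cp)
    {
        int32_t i = t->buckets[hash_cp(cp, t->nbuckets - 1)];
        while (i >= 0) {
            if (t->entries[i].cp == cp)
                return (int32_t)t->entries[i].decomp_index;
            i = t->entries[i].next;
        }
        return -1;               /* no mapping for this code point */
    }

    Because the table is built once and never updated, a fill rate well below 100% keeps the chains short without any
    of the extra hash evaluations a Cuckoo scheme would require.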

    Consider this and you will immediately realize that the hash structure should be optimized ONLY for the case of
    read-only lookups. In this application, Cuckoo hash tables are useless: the space you save by getting more compact
    tables is lost by trying to avoid collisions with overly complex hash functions that must be computed for every
    character you look up. Consider instead the simpler approach of plain chaining within a table with a low fill
    rate, and then compress this "large" table using a trie (the only cost is a single indirection, with no additional
    complex hash function to compute). Note also that the trie documented for Unicode does not even use any hash
    function: it can use the code point itself directly, because it is structured in pages/rows and because most
    elements that need to be looked up are grouped within the same reduced set of rows (notably for the Latin script).
    A sketch of that two-stage page/row lookup follows.
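
    For illustration only, such a two-stage page/row lookup might look like the C sketch below; the 256-entry page
    size and the table names are assumptions for the example, not the exact layout of any published Unicode data:

    #include <stdint.h>

    #define PAGE_SHIFT 8u                        /* 256 code points per page/row */
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1u)

    /* page_index holds, for each 256-code-point block, the offset of that
     * block's (possibly shared) row inside norm_data.  No hash is computed:
     * the code point's high bits pick the row, its low bits pick the cell. */
    static inline uint16_t norm_lookup(const uint32_t *page_index,
                                       const uint16_t *norm_data,
                                       uint32_t cp)
    {
        uint32_t row = page_index[cp >> PAGE_SHIFT]; /* stage 1: one indirection */
        return norm_data[row + (cp & PAGE_MASK)];    /* stage 2: direct offset   */
    }

    Two array reads per code point, no hash at all, and identical rows (for example the many blocks that carry no
    normalization data) are stored only once.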

    So all I can suggest is that you experiment with your code and compare the lookup time against a chained table
    compressed with a trie: I really doubt that you will get better performance, and I even doubt that your tables
    will be significantly smaller (enough to benefit data locality in the processor's and the bus's data caches).
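
    For completeness, the offline compression step described above (generate one large sparse table, then fold
    duplicate pages together behind a single index table) could be sketched as follows; the array sizes and names are
    illustrative, and only exact duplicate pages are shared here, whereas real builders also overlap page heads and
    tails:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SHIFT 8u
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define MAX_CP     0x110000u                 /* U+0000..U+10FFFF */
    #define NPAGES     (MAX_CP >> PAGE_SHIFT)

    /* Compress the sparse per-code-point array 'full' (MAX_CP entries) into
     * 'data', writing each page's offset into 'index' (NPAGES entries).
     * Returns the number of uint16_t slots actually used in 'data'. */
    static uint32_t compress_pages(const uint16_t *full,
                                   uint32_t *index, uint16_t *data)
    {
        uint32_t used = 0;
        for (uint32_t p = 0; p < NPAGES; p++) {
            const uint16_t *page = full + (p << PAGE_SHIFT);
            uint32_t found = used;
            /* Look for an identical page already emitted. */
            for (uint32_t q = 0; q + PAGE_SIZE <= used; q += PAGE_SIZE) {
                if (memcmp(data + q, page, PAGE_SIZE * sizeof *data) == 0) {
                    found = q;
                    break;
                }
            }
            if (found == used) {                 /* new page: append it */
                memcpy(data + used, page, PAGE_SIZE * sizeof *data);
                used += PAGE_SIZE;
            }
            index[p] = found;
        }
        return used;
    }

    The lookup stays exactly as in the previous sketch, and the number of probes per lookup is unchanged by the
    compression, which is the point made above.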

    > Message of 13/02/09 11:07
    > From: "Jeroen Ruigrok van der Werven"
    > To: unicode@unicode.org
    > Cc:
    > Subject: Re: Normalization Implementation Tricks
    >
    >
    > -On [20090212 13:33], Henrik Theiling (theiling@absint.com) wrote:
    > >I am using a tool for generating a static cookoo hash table. It
    > >produces a compact representation and has really fast access.
    >
    > For people who try to find cookoo, it's actually cuckoo:
    >
    > http://en.wikipedia.org/wiki/Cuckoo_hashing
    >
    > --
    > Jeroen Ruigrok van der Werven / asmodai
    > イェルーン ラウフロック ヴァン デル ウェルヴェン
    > http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
    > Earth to earth, ashes to ashes, dust to dust...
    >
    >
    >


