Re: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 18 2003 - 16:36:25 EST

Next message: Michael Everson: "That alphabetician on radio...."

Previous message: Peter Constable: "RE: Definitions"
In reply to: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Reply: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "Markus Scherer" <markus.scherer@jtcsv.com>
> > Nothing forbids to code each n-ary node as a binary tree for faster
> > searches.
>
> Or, more simply and obviously, do a binary search on the line array in
your node if the node
> contains more than just a few items.

A binary search in a vector is implicitly creating such binary tree. That's
what is done in most B-tree implementations, where nodes of variable sizes
are grouped in vectors filling pages of a constant size, and this is
typically used in almost all SQL engines to create indexes.

There are a lot of options to tune how nodes should be encoded, but a Trie
can also be stored in a very fast accessed B-tree, and become quite compact
this way. The algorithms needed to feed the B-tree and then compact it to
its minimal size with a page fill factor of 100% exist in many RDBMS engines
or database toolkits (you may look for example to the various sources that
work on dBase(tm) or other VSAM sorted tables), so during feed of the
dictionnary entries, you may just use the minimum fill factor of 50%.

Another way to compact the dictionnary is to use a LZW-like string lookup
dictionnary, except that you won't limit the size of its internal
dictionnary. At least with this system, you don't need any prior choice of
encoding. This solution can help to keep the Trie stored according to the
language collation rules (the LZW lookup dictionnary is a separate heap
whose internal ordering is unrelated, and which may be kept coded with
UTF-16 code units).

That's true that using SCSU requires that you use a stable compressor to use
it in a Trie. I think it's a design issue of SCSU: there's no standard
compressor in the standard to create a predictable compressed string, that
can be compared. I think that SCSU would have benefited of having a
"reference implementation" whose result can be compared with exact binary
matching. In SCSU, you can't compare two compressed strings without first
decompressing them to some UTF.

But this does not mean that SCSU cannot be used in a Trie to store a lexical
dictionnary. Without it, of course, you can use BOCU, but it will compress
strings less. There's however another option: use a reencoder that will take
the LZW lookup dictionnary initially built from some UTF like UTF-32, and
that will apply a statistic Huffman or Arithmetic encoding of each coded
character. Where the frequency of letters will determine their numeric value
in the generated encoding. You get then a bijective encoding where all
UTF-32 code units are replaced by their Huffman rank, the lowest rank
getting encoded with less bits. Then the Trie will not store directly the
characters, but the bit position in the LZW dictionnary heap or characters
coded with variable lengths of bits. The correspondance between UTF-32 code
units and Huffman/Arithmetic codes will be made by a simple vector of code
units (these sorted by encoding bitsize and statistic rank.)
Finally, the number of code units in that Huffman/Arithmetic correspondance
vector determines the initial offset to apply to encode the compressed LZW
indice of each substring stored in each Trie node.

With this solution, each Trie node stores only one indice, the lowest values
referencing code units in the Huffman/Arithmetic correspondance table (and
the higher values coding substrings stored at the indicated bitposition in
the LZW heap). This allows to apply a variable length encoding for these
indices if needed to compress each node of the Trie, which will finally be
mapped onto the B-Tree.

The correspondance vector of unique UTF-32 code units can also be compacted
to use less bits per code unit if needed (this does not affect the storage
of the LZW heap, or of the Trie in the B-Tree).

Next message: Michael Everson: "That alphabetician on radio...."
Previous message: Peter Constable: "RE: Definitions"
In reply to: Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Reply: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 17:28:38 EST