**From:** Philippe Verdy (*verdy_p@wanadoo.fr*)

**Date:** Tue Nov 18 2003 - 16:36:25 EST

**Previous message:**Peter Constable: "RE: Definitions"**In reply to:**Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"**Next in thread:**Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"**Reply:**Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

From: "Markus Scherer" <markus.scherer@jtcsv.com>

*> > Nothing forbids to code each n-ary node as a binary tree for faster
*

*> > searches.
*

*>
*

*> Or, more simply and obviously, do a binary search on the line array in
*

your node if the node

*> contains more than just a few items.
*

A binary search in a vector is implicitly creating such binary tree. That's

what is done in most B-tree implementations, where nodes of variable sizes

are grouped in vectors filling pages of a constant size, and this is

typically used in almost all SQL engines to create indexes.

There are a lot of options to tune how nodes should be encoded, but a Trie

can also be stored in a very fast accessed B-tree, and become quite compact

this way. The algorithms needed to feed the B-tree and then compact it to

its minimal size with a page fill factor of 100% exist in many RDBMS engines

or database toolkits (you may look for example to the various sources that

work on dBase(tm) or other VSAM sorted tables), so during feed of the

dictionnary entries, you may just use the minimum fill factor of 50%.

Another way to compact the dictionnary is to use a LZW-like string lookup

dictionnary, except that you won't limit the size of its internal

dictionnary. At least with this system, you don't need any prior choice of

encoding. This solution can help to keep the Trie stored according to the

language collation rules (the LZW lookup dictionnary is a separate heap

whose internal ordering is unrelated, and which may be kept coded with

UTF-16 code units).

That's true that using SCSU requires that you use a stable compressor to use

it in a Trie. I think it's a design issue of SCSU: there's no standard

compressor in the standard to create a predictable compressed string, that

can be compared. I think that SCSU would have benefited of having a

"reference implementation" whose result can be compared with exact binary

matching. In SCSU, you can't compare two compressed strings without first

decompressing them to some UTF.

But this does not mean that SCSU cannot be used in a Trie to store a lexical

dictionnary. Without it, of course, you can use BOCU, but it will compress

strings less. There's however another option: use a reencoder that will take

the LZW lookup dictionnary initially built from some UTF like UTF-32, and

that will apply a statistic Huffman or Arithmetic encoding of each coded

character. Where the frequency of letters will determine their numeric value

in the generated encoding. You get then a bijective encoding where all

UTF-32 code units are replaced by their Huffman rank, the lowest rank

getting encoded with less bits. Then the Trie will not store directly the

characters, but the bit position in the LZW dictionnary heap or characters

coded with variable lengths of bits. The correspondance between UTF-32 code

units and Huffman/Arithmetic codes will be made by a simple vector of code

units (these sorted by encoding bitsize and statistic rank.)

Finally, the number of code units in that Huffman/Arithmetic correspondance

vector determines the initial offset to apply to encode the compressed LZW

indice of each substring stored in each Trie node.

With this solution, each Trie node stores only one indice, the lowest values

referencing code units in the Huffman/Arithmetic correspondance table (and

the higher values coding substrings stored at the indicated bitposition in

the LZW heap). This allows to apply a variable length encoding for these

indices if needed to compress each node of the Trie, which will finally be

mapped onto the B-Tree.

The correspondance vector of unique UTF-32 code units can also be compacted

to use less bits per code unit if needed (this does not affect the storage

of the LZW heap, or of the Trie in the B-Tree).

**Next message:**Michael Everson: "That alphabetician on radio...."**Previous message:**Peter Constable: "RE: Definitions"**In reply to:**Markus Scherer: "Re: Ternary search trees for Unicode dictionaries"**Next in thread:**Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"**Reply:**Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"**Messages sorted by:**[ date ] [ thread ] [ subject ] [ author ] [ attachment ]**Mail actions:**[ respond to this message ] [ mail a new topic ]

*
This archive was generated by hypermail 2.1.5
: Tue Nov 18 2003 - 17:28:38 EST
*