Re: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 18 2003 - 16:36:25 EST

  • Next message: Michael Everson: "That alphabetician on radio...."

    From: "Markus Scherer" <markus.scherer@jtcsv.com>
    > > Nothing forbids to code each n-ary node as a binary tree for faster
    > > searches.
    >
    > Or, more simply and obviously, do a binary search on the line array in
    your node if the node
    > contains more than just a few items.

    A binary search in a vector is implicitly creating such binary tree. That's
    what is done in most B-tree implementations, where nodes of variable sizes
    are grouped in vectors filling pages of a constant size, and this is
    typically used in almost all SQL engines to create indexes.

    There are a lot of options to tune how nodes should be encoded, but a Trie
    can also be stored in a very fast accessed B-tree, and become quite compact
    this way. The algorithms needed to feed the B-tree and then compact it to
    its minimal size with a page fill factor of 100% exist in many RDBMS engines
    or database toolkits (you may look for example to the various sources that
    work on dBase(tm) or other VSAM sorted tables), so during feed of the
    dictionnary entries, you may just use the minimum fill factor of 50%.

    Another way to compact the dictionnary is to use a LZW-like string lookup
    dictionnary, except that you won't limit the size of its internal
    dictionnary. At least with this system, you don't need any prior choice of
    encoding. This solution can help to keep the Trie stored according to the
    language collation rules (the LZW lookup dictionnary is a separate heap
    whose internal ordering is unrelated, and which may be kept coded with
    UTF-16 code units).

    That's true that using SCSU requires that you use a stable compressor to use
    it in a Trie. I think it's a design issue of SCSU: there's no standard
    compressor in the standard to create a predictable compressed string, that
    can be compared. I think that SCSU would have benefited of having a
    "reference implementation" whose result can be compared with exact binary
    matching. In SCSU, you can't compare two compressed strings without first
    decompressing them to some UTF.

    But this does not mean that SCSU cannot be used in a Trie to store a lexical
    dictionnary. Without it, of course, you can use BOCU, but it will compress
    strings less. There's however another option: use a reencoder that will take
    the LZW lookup dictionnary initially built from some UTF like UTF-32, and
    that will apply a statistic Huffman or Arithmetic encoding of each coded
    character. Where the frequency of letters will determine their numeric value
    in the generated encoding. You get then a bijective encoding where all
    UTF-32 code units are replaced by their Huffman rank, the lowest rank
    getting encoded with less bits. Then the Trie will not store directly the
    characters, but the bit position in the LZW dictionnary heap or characters
    coded with variable lengths of bits. The correspondance between UTF-32 code
    units and Huffman/Arithmetic codes will be made by a simple vector of code
    units (these sorted by encoding bitsize and statistic rank.)
    Finally, the number of code units in that Huffman/Arithmetic correspondance
    vector determines the initial offset to apply to encode the compressed LZW
    indice of each substring stored in each Trie node.

    With this solution, each Trie node stores only one indice, the lowest values
    referencing code units in the Huffman/Arithmetic correspondance table (and
    the higher values coding substrings stored at the indicated bitposition in
    the LZW heap). This allows to apply a variable length encoding for these
    indices if needed to compress each node of the Trie, which will finally be
    mapped onto the B-Tree.

    The correspondance vector of unique UTF-32 code units can also be compacted
    to use less bits per code unit if needed (this does not affect the storage
    of the LZW heap, or of the Trie in the B-Tree).



    This archive was generated by hypermail 2.1.5 : Tue Nov 18 2003 - 17:28:38 EST