Re: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 19 2003 - 05:13:48 EST

Next message: Philippe Verdy: "Re: BOM as WJ?"

Previous message: Pim Blokland: "BOM as WJ?"
In reply to: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Reply: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

From: "Doug Ewell" <dewell@adelphia.net>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> > There's however another option: use a reencoder that will take the LZW
> > lookup dictionnary initially built from some UTF like UTF-32, and that
> > will apply a statistic Huffman or Arithmetic encoding of each coded
> > character...
>
> You may be interested in a paper I've written on Unicode compression,
> due out in a week or so. Basically it seems that virtually all existing
> Huffman encoders (and probably arithmetic encoders as well) assume 8-bit
> input, and there is a concern that retargeting such tools to work with
> 16- or 32-bit Unicode text may have a significant effect on speed. IMHO
> the compression benefits would be big-time, though.

The question of processing performance may not be very relevant to create a
lexical dictionnary, where what is relevant is the final size of the
dictionnary file. As long as the compressed data can be accessed fast, the
size of the intermediate tables and the time needed to compute the
dictionnary structure will not be decisive.

In fact, almost all lexical dictionnaries I have seen use data compression
using things like elimination of common prefix/infix substrings, LZW lookup
heap for substrings, and an Huffman or Arithmetic coding of this heap
instead of storing directly each character code position or code point.

As there are a lot of applications of lexical dictionnaries, I think it
could be a good subproject within ICU, with the necessary APIs to setup a
new dictionnary with a specified localized collation order, feed the
dictionnary, store it in a stream, reload it, and perform lookup of entries.
The dictionnary should be sorted using a UCA-based collation with a
tailoring appropriate for the language, and should be able to store
associated data, and be enumerable with browsing methods like nextEntry(),
previousEntry(), getEntryName(), getEntryValue(), setEntryValue()...

For now, such dictionnary exists for Thai word breakers, but general text
processing requires such dictionnaries in almost all languages for spelling
correction, hyphenation, or grammatical analysis and correction, and
sometimes for some input method editors (think about the "easy edit" mode of
GSM phones where you just press one key per letter)...

Next message: Philippe Verdy: "Re: BOM as WJ?"
Previous message: Pim Blokland: "BOM as WJ?"
In reply to: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Reply: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 06:05:06 EST