Re: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 19 2003 - 05:13:48 EST

  • Next message: Philippe Verdy: "Re: BOM as WJ?"

    From: "Doug Ewell" <dewell@adelphia.net>
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    > > There's however another option: use a reencoder that will take the LZW
    > > lookup dictionnary initially built from some UTF like UTF-32, and that
    > > will apply a statistic Huffman or Arithmetic encoding of each coded
    > > character...
    >
    > You may be interested in a paper I've written on Unicode compression,
    > due out in a week or so. Basically it seems that virtually all existing
    > Huffman encoders (and probably arithmetic encoders as well) assume 8-bit
    > input, and there is a concern that retargeting such tools to work with
    > 16- or 32-bit Unicode text may have a significant effect on speed. IMHO
    > the compression benefits would be big-time, though.

    The question of processing performance may not be very relevant to create a
    lexical dictionnary, where what is relevant is the final size of the
    dictionnary file. As long as the compressed data can be accessed fast, the
    size of the intermediate tables and the time needed to compute the
    dictionnary structure will not be decisive.

    In fact, almost all lexical dictionnaries I have seen use data compression
    using things like elimination of common prefix/infix substrings, LZW lookup
    heap for substrings, and an Huffman or Arithmetic coding of this heap
    instead of storing directly each character code position or code point.

    As there are a lot of applications of lexical dictionnaries, I think it
    could be a good subproject within ICU, with the necessary APIs to setup a
    new dictionnary with a specified localized collation order, feed the
    dictionnary, store it in a stream, reload it, and perform lookup of entries.
    The dictionnary should be sorted using a UCA-based collation with a
    tailoring appropriate for the language, and should be able to store
    associated data, and be enumerable with browsing methods like nextEntry(),
    previousEntry(), getEntryName(), getEntryValue(), setEntryValue()...

    For now, such dictionnary exists for Thai word breakers, but general text
    processing requires such dictionnaries in almost all languages for spelling
    correction, hyphenation, or grammatical analysis and correction, and
    sometimes for some input method editors (think about the "easy edit" mode of
    GSM phones where you just press one key per letter)...



    This archive was generated by hypermail 2.1.5 : Wed Nov 19 2003 - 06:05:06 EST