Re: Ternary search trees for Unicode dictionaries

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 23 2003 - 16:06:25 EST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > For Korean text, I have found that representation with "defective"
    > syllables was performing better through SCSU. I mean here decomposing
    > the TLV syllables of the NFC form into T and LV, and TL into T and L,
    > i.e. with partial decomposition.
    > ...
    > With this constraint, Korean is no more acting like Han, and the
    > precombined arrangements of LV syllables saves much on the SCSU
    > window; gains are also significant for other binary compressors
    > like LZW on any UTF scheme, and even with Huffman or Arithmetic coding
    > of UTF-16*/UTF-32* schemes.

    This seems reasonable, except that you have to transform the text from
    its original representation to this special, compression-friendly
    format. Data to be compressed will not come pre-packaged in this
    partially decomposed form, but will likely be either fully composed
    syllables or fully decomposed jamos. So you really have to perform two
    layers of transformation, one to prepare the data for compression and
    another to actually compress it, and of course you must do the same
    thing in reverse to decompress the data.
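
To make the two layers concrete, here is a minimal sketch in Python. It assumes the input is NFC Hangul, splits each LVT syllable into an LV syllable plus a trailing jamo using the standard Hangul syllable arithmetic, and then hands the result to zlib. zlib is only a stand-in for whatever compressor is really used (it is not SCSU), and the function names are made up for illustration.

```python
import zlib

# Constants from the Unicode Hangul syllable composition arithmetic.
S_BASE = 0xAC00    # first precomposed syllable
T_BASE = 0x11A7    # one below the first trailing-consonant jamo
T_COUNT = 28       # trailing slots per LV pair, including "no trailing"
S_COUNT = 11172    # 19 * 21 * 28 precomposed syllables


def partially_decompose(text: str) -> str:
    """Layer 1: split each LVT syllable into an LV syllable plus a T jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        t_index = s_index % T_COUNT
        if 0 <= s_index < S_COUNT and t_index != 0:
            out.append(chr(ord(ch) - t_index))  # the LV syllable
            out.append(chr(T_BASE + t_index))   # the trailing jamo
        else:
            out.append(ch)                      # LV syllable or non-Hangul
    return "".join(out)


def recompose(text: str) -> str:
    """Inverse of layer 1: merge an LV syllable with a following T jamo.

    Assumption: the original text was NFC, so every LV + T pair seen here
    was produced by partially_decompose().
    """
    out = []
    for ch in text:
        t_index = ord(ch) - T_BASE
        if out and 0 < t_index < T_COUNT:
            s_index = ord(out[-1]) - S_BASE
            if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0:
                out[-1] = chr(ord(out[-1]) + t_index)
                continue
        out.append(ch)
    return "".join(out)


def pack(text: str) -> bytes:
    """Layer 1 (prepare), then layer 2 (compress; zlib as a stand-in)."""
    return zlib.compress(partially_decompose(text).encode("utf-16-le"))


def unpack(blob: bytes) -> str:
    """The same two layers, in reverse order."""
    return recompose(zlib.decompress(blob).decode("utf-16-le"))
```

Whether the extra layer pays off depends on the compressor and on the text, so the sketch shows only the plumbing and makes no claim about the resulting sizes.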

    This adds complexity, but is sometimes worth the effort. The
    Burrows-Wheeler block-sorting approach, for example, achieves very good
    results by adding a preprocessing step before "conventional" Huffman or
    arithmetic compression.
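
For readers who have not seen it, the block-sorting preprocessing step can be sketched as follows. This is a naive illustration only: it sorts all rotations explicitly, which real implementations avoid, and the end-of-string marker "\0" is an assumption of this sketch. The output is a permutation of the input in which like characters tend to cluster, which is what the later coding stages exploit.

```python
def bwt(s: str, eos: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if eos in s:
        raise ValueError("input must not contain the end-of-string marker")
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, eos: str = "\0") -> str:
    """Undo bwt() by rebuilding the sorted rotation table column by column."""
    table = [""] * len(last)
    for _ in range(len(last)):
        # Prepend the last column to every row, then re-sort.
        table = sorted(c + row for c, row in zip(last, table))
    # The original string is the one rotation that ends with the marker.
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]
```

The transform itself saves nothing; like the partial decomposition above, it only rearranges the data so that the following compression step does better.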

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


