Re: Ternary search trees for Unicode dictionaries

From: Doug Ewell
Date: Sun Nov 23 2003 - 16:06:25 EST

Next message: Philippe Verdy: "RE: Ternary search trees for Unicode dictionaries"

Previous message: Doug Ewell: "Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)"
In reply to: Philippe Verdy: "RE: Ternary search trees for Unicode dictionaries"
Next in thread: Philippe Verdy: "RE: Ternary search trees for Unicode dictionaries"
Reply: Philippe Verdy: "RE: Ternary search trees for Unicode dictionaries"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> For Korean text, I have found that representation with "defective"
> syllables was performing better through SCSU. I mean here decomposing
> the TLV syllables of the NFC form into T and LV, and TL into T and L,
> i.e. with partial decomposition.
> ...
> With this constraint, Korean is no more acting like Han, and the
> precombined arrangements of LV syllables saves much on the SCSU
> window; gains are also significant for other binary compressors
> like LZW on any UTF scheme, and even with Huffman or Arithmetic coding
> of UTF-16*/UTF-32* schemes.

This seems reasonable, except that you have to transform the text from
its original representation to this special, compression-friendly
format. Data to be compressed will not come pre-packaged in this
partially decomposed form, but will likely be either fully composed
syllables or fully decomposed jamos. So you really have to perform two
layers of transformation, one to prepare the data for compression and
another to actually compress it, and of course you must do the same
thing in reverse to decompress the data.
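
A minimal sketch of the two layers, to make the round trip concrete. It
assumes the input can first be normalized to NFC, splits each precomposed
LVT syllable into an LV syllable plus a trailing-consonant jamo using the
standard Hangul syllable arithmetic, and then hands the result to zlib.
zlib is only a stand-in for whatever compressor is really used (it is not
SCSU), and the function names are hypothetical, chosen for illustration.

```python
import unicodedata
import zlib

S_BASE = 0xAC00   # first precomposed Hangul syllable
S_COUNT = 11172   # number of precomposed Hangul syllables
T_BASE = 0x11A7   # one below the first trailing-consonant jamo (U+11A8)
T_COUNT = 28      # trailing slots per LV block; slot 0 means "no trailing consonant"

def partially_decompose(text):
    """Layer 1: split each LVT syllable into an LV syllable plus a trailing jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        t_index = s_index % T_COUNT
        if 0 <= s_index < S_COUNT and t_index:
            out.append(chr(ord(ch) - t_index))   # the LV syllable
            out.append(chr(T_BASE + t_index))    # the trailing jamo
        else:
            out.append(ch)                       # LV syllables and non-Hangul pass through
    return "".join(out)

def recompose(text):
    """Inverse of layer 1: merge an LV syllable with a following trailing jamo."""
    out = []
    for ch in text:
        t_index = ord(ch) - T_BASE
        if out and 0 < t_index < T_COUNT:
            s_index = ord(out[-1]) - S_BASE
            if 0 <= s_index < S_COUNT and s_index % T_COUNT == 0:
                out[-1] = chr(ord(out[-1]) + t_index)
                continue
        out.append(ch)
    return "".join(out)

def compress(text):
    # Assumption: normalizing to NFC first is acceptable, so that fully
    # decomposed jamo input is handled the same way as composed input.
    prepared = partially_decompose(unicodedata.normalize("NFC", text))
    return zlib.compress(prepared.encode("utf-16-le"))   # layer 2 (stand-in compressor)

def decompress(blob):
    return recompose(zlib.decompress(blob).decode("utf-16-le"))
```

Note that the round trip returns the NFC form of the input, so text that
arrived as fully decomposed jamos comes back composed. Whether that is
acceptable depends on the application.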

This adds complexity, but is sometimes worth the effort. The
Burrows-Wheeler block-sorting approach, for example, achieves very good
results by adding a preprocessing step before "conventional" Huffman or
arithmetic compression.
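
For readers unfamiliar with the transform, here is a naive sketch of the
Burrows-Wheeler step on its own. It sorts all rotations of the input, which
is fine for illustration but far too slow for real data; production
implementations use suffix sorting and typically add a move-to-front stage
before the entropy coder. The sentinel character and function names are
assumptions made for this example.

```python
def bwt(text, sentinel="\0"):
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("input must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, sentinel="\0"):
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    # The row that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

The transform does not shrink the data by itself. It only groups similar
contexts together so that the later Huffman or arithmetic stage has more
skewed statistics to work with, and `inverse_bwt(bwt(text))` gives back the
original text.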

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Sun Nov 23 2003 - 16:42:37 EST