RE: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Nov 23 2003 - 18:42:06 EST

Next message: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"

Previous message: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
In reply to: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Reply: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

> De : unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]De la
> part de Doug Ewell
> Envoyé : dimanche 23 novembre 2003 22:06
> À : Unicode Mailing List
> Cc : verdy_p@wanadoo.fr; Jungshik Shin
> Objet : Re: Ternary search trees for Unicode dictionaries
>
>
> Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
>
> > For Korean text, I have found that representation with "defective"
> > syllables was performing better through SCSU. I mean here decomposing
> > the TLV syllables of the NFC form into T and LV, and TL into T and L,
> > i.e. with partial decomposition.
> > ...
> > With this constraint, Korean is no more acting like Han, and the
> > precombined arrangements of LV syllables saves much on the SCSU
> > window; gains are also significant for for other binary compressors
> > like LZW on any UTF scheme, and even with Huffman or Arithmetic coding
> > of UTF-16*/UTF-32* schemes.
>
> This seems reasonable, except that you have to transform the text from
> its original representation to this special, compression-friendly
> format.

OK, this is a transform, but it is still canonically equivalent to the
source text. Transformations between canonical equivalent strings is safe
(at least for Korean Hangul), and this is what any normalizer performs.

> Data to be compressed will not come pre-packaged in this
> partially decomposed form, but will likely be either fully composed
> syllables or fully decomposed jamos. So you really have to perform two
> layers of transformation, one to prepare the data for compression and
> another to actually compress it, and of course you must do the same
> thing in reverse to decompress the data.
>
> This adds complexity, but is sometimes worth the effort.

Not more complex than implementing the standard normalization. In any case,
most normalizers perform Hangul syllable normalization algorithmically, so
it's a couple of lines to modify to add a check for possible TL/TLV
composition in the NFC normalizer, or to decompose TL and TLV but not LV in
the NFD normalizer.

> The Burrows-Wheeler block-sorting approach, for example, achieves very
good results by adding a preprocessing step before "conventional" Huffman or
arithmetic compression.

I don't know what it is but I think it will try to create a unique code from
common sequences of codepoints, in order to flatten approximately their
distribution statistics. The assigned codes are then sorted by frequency,
from where Huffman/Arithmetic encoding of these partially flattened codes
will create the variable length codes, with a bit more accuracy than if the
intermediate codes were all unsorted, the interest being that the result of
Huffman/Arithmetic encoding will contain sequences of frequent bits set to
0, so that it is recompressible by a second Huffman/Arithmetic pass, or by a
RLE pass.

I have also seen compression based on ADPCM coding for audio PCM modulation
data (i.e. compressing differences of codes), and some other technics
accessible in SCSU.

It would be interesting to know if good SCSU compressors have been
developed, i.e. how they compare when they choose one option or the other
for the coding, between switch to UTF-16, temporary escape to UTF-16,
management of the most frequent codes in the 8-bit codepage, switch of
codepages, reset to default codepages... I'm sure that many compressors have
been attempted, but not fully experimented to get the best result. So it's
quite soon to compare the performance of BOCU-1 and SCSU.

After all, the "deflate" compression method in zlib is public, and all ZIP
tools can defalte deflated files, but they are not all equal in terms of
effective compression ratio (the fine tuning of compressors between
processing speed at compression time and result size requires a lot of
efforts, and this is where algorithms are proprietary and often patented,
despite the decompressor itself is free and mostly identical between all
products, with very few design choices except minor optimizatrions for speed
of the decompression process).

So we should wait for some time until those who developed the best tunings
for "deflate" experiment their technics on SCSU. The current ICU
implementation is not too bad, but after looking at some generated
sequences, it is not always fully optimal, and I can still find shorter
codes manually in some examples not too hard to find, notably in documents
using mixed languages or scripts (I have not looked precisely on the case of
Korean, but I have looked at the currently excellent result for Russian and
Greek, with still some optimizations possible for Latin based languages that
use a higher number of accents and use relatively more frequently the Latin
Extended letters).

Of course, as BOCU-1 and SCSU are considered as encoding schemes, they
cannot themselves alter the represented Unicode strings into canonically
equivalent ones. But I'm quite sure that if a process wish to create strings
that it knows they will be normalized when read (or the reader will ignore
differences of canonically equivalent strings, something that all
Unicode-based processes should respect), that process could take it into
account within its BOCU-1 or SCSU encoder by allowing them to change the
normalization form as long as canonical equivalence is kept. For BOCU-1 this
is quite simple as this is a pipeline with separate tasks with little
options, but in SCSU the normalization form of the input string interacts
with the SCSU-1 compressor during its run, and offers more alternatives to
the compressor.

This allowed modification of text is not destructive even in the case of
security signatures (as long as these signatures are created on input data
in NFC or NFD form, and not on free forms): the signed text may be decoded
and renormalized to the form expected by the signature, before computing its
digest.

__________________________________________________________________
<< ella for Spam Control >> has removed Spam messages and set aside
Newsletters for me
You can use it too - and it's FREE! http://www.ellaforspam.com

application/ms-tnef attachment: winmail.dat

Next message: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Previous message: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
In reply to: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Next in thread: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Reply: Doug Ewell: "Re: Ternary search trees for Unicode dictionaries"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]
Mail actions: [ respond to this message ] [ mail a new topic ]

This archive was generated by hypermail 2.1.5 : Sun Nov 23 2003 - 19:34:26 EST