RE: Ternary search trees for Unicode dictionaries

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Nov 23 2003 - 18:42:06 EST


    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    > on behalf of Doug Ewell
    > Sent: Sunday, November 23, 2003 22:06
    > To: Unicode Mailing List
    > Cc: verdy_p@wanadoo.fr; Jungshik Shin
    > Subject: Re: Ternary search trees for Unicode dictionaries
    >
    >
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >
    > > For Korean text, I have found that representation with "defective"
    > > syllables was performing better through SCSU. I mean here decomposing
    > > the TLV syllables of the NFC form into T and LV, and TL into T and L,
    > > i.e. with partial decomposition.
    > > ...
    > > With this constraint, Korean no longer behaves like Han, and the
    > > precombined arrangement of LV syllables saves a lot of room in the
    > > SCSU window; gains are also significant for other binary compressors
    > > like LZW on any UTF scheme, and even with Huffman or arithmetic coding
    > > of UTF-16*/UTF-32* schemes.
    >
    > This seems reasonable, except that you have to transform the text from
    > its original representation to this special, compression-friendly
    > format.

    OK, this is a transform, but its result is still canonically equivalent
    to the source text. Transforming between canonically equivalent strings
    is safe (at least for Korean Hangul), and that is exactly what any
    normalizer does.

    > Data to be compressed will not come pre-packaged in this
    > partially decomposed form, but will likely be either fully composed
    > syllables or fully decomposed jamos. So you really have to perform two
    > layers of transformation, one to prepare the data for compression and
    > another to actually compress it, and of course you must do the same
    > thing in reverse to decompress the data.
    >
    > This adds complexity, but is sometimes worth the effort.

    It is no more complex than implementing standard normalization. In any
    case, most normalizers handle Hangul syllables algorithmically, so it is
    only a couple of lines to change: add a check for possible TL/TLV
    composition in the NFC normalizer, or decompose TL and TLV but not LV in
    the NFD normalizer.
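
    Here is a minimal sketch of that partial decomposition, using the
    standard Hangul syllable arithmetic from the Unicode Standard
    (SBase = U+AC00, TBase = U+11A7, TCount = 28); the function name and
    the Python wording are mine, not ICU's code:

        # Split precomposed LVT syllables into LV + trailing jamo, leaving
        # plain LV syllables precomposed.  Uses the standard Hangul
        # syllable arithmetic; purely illustrative.
        S_BASE, T_BASE = 0xAC00, 0x11A7
        T_COUNT, S_COUNT = 28, 11172

        def partial_decompose(text):
            out = []
            for ch in text:
                s = ord(ch) - S_BASE
                if 0 <= s < S_COUNT and s % T_COUNT != 0:
                    # LVT syllable: emit its precomposed LV part, then the
                    # trailing consonant as a conjoining jamo.
                    out.append(chr(S_BASE + (s // T_COUNT) * T_COUNT))
                    out.append(chr(T_BASE + s % T_COUNT))
                else:
                    out.append(ch)  # LV syllables and everything else pass through
            return ''.join(out)

    By construction the result is canonically equivalent to the input, so
    normalizing it back to NFC recovers the original syllables.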

    > The Burrows-Wheeler block-sorting approach, for example, achieves
    > very good results by adding a preprocessing step before
    > "conventional" Huffman or arithmetic compression.

    I don't know the details, but I suppose it tries to assign a single code
    to common sequences of code points in order to roughly flatten their
    distribution statistics. The assigned codes would then be sorted by
    frequency, and Huffman or arithmetic coding of these partially flattened
    codes would produce the variable-length codes a bit more accurately than
    if the intermediate codes were left unsorted. The interest is that the
    output of the Huffman/arithmetic pass then contains frequent runs of
    zero bits, so it can be recompressed by a second Huffman/arithmetic pass
    or by an RLE pass.
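
    For reference, the block-sorting step itself just reorders the
    characters of a block: it sorts all rotations of the block and keeps
    the last column, which groups characters that appear in similar
    contexts, so that a move-to-front and RLE pass followed by Huffman or
    arithmetic coding compresses very well. A naive sketch (real
    implementations use suffix arrays instead of materializing the
    rotations):

        # Naive Burrows-Wheeler transform, O(n^2 log n); illustrative only.
        def bwt(block: bytes):
            n = len(block)
            order = sorted(range(n), key=lambda i: block[i:] + block[:i])
            last_column = bytes(block[(i - 1) % n] for i in order)
            primary = order.index(0)  # row of the original block, needed to invert
            return last_column, primary

        # Move-to-front: recently seen bytes get small indices, so the BWT
        # output becomes long runs of small values.
        def mtf(data: bytes):
            alphabet = list(range(256))
            out = []
            for b in data:
                i = alphabet.index(b)
                out.append(i)
                alphabet.insert(0, alphabet.pop(i))
            return out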

    I have also seen compression based on ADPCM coding of audio PCM data
    (i.e. compressing the differences between successive codes), and some
    other techniques of the kind available in SCSU.
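
    Applied to code points rather than audio samples, that difference-coding
    idea is roughly what BOCU-1 builds on. A minimal sketch of just the idea
    (not a real BOCU-1 or ADPCM implementation; the starting state is an
    arbitrary choice of mine):

        # Encode each character as its difference from the previous code
        # point, so text staying within one script yields small values that
        # a byte-oriented entropy coder handles well.
        def delta_encode(text: str):
            prev = 0x40  # arbitrary initial state for this sketch
            deltas = []
            for ch in text:
                cp = ord(ch)
                deltas.append(cp - prev)
                prev = cp
            return deltas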

    It would be interesting to know whether really good SCSU compressors
    have been developed, i.e. how they compare when they choose one coding
    option over another: switching to UTF-16 (Unicode) mode, temporarily
    quoting a UTF-16 unit, keeping the most frequent codes in the current
    8-bit window, switching windows, falling back to the default windows,
    and so on. I'm sure many compressors have been attempted, but not tuned
    far enough to get the best result, so it is still too soon to compare
    the performance of BOCU-1 and SCSU.
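
    As an illustration of the kind of choice such a compressor faces, here
    is one possible greedy heuristic (a sketch of my own, not what ICU
    does). The byte costs come from UTS #6: quoting one character costs 3
    bytes (SQU plus a UTF-16 unit), switching to a window costs a 1-byte
    tag plus 1 byte per character that fits it, and Unicode mode costs a
    1-byte tag plus 2 bytes per character:

        # Compare the rough byte cost of three SCSU strategies for the next
        # run of characters; purely illustrative, not a full SCSU encoder.
        def cheapest_option(run_in_window: int, run_total: int) -> str:
            quote_cost = 3 * run_total                 # SQU each character
            # switch window, 1 byte per char that fits, quote the rest
            switch_cost = 1 + run_in_window + 3 * (run_total - run_in_window)
            unicode_cost = 1 + 2 * run_total           # SCU, then UTF-16 units
            costs = {'quote': quote_cost,
                     'switch-window': switch_cost,
                     'unicode-mode': unicode_cost}
            return min(costs, key=costs.get)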

    After all, the "deflate" compression method in zlib is public, and every
    ZIP tool can inflate deflated files, yet they are not all equal in terms
    of effective compression ratio. Fine-tuning a compressor, trading
    compression-time speed against output size, requires a lot of effort,
    and that is where algorithms are proprietary and often patented, even
    though the decompressor itself is free and mostly identical across
    products, with very few design choices beyond minor optimizations for
    decompression speed.

    So we will have to wait a while until those who developed the best
    tunings for "deflate" try their techniques on SCSU. The current ICU
    implementation is not bad, but after looking at some generated
    sequences, it is not always fully optimal: I can still find shorter
    encodings by hand in examples that are not hard to construct, notably in
    documents mixing languages or scripts. (I have not looked closely at the
    Korean case, but I have looked at the currently excellent results for
    Russian and Greek; some optimizations remain possible for Latin-based
    languages that use more accents and draw relatively more often on the
    Latin Extended letters.)

    Of course, since BOCU-1 and SCSU are defined as encoding schemes, they
    cannot themselves replace the represented Unicode strings with
    canonically equivalent ones. But if a process knows that the strings it
    creates will be normalized when read (or that the reader will ignore
    differences between canonically equivalent strings, something every
    Unicode-based process should tolerate), it could take this into account
    in its BOCU-1 or SCSU encoder by allowing the encoder to change the
    normalization form as long as canonical equivalence is preserved. For
    BOCU-1 this is quite simple, since it is a pipeline of separate steps
    with few options, but in SCSU the normalization form of the input string
    interacts with the compressor as it runs, and so offers the compressor
    more alternatives.
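
    A minimal sketch of that idea, assuming some scsu_encode() function is
    available (a hypothetical name) and reusing partial_decompose() from the
    sketch above: since all the candidate forms are canonically equivalent,
    an encoder that is free to pick among them can simply keep whichever
    output is shortest.

        import unicodedata

        def encode_smallest(text: str, scsu_encode) -> bytes:
            # scsu_encode is a hypothetical callable: str -> bytes.
            nfc = unicodedata.normalize('NFC', text)
            candidates = [
                nfc,                                 # fully composed
                partial_decompose(nfc),              # LV kept, LVT split
                unicodedata.normalize('NFD', text),  # fully decomposed
            ]
            # All candidates are canonically equivalent, so a normalizing
            # reader sees the same content; keep the shortest encoding.
            return min((scsu_encode(c) for c in candidates), key=len)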

    This permitted modification of the text is not destructive even in the
    presence of security signatures (as long as the signatures are computed
    over input data in NFC or NFD form, not over arbitrary forms): the
    signed text can be decoded and renormalized to the form expected by the
    signature before its digest is computed.





