Re: Ternary search trees for Unicode dictionaries

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 23 2003 - 19:46:29 EST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > OK, this is a transform, but it is still canonically equivalent to the
    > source text. Transformations between canonically equivalent strings
    > are safe (at least for Korean Hangul), and this is what any normalizer
    > performs.

    But compressors aren't normally expected to perform normalization too.

    >> This adds complexity, but is sometimes worth the effort.
    >
    > No more complex than implementing the standard normalization. In any
    > case, most normalizers handle Hangul syllable normalization
    > algorithmically, so it's only a couple of lines to modify: add a check
    > for possible LV/LVT composition in the NFC normalizer, or decompose
    > LVT into LV + T (but not LV into L + V) in the NFD normalizer.

    Compressors aren't normally expected to perform normalization, even if
    there is a simple algorithmic way to do it.
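
    For reference, the "simple algorithmic way" is the arithmetic Hangul
    composition from chapter 3 of the standard. A minimal sketch in Python,
    with the standard constants and no error handling:

        # Hangul composition constants from the Unicode Standard, chapter 3.
        S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
        L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

        def compose_lv(l, v):
            """Compose a leading consonant and a vowel into an LV syllable."""
            return S_BASE + ((l - L_BASE) * V_COUNT + (v - V_BASE)) * T_COUNT

        def add_trailing(lv, t):
            """Attach a trailing consonant to an LV syllable, giving LVT."""
            return lv + (t - T_BASE)

        # U+1100 U+1161 U+11A8 composes to U+AC01, HANGUL SYLLABLE GAG.
        lv = compose_lv(0x1100, 0x1161)        # U+AC00
        print(hex(add_trailing(lv, 0x11A8)))   # 0xac01

    Whether a compressor should be doing even that much is the question.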

    >> The Burrows-Wheeler block-sorting approach, for example, achieves
    >> very good results by adding a preprocessing step before
    >> "conventional" Huffman or arithmetic compression.
    >
    > I don't know what it is, but I think it will try to create a unique
    > code for common sequences of code points, in order to approximately
    > flatten their distribution statistics. The assigned codes are then
    > sorted by frequency, and Huffman/arithmetic encoding of these
    > partially flattened codes creates the variable-length codes with a
    > bit more accuracy than if the intermediate codes were all unsorted.
    > The interest is that the result of Huffman/arithmetic encoding will
    > contain sequences of frequent bits set to 0, so that it is
    > recompressible by a second Huffman/arithmetic pass, or by an RLE
    > pass.

    You're right, you don't know what it is. :-)

    See ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz for
    the actual description.
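
    For anyone who doesn't want to wade through the PostScript: the
    block-sorting transform doesn't assign codes at all. It sorts all
    rotations of a block and keeps the last column (plus the index of the
    original rotation), which groups characters with similar contexts
    together so that a later move-to-front and Huffman/arithmetic stage
    compresses them well. A toy sketch (real implementations use suffix
    sorting, not this quadratic rotation sort):

        def bwt(block: bytes):
            """Naive Burrows-Wheeler transform: sort rotations, keep last column."""
            n = len(block)
            rotations = sorted(range(n), key=lambda i: block[i:] + block[:i])
            last_column = bytes(block[(i - 1) % n] for i in rotations)
            original_index = rotations.index(0)   # needed to invert the transform
            return last_column, original_index

        print(bwt(b"banana"))   # (b'nnbaaa', 3)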

    > It would be interesting to know whether good SCSU compressors have
    > been developed, i.e. how they compare when they choose one coding
    > option over another: switching to UTF-16, temporarily escaping to
    > UTF-16, managing the most frequent codes in the 8-bit codepage,
    > switching codepages, resetting to the default codepages...

    I think you mean "windows" here. SCSU does not deal with legacy code
    pages the way, say, ISO 2022 does.
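
    Whether anyone has written a truly optimizing encoder I don't know, but
    the byte costs of the main choices are fixed by UTS #6, so a smarter
    encoder largely comes down to looking ahead and comparing them. A rough
    decision sketch, not an encoder, with costs per UTS #6 (the helper
    function and parameter names are mine, purely for illustration):

        def run_costs(n_chars, setup_bytes=2):
            """Rough byte costs for a run of n_chars BMP characters that
            interrupts text in another window, so single-byte mode must also
            switch back afterwards.  setup_bytes: 1 for an SCn change to a
            window that already fits, 2 for an SDn redefinition."""
            return {
                "define/select a window":  setup_bytes + n_chars + 1,  # setup, 1 byte/char, SCn back
                "quote each char (SQU)":   3 * n_chars,                # tag + 2 bytes per char
                "Unicode mode (SCU/UCn)":  1 + 2 * n_chars + 1,        # switch, 2 bytes/char, switch back
            }

        for n in (1, 2, 10):
            costs = run_costs(n)
            print(n, min(costs, key=costs.get), costs)
        # An isolated character is cheapest quoted; a run of two or more
        # already pays for defining or selecting a window.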

    > I'm sure that many compressors have been attempted, but not fully
    > tuned to get the best results. So it's rather too soon to compare
    > the performance of BOCU-1 and SCSU.

    I know one person who's looking forward to reading my paper.

    > After all, the "deflate" compression method in zlib is public, and all
    > ZIP tools can inflate deflated files, but they are not all equal in
    > terms of effective compression ratio. The fine tuning of compressors,
    > trading processing speed at compression time against result size,
    > requires a lot of effort, and this is where algorithms are proprietary
    > and often patented, even though the decompressor itself is free and
    > mostly identical between all products, with very few design choices
    > beyond minor optimizations for speed of the decompression process.
    >
    > So we should wait for some time until those who developed the best
    > tunings for "deflate" try their techniques on SCSU. The current
    > ICU implementation is not too bad, but after looking at some generated
    > sequences, it is not always fully optimal, and I can still find
    > shorter codes manually in some examples that are not too hard to find,
    > notably in documents using mixed languages or scripts. (I have not
    > looked closely at the case of Korean, but I have looked at the
    > currently excellent results for Russian and Greek; some optimizations
    > are still possible for Latin-based languages that use a larger number
    > of accents and use the Latin Extended letters relatively more often.)

    Hoo boy, do I ever know someone who's looking forward to reading my
    paper.
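
    The deflate analogy is fair as far as it goes, though: every inflater
    reads the same bitstream, but compressors can spend very different
    amounts of effort producing it. You can see this without leaving
    Python's standard library (sizes will vary with the input, of course):

        import zlib

        # The same data at the fastest and most thorough settings.  Any
        # inflater decodes both streams; only the sizes (and the time the
        # compressor spent) differ.
        data = b"canonical equivalence " * 500
        fast = zlib.compress(data, 1)   # level 1: fastest, least effort
        best = zlib.compress(data, 9)   # level 9: slowest, best ratio
        assert zlib.decompress(fast) == zlib.decompress(best) == data
        print(len(data), len(fast), len(best))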

    > Of course, as BOCU-1 and SCSU are considered encoding schemes, they
    > cannot themselves alter the represented Unicode strings into
    > canonically equivalent ones. But I'm quite sure that if a process
    > wishes to create strings that it knows will be normalized when read
    > (or whose reader will ignore differences between canonically
    > equivalent strings, something that all Unicode-based processes should
    > respect), that process could take this into account within its BOCU-1
    > or SCSU encoder by allowing it to change the normalization form as
    > long as canonical equivalence is kept. For BOCU-1 this is quite
    > simple, as it is a pipeline of separate tasks with few options, but
    > in SCSU the normalization form of the input string interacts with the
    > compressor during its run and offers it more alternatives.

    Changing a sequence S to a canonically equivalent sequence S' is
    explicitly allowed by conformance clause C10, but I would be surprised
    if any designers or implementers of Unicode compression schemes have
    considered such a thing. Does this violate users' expectations of what
    "lossless" compression means? This might be a topic all by itself.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


