Re: Compression through normalization

From: Peter Kirk
Date: Mon Nov 24 2003 - 13:06:23 EST

    On 24/11/2003 07:52, Mark E. Shoulson wrote:

    > On 11/24/03 01:26, Doug Ewell wrote:
    >> So the question becomes: Is it legitimate for a Unicode compression
    >> engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
    >> another (canonically equivalent) normalization form to improve its
    >> compressibility?
    > OK, this *is* a fascinating question. ...


    It seems to me that there is some kind of mixing of levels here. At one
    level, we have a text which consists of a string of Unicode characters,
    and this is the string which can be normalised or denormalised (or in
    fact put through any transformation preserving canonical equivalence)
    at will. At a lower level, we have a sequence of bytes or other code
    units in a Unicode encoding form. And at a still lower level we have a
    sequence of bytes, which, at
    this level, have no known interpretation. And it is surely at this level
    that lossless compression should operate. Now such a compression scheme
    may receive and use information from a higher level that the byte stream
    is in a particular encoding form of Unicode, and may make use of that
    information as a hint. But it should take this as nothing more than a
    hint, not necessarily reliable, and preserve the byte stream through
    compression and decompression.
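
    To make the distinction between levels concrete, here is a minimal
    sketch in Python. It assumes UTF-8 as the encoding form and uses zlib
    as a stand-in for a byte-level compressor (it is not SCSU or BOCU-1).
    The function normalizing_compress is hypothetical, invented only to
    show what happens when a compressor treats canonical equivalence as
    licence to rewrite the text.

```python
import unicodedata
import zlib

# One Hangul syllable, U+D55C, in precomposed form (NFC).
text_nfc = "\ud55c"
# The canonically equivalent sequence of conjoining jamo (NFD).
text_nfd = unicodedata.normalize("NFD", text_nfc)

# Same text at the character level, different bytes at the byte level.
bytes_nfc = text_nfc.encode("utf-8")
bytes_nfd = text_nfd.encode("utf-8")


def compress_bytes(data: bytes) -> bytes:
    """Byte-level compression: no assumption about what the bytes mean."""
    return zlib.compress(data)


def decompress_bytes(blob: bytes) -> bytes:
    return zlib.decompress(blob)


def normalizing_compress(data: bytes) -> bytes:
    """Hypothetical compressor that trusts the 'this is UTF-8' hint and
    converts the text to NFC before compressing it."""
    text = data.decode("utf-8")
    return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))


# The byte-level compressor returns exactly the bytes it was given.
round_trip = decompress_bytes(compress_bytes(bytes_nfd))

# The normalizing compressor returns canonically equivalent text,
# but not the original byte stream.
altered = decompress_bytes(normalizing_compress(bytes_nfd))

# Bytes that are not valid UTF-8 still survive the byte-level compressor,
# which is why the encoding hint need not be reliable.
opaque = b"\xff\x00\xfe"
opaque_round_trip = decompress_bytes(compress_bytes(opaque))
```

    Here round_trip equals bytes_nfd, while altered equals bytes_nfc. The
    hypothetical normalizing_compress would also raise an error on the
    opaque input, since it relies on the hint being true.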

    If conformance clause C10 is taken to apply at all levels, this
    makes a nonsense of the concept of normalisation stability within
    databases etc. If a low-level process is permitted to make any
    canonically equivalent transformation, then there can be no guarantee
    that data which is stored in a particular normalisation form is
    retrievable in that same normalisation form, since a low-level
    compression or other process may have transformed the data on the disk
    or tape, or on its way to or from it.
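
    The database concern can be sketched in the same way. The class
    TransformingStore below is hypothetical: it stands for any low-level
    layer that feels free to recompose text to NFC before writing it. The
    helper is_in_form is likewise an illustrative name, not a library
    function.

```python
import unicodedata


def is_in_form(form: str, text: str) -> bool:
    """True if text is already in the given normalisation form."""
    return unicodedata.normalize(form, text) == text


class TransformingStore:
    """Hypothetical storage layer that recomposes text to NFC on write."""

    def __init__(self) -> None:
        self._disk: dict[str, bytes] = {}

    def put(self, key: str, text: str) -> None:
        # The canonically equivalent transformation happens silently here.
        recomposed = unicodedata.normalize("NFC", text)
        self._disk[key] = recomposed.encode("utf-8")

    def get(self, key: str) -> str:
        return self._disk[key].decode("utf-8")


# The application deliberately stores its data in NFD.
stored = unicodedata.normalize("NFD", "\ud55c\uae00")

store = TransformingStore()
store.put("word", stored)
retrieved = store.get("word")

# The retrieved text is canonically equivalent to what was stored,
# but it is no longer in the form the application relied on.
still_nfd = is_in_form("NFD", retrieved)
```

    After this runs, still_nfd is False even though the application only
    ever wrote NFD data, which is the loss of guarantee described above.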

    Peter Kirk

    This archive was generated by hypermail 2.1.5 : Mon Nov 24 2003 - 14:01:25 EST