Re: Compression through normalization

From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 24 2003 - 13:06:23 EST


    On 24/11/2003 07:52, Mark E. Shoulson wrote:

    > On 11/24/03 01:26, Doug Ewell wrote:
    >
    >> So the question becomes: Is it legitimate for a Unicode compression
    >> engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
    >> another (canonically equivalent) normalization form to improve its
    >> compressibility?
    >>
    > OK, this *is* a fascinating question. ...

    ...

    It seems to me that there is some mixing of levels here. At one
    level, we have a text consisting of a string of Unicode characters,
    and this is the string which can be normalised or denormalised (or in
    fact subjected to any transformation preserving canonical equivalence)
    at will. At a lower level, we have a sequence of bytes or other code
    units in a Unicode encoding form. And at a still lower level we have a
    sequence of bytes which, at this level, have no known interpretation.
    It is surely at this level that lossless compression should operate.
    Such a compression scheme may receive information from a higher level
    that the byte stream is in a particular encoding form of Unicode, and
    may use that information as a hint. But it should treat this as nothing
    more than a hint, not necessarily reliable, and should preserve the
    byte stream unchanged through compression and decompression.
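
To make the levels concrete, here is a minimal Python sketch. It is an illustration only: the sample word, the choice of UTF-8, and the use of zlib as a stand-in for a byte-level lossless compressor are all assumptions, not anything SCSU or BOCU-1 actually does. It shows that two canonically equivalent strings are different byte streams, and that a compressor working purely on bytes returns each one exactly as it was given.

```python
import unicodedata
import zlib

# Sample Hangul text (an arbitrary choice for illustration).
text_nfc = unicodedata.normalize("NFC", "한국어")   # precomposed syllables
text_nfd = unicodedata.normalize("NFD", text_nfc)   # conjoining jamo

# Character level: the two strings are canonically equivalent.
same_after_normalising = unicodedata.normalize("NFC", text_nfd) == text_nfc

# Encoding-form level: the UTF-8 byte sequences differ.
bytes_nfc = text_nfc.encode("utf-8")
bytes_nfd = text_nfd.encode("utf-8")


def roundtrip(data: bytes) -> bytes:
    """Compress and decompress at the uninterpreted byte level."""
    return zlib.decompress(zlib.compress(data))


# Byte level: whatever goes in comes out, in either normalisation form.
nfc_preserved = roundtrip(bytes_nfc) == bytes_nfc
nfd_preserved = roundtrip(bytes_nfd) == bytes_nfd
```

The compressor here never needs to know that the bytes are Unicode at all, which is the sense in which an encoding hint can be used without being relied upon.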

    If conformance clause C10 is taken to apply at all levels, this
    makes a nonsense of the concept of normalisation stability within
    databases etc. If a low-level process is permitted to make any
    canonically equivalent transformation, then there can be no guarantee
    that data stored in a particular normalisation form is retrievable
    in that same normalisation form, since a low-level compression or
    other process may have transformed the data on the disk or tape, or
    on its way to or from it.
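
The stability problem can be sketched in Python as follows. The function `normalising_compress` is hypothetical, invented here for illustration, and the choice of NFD as the form it converts to is arbitrary; zlib again stands in for the real compression step. Note that `unicodedata.is_normalized` requires Python 3.8 or later.

```python
import unicodedata
import zlib


def normalising_compress(text: str) -> bytes:
    """Hypothetical low-level step that re-normalises before compressing."""
    # Assumption: the process picks NFD; any canonically equivalent
    # transformation would raise the same issue.
    return zlib.compress(unicodedata.normalize("NFD", text).encode("utf-8"))


def decompress(blob: bytes) -> str:
    """Plain decompression; it cannot know which form was originally stored."""
    return zlib.decompress(blob).decode("utf-8")


# The application deliberately stores its data in NFC.
stored = unicodedata.normalize("NFC", "한국어")
retrieved = decompress(normalising_compress(stored))

# Still canonically equivalent, but no longer in the form that was stored.
equivalent = unicodedata.normalize("NFC", retrieved) == stored
stored_is_nfc = unicodedata.is_normalized("NFC", stored)
retrieved_is_nfc = unicodedata.is_normalized("NFC", retrieved)
```

Under these assumptions the retrieved text is canonically equivalent to what was stored, yet a consumer that relied on receiving NFC would have to normalise it again.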

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

