Re: Compression through normalization

From: jon@hackcraft.net
Date: Mon Dec 01 2003 - 05:50:28 EST


    Quoting Doug Ewell <dewell@adelphia.net>:

    > Someone, I forgot who, questioned whether converting Unicode text to NFC
    > would actually improve its compressibility, and asked if any actual data
    > was available.

    I was pretty sure converting to NFC would help compression (at least some of
    the time); I asked for data because the question of *how much* it would help
    was still open.

    > One extremely simple example would be text that consisted mostly of
    > Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from
    > that block. By converting this character to its canonical equivalent
    > U+00C5:
    >
    > * UTF-8 would use 2 bytes instead of 3.
    > * SCSU would use 1 byte instead of 2.
    > * BOCU-1 would use 1 or 2 bytes instead of always using 2.
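
    The UTF-8 part of that is easy enough to check with Python's unicodedata
    module (a quick sketch; I'm taking the SCSU and BOCU-1 figures on trust):

        import unicodedata

        s = "\u212B"                            # ANGSTROM SIGN
        nfc = unicodedata.normalize("NFC", s)   # canonical equivalent U+00C5

        print(hex(ord(nfc)))                    # 0xc5
        print(len(s.encode("utf-8")))           # 3 bytes for U+212B
        print(len(nfc.encode("utf-8")))         # 2 bytes for U+00C5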

    However, if the same text contained U+212B but no U+00C5, some forms of
    compression would give the same results either way (e.g. if you calculated
    Huffman codes over Unicode characters, the output would be identical except
    for the code table).
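
    To illustrate (any text containing U+212B but no U+00C5 will do; this sample
    is made up):

        from collections import Counter
        import unicodedata

        text = "1 \u212B = 0.1 nm"                 # contains U+212B, no U+00C5
        nfc = unicodedata.normalize("NFC", text)   # U+212B is simply renamed to U+00C5

        # The symbol frequencies form the same multiset either way, so a
        # character-level Huffman (or arithmetic) coder emits the same number
        # of bits; only the entry in the code table changes.
        print(sorted(Counter(text).values()) == sorted(Counter(nfc).values()))  # True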

    > This file is in EUC-KR, but can easily be converted to Unicode using
    > recode, SC UniPad, or another converter. It consists of 3,317,215
    > Unicode characters, over 96% Hangul syllables and Basic Latin spaces,
    > full stops, and CRLFs. When broken down into jamos (i.e. converting
    > from NFC to NFD), the character count increases to 6,468,728.
    >
    > The entropy of the syllables file is 6.729, yielding a "Huffman bit
    > count" of 22.3 million bits. That's the theoretical minimum number of
    > bits that could be used to encode this file, character by character,
    > assuming a Huffman or arithmetic coding scheme designed to handle 16- or
    > 32-bit Unicode characters. (Many general-purpose compression algorithms
    > can do better.) The entropy of the jamos file is 4.925, yielding a
    > Huffman bit count of 31.8 million bits, almost 43% larger.
    >
    > When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
    > than the jamos file by 55%, 17%, and 32% respectively.
    >
    > General-purpose algorithms tend to reduce the difference, but PKZip
    > (using deflate) compresses the syllables file to an output 9% smaller
    > than that of the jamos file. Using bzip2, the compressed syllables file
    > is 2% smaller.

    2% isn't much.
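
    Incidentally, the "Huffman bit count" figures are straightforward to
    reproduce: per-character Shannon entropy times the character count. A rough
    sketch (the filenames and the UTF-16 encoding are my assumptions, not
    necessarily Doug's actual setup):

        from collections import Counter
        from math import log2

        def huffman_bit_count(path, encoding="utf-16"):
            """Per-character Shannon entropy, and entropy * character count."""
            text = open(path, encoding=encoding).read()
            counts = Counter(text)
            n = len(text)
            entropy = -sum(c / n * log2(c / n) for c in counts.values())
            return entropy, entropy * n

        # Hypothetical filenames for the NFC (syllables) and NFD (jamos) files;
        # these should come out near 6.729 / 22.3 million bits and
        # 4.925 / 31.8 million bits respectively.
        print(huffman_bit_count("hangul-nfc.txt"))
        print(huffman_bit_count("hangul-nfd.txt"))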

    Further, a Unicode-aware algorithm would expect a choseong character to be
    followed by a jungseong and a jongseong to follow a jungseong, and could gain
    essentially the same compression benefit that normalising to NFC provides,
    but without making an irreversible change (i.e. it could tokenise the jamo
    sequences rather than normalising and then tokenising). As such I'd say the
    question of how much compression can benefit from normalisation is still
    open.
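
    To make that concrete, here's roughly the sort of tokenisation I mean, using
    the standard Hangul composition arithmetic as the token space. It's only a
    sketch: it assumes the input is already fully decomposed (as in the jamos
    file, with no precomposed syllables) and says nothing about how the tokens
    would then be entropy-coded.

        L_BASE, V_BASE, T_BASE, S_BASE = 0x1100, 0x1161, 0x11A7, 0xAC00
        L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

        def tokenise(text):
            """One token per modern L+V(+T) jamo run; other characters pass through."""
            i, out = 0, []
            while i < len(text):
                l = ord(text[i]) - L_BASE
                v = ord(text[i + 1]) - V_BASE if i + 1 < len(text) else -1
                if 0 <= l < L_COUNT and 0 <= v < V_COUNT:
                    t = ord(text[i + 2]) - T_BASE if i + 2 < len(text) else 0
                    if not (0 < t < T_COUNT):
                        t = 0
                    out.append(S_BASE + (l * V_COUNT + v) * T_COUNT + t)
                    i += 3 if t else 2
                else:
                    out.append(ord(text[i]))   # assumes no precomposed syllables here
                    i += 1
            return out

        def detokenise(tokens):
            """Invert tokenise(), restoring the original jamo sequence."""
            out = []
            for tok in tokens:
                s = tok - S_BASE
                if 0 <= s < L_COUNT * V_COUNT * T_COUNT:
                    l = s // (V_COUNT * T_COUNT)
                    v = (s // T_COUNT) % V_COUNT
                    t = s % T_COUNT
                    out.append(chr(L_BASE + l) + chr(V_BASE + v)
                               + (chr(T_BASE + t) if t else ""))
                else:
                    out.append(chr(tok))
            return "".join(out)

    The coder would then see essentially the same token stream it would see
    after normalising, but detokenise() hands back the original jamo sequence,
    so nothing irreversible has happened to the text.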

    > Whether a "silent" normalization to NFC can be a legitimate part of
    > Unicode compression remains in question. I notice the list is still
    > split as to whether this process "changes" the text (because checksums
    > will differ) or not (because C10 says processes must consider the text
    > to be equivalent).

    I think practical uses will continue to be split on this as well, and as such
    any normalising compression system will not be applicable to all uses. Of
    course that answers the question "should we normalise?" with the
    question "should we have a compression scheme that isn't universally
    applicable?"

    --
    Jon Hanna
    <http://www.hackcraft.net/>
    *Thought provoking quote goes here*
    

