Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Nov 29 2003 - 17:17:10 EST


    Someone, I forget who, questioned whether converting Unicode text to NFC
    would actually improve its compressibility, and asked whether any actual
    data was available.

    Certainly there is no guarantee that normalization would *always* result
    in a smaller file. A compressor that took advantage of normalization
    would have to determine, for each input, whether normalizing actually
    yields a benefit.

    One extremely simple example would be text that consisted mostly of
    Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from
    the Letterlike Symbols block. By converting this character to its
    canonical equivalent U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE (a
    short sketch follows the list):

    * UTF-8 would use 2 bytes instead of 3.
    * SCSU would use 1 byte instead of 2.
    * BOCU-1 would use 1 or 2 bytes instead of always using 2.
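    Here is a minimal Python sketch of the UTF-8 half of that comparison
    (SCSU and BOCU-1 have no standard-library codecs, so only the UTF-8
    byte counts are shown; the sample string is made up for illustration):

        import unicodedata

        s = "radius 1.2 \u212b"                 # text containing ANGSTROM SIGN
        nfc = unicodedata.normalize("NFC", s)   # U+212B composes to U+00C5

        assert "\u00c5" in nfc and "\u212b" not in nfc
        print(len("\u212b".encode("utf-8")))    # 3 bytes before normalization
        print(len("\u00c5".encode("utf-8")))    # 2 bytes after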

    A longer and more realistic case can be seen in the sample Korean file
    at:

    http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

    This file is in EUC-KR, but can easily be converted to Unicode using
    recode, SC UniPad, or another converter. It consists of 3,317,215
    Unicode characters, over 96% of them Hangul syllables together with
    Basic Latin spaces, full stops, and CRLFs. When broken down into jamos
    (i.e. converted from NFC to NFD), the character count increases to
    6,468,728.
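    One way to reproduce those counts in Python, assuming a local copy of
    arirang.txt (newline="" keeps the CRLFs intact, so each one counts as
    two characters):

        import unicodedata

        # Read the EUC-KR source and convert it to Unicode.
        with open("arirang.txt", encoding="euc-kr", newline="") as f:
            nfc = unicodedata.normalize("NFC", f.read())

        # Break the Hangul syllables down into conjoining jamos.
        nfd = unicodedata.normalize("NFD", nfc)

        print(len(nfc))   # syllable (NFC) character count
        print(len(nfd))   # jamo (NFD) character count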

    The entropy of the syllables file is 6.729 bits per character, yielding
    a "Huffman bit count" of 22.3 million bits. That's the theoretical
    minimum number of bits that could be used to encode this file, character
    by character, assuming a Huffman or arithmetic coding scheme designed to
    handle 16- or 32-bit Unicode characters. (Many general-purpose
    compression algorithms can do better.) The entropy of the jamos file is
    4.925 bits per character, yielding a Huffman bit count of 31.8 million
    bits, almost 43% larger.
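    Those figures are the ordinary Shannon entropy of the character
    distribution times the character count; a sketch of the calculation,
    reusing nfc and nfd from the previous snippet:

        import math
        from collections import Counter

        def entropy(text):
            """Shannon entropy of the text, in bits per character."""
            n = len(text)
            counts = Counter(text)
            return -sum(c / n * math.log2(c / n) for c in counts.values())

        for label, text in (("syllables", nfc), ("jamos", nfd)):
            h = entropy(text)
            # "Huffman bit count": entropy times character count, the
            # floor for any per-character coder.
            print(label, round(h, 3), round(h * len(text) / 1e6, 1), "Mbit")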

    When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
    than the jamos file by 55%, 17%, and 32% respectively.
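    The UTF-8 part of that comparison is easy to check (again, SCSU and
    BOCU-1 would need third-party encoders), continuing from the snippets
    above:

        utf8_nfc = nfc.encode("utf-8")
        utf8_nfd = nfd.encode("utf-8")

        # A Hangul syllable and a conjoining jamo are each 3 bytes in
        # UTF-8, so decomposition roughly doubles the file.
        print(1 - len(utf8_nfc) / len(utf8_nfd))   # fraction saved by NFC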

    General-purpose algorithms tend to reduce the difference, but PKZip
    (using deflate) compresses the syllables file to an output 9% smaller
    than that of the jamos file. Using bzip2, the compressed syllables file
    is 2% smaller.
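    Both of those general-purpose algorithms are available in Python's
    standard library (zlib implements deflate, the method PKZip used here),
    so the same comparison can be sketched directly on the bytes from the
    previous snippet:

        import bz2
        import zlib

        for name, compress in (("deflate", zlib.compress),
                               ("bzip2", bz2.compress)):
            ratio = 1 - len(compress(utf8_nfc)) / len(compress(utf8_nfd))
            # Positive ratio: the normalized (NFC) file compresses smaller.
            print(name, round(ratio, 3))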

    So we can at least say that Korean, which can be normalized from NFD to
    NFC algorithmically and without the use of long tables of equivalents or
    exclusions, can consistently be compressed to a smaller size after such
    normalization than before.
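    That algorithmic composition is just the Hangul syllable arithmetic from
    the Unicode standard; a minimal sketch for a leading-vowel-trailing jamo
    triple:

        S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
        V_COUNT, T_COUNT = 21, 28

        def compose(l, v, t=None):
            """Compose L, V, and optional T jamos into one Hangul syllable."""
            l_index = ord(l) - L_BASE
            v_index = ord(v) - V_BASE
            t_index = ord(t) - T_BASE if t else 0
            return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                       + t_index)

        # U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C HAN
        assert compose("\u1112", "\u1161", "\u11ab") == "\ud55c"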

    Whether a "silent" normalization to NFC can be a legitimate part of
    Unicode compression remains in question. I notice the list is still
    split as to whether this process "changes" the text (because checksums
    will differ) or not (because conformance clause C10 says processes must
    treat canonically equivalent sequences as equivalent).

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


