Re: Compression through normalization

From: jon@hackcraft.net
Date: Mon Dec 01 2003 - 05:50:28 EST


    Quoting Doug Ewell <dewell@adelphia.net>:

    > Someone, I forgot who, questioned whether converting Unicode text to NFC
    > would actually improve its compressibility, and asked if any actual data
    > was available.

    I was pretty sure converting to NFC would help compression (at least some of
    the time); I asked for data because the question of *how much* it would help
    was still open.

    > One extremely simple example would be text that consisted mostly of
    > Latin-1, but contained U+212B ANGSTROM SIGN and no other characters from
    > that block. By converting this character to its canonical equivalent
    > U+00C5:
    >
    > * UTF-8 would use 2 bytes instead of 3.
    > * SCSU would use 1 byte instead of 2.
    > * BOCU-1 would use 1 or 2 bytes instead of always using 2.
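
    The UTF-8 part of that is easy enough to check with Python's unicodedata
    module (a quick sketch; I'm taking the SCSU and BOCU-1 figures on trust):

        import unicodedata

        s = "\u212B"                            # ANGSTROM SIGN
        nfc = unicodedata.normalize("NFC", s)   # canonical equivalent U+00C5

        print(hex(ord(nfc)))                    # 0xc5
        print(len(s.encode("utf-8")))           # 3 bytes for U+212B
        print(len(nfc.encode("utf-8")))         # 2 bytes for U+00C5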

    However, if the same text contained U+212B but no U+00C5, some forms of
    compression would give the same results either way (e.g. if you calculated
    Huffman codes over Unicode characters, the output would be identical except
    for the code table).
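
    To illustrate (any text containing U+212B but no U+00C5 will do; this sample
    is made up):

        from collections import Counter
        import unicodedata

        text = "1 \u212B = 0.1 nm"                 # contains U+212B, no U+00C5
        nfc = unicodedata.normalize("NFC", text)   # U+212B is simply renamed to U+00C5

        # The symbol frequencies form the same multiset either way, so a
        # character-level Huffman (or arithmetic) coder emits the same number
        # of bits; only the entry in the code table changes.
        print(sorted(Counter(text).values()) == sorted(Counter(nfc).values()))  # True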

    > This file is in EUC-KR, but can easily be converted to Unicode using
    > recode, SC UniPad, or another converter. It consists of 3,317,215
    > Unicode characters, over 96% Hangul syllables and Basic Latin spaces,
    > full stops, and CRLFs. When broken down into jamos (i.e. converting
    > from NFC to NFD), the character count increases to 6,468,728.
    >
    > The entropy of the syllables file is 6.729, yielding a "Huffman bit
    > count" of 22.3 million bits. That's the theoretical minimum number of
    > bits that could be used to encode this file, character by character,
    > assuming a Huffman or arithmetic coding scheme designed to handle 16- or
    > 32-bit Unicode characters. (Many general-purpose compression algorithms
    > can do better.) The entropy of the jamos file is 4.925, yielding a
    > Huffman bit count of 31.8 million bits, almost 43% larger.
    >
    > When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
    > than the jamos file by 55%, 17%, and 32% respectively.
    >
    > General-purpose algorithms tend to reduce the difference, but PKZip
    > (using deflate) compresses the syllables file to an output 9% smaller
    > than that of the jamos file. Using bzip2, the compressed syllables file
    > is 2% smaller.

    2% isn't much.
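
    Incidentally, the "Huffman bit count" figures are straightforward to
    reproduce: per-character Shannon entropy times the character count. A rough
    sketch (the filenames and the UTF-16 encoding are my assumptions, not
    necessarily Doug's actual setup):

        from collections import Counter
        from math import log2

        def huffman_bit_count(path, encoding="utf-16"):
            """Per-character Shannon entropy, and entropy * character count."""
            text = open(path, encoding=encoding).read()
            counts = Counter(text)
            n = len(text)
            entropy = -sum(c / n * log2(c / n) for c in counts.values())
            return entropy, entropy * n

        # Hypothetical filenames for the NFC (syllables) and NFD (jamos) files;
        # these should come out near 6.729 / 22.3 million bits and
        # 4.925 / 31.8 million bits respectively.
        print(huffman_bit_count("hangul-nfc.txt"))
        print(huffman_bit_count("hangul-nfd.txt"))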

    Further, a Unicode-aware algorithm would expect a choseong character to be
    followed by a jungseong and a jongseong to follow a jungseong, and could gain
    essentially the same compression benefit that normalising to NFC provides,
    but without making an irreversible change (i.e. it could tokenise the jamo
    sequences rather than normalising and then tokenising). As such I'd say the
    question of how much compression can benefit from normalisation is still
    open.
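
    To make that concrete, here's roughly the sort of tokenisation I mean, using
    the standard Hangul composition arithmetic as the token space. It's only a
    sketch: it assumes the input is already fully decomposed (as in the jamos
    file, with no precomposed syllables) and says nothing about how the tokens
    would then be entropy-coded.

        L_BASE, V_BASE, T_BASE, S_BASE = 0x1100, 0x1161, 0x11A7, 0xAC00
        L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

        def tokenise(text):
            """One token per modern L+V(+T) jamo run; other characters pass through."""
            i, out = 0, []
            while i < len(text):
                l = ord(text[i]) - L_BASE
                v = ord(text[i + 1]) - V_BASE if i + 1 < len(text) else -1
                if 0 <= l < L_COUNT and 0 <= v < V_COUNT:
                    t = ord(text[i + 2]) - T_BASE if i + 2 < len(text) else 0
                    if not (0 < t < T_COUNT):
                        t = 0
                    out.append(S_BASE + (l * V_COUNT + v) * T_COUNT + t)
                    i += 3 if t else 2
                else:
                    out.append(ord(text[i]))   # assumes no precomposed syllables here
                    i += 1
            return out

        def detokenise(tokens):
            """Invert tokenise(), restoring the original jamo sequence."""
            out = []
            for tok in tokens:
                s = tok - S_BASE
                if 0 <= s < L_COUNT * V_COUNT * T_COUNT:
                    l = s // (V_COUNT * T_COUNT)
                    v = (s // T_COUNT) % V_COUNT
                    t = s % T_COUNT
                    out.append(chr(L_BASE + l) + chr(V_BASE + v)
                               + (chr(T_BASE + t) if t else ""))
                else:
                    out.append(chr(tok))
            return "".join(out)

    The coder would then see essentially the same token stream it would see
    after normalising, but detokenise() hands back the original jamo sequence,
    so nothing irreversible has happened to the text.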

    > Whether a "silent" normalization to NFC can be a legitimate part of
    > Unicode compression remains in question. I notice the list is still
    > split as to whether this process "changes" the text (because checksums
    > will differ) or not (because C10 says processes must consider the text
    > to be equivalent).

    I think practical uses will continue to be split on this as well, and as such
    any normalising compression system will not be applicable to all uses. Of
    course that answers the question "should we normalise?" with the
    question "should we have a compression scheme that isn't universally
    applicable?"

    --
    Jon Hanna
    <http://www.hackcraft.net/>
    *Thought provoking quote goes here*
    

