Re: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun Nov 30 2003 - 07:14:56 EST


    On Sat, 29 Nov 2003, Doug Ewell wrote:

    > A longer and more realistic case can be seen in the sample Korean file
    > at:
    >
    > http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

    I finally downloaded the file and took a look at it. I was surprised
    to find that the text is the entire content of volume 1 of a famous
    Korean novel (Arirang) by a _living_ Korean writer, CHO Chongrae
    (published in the early 1990s). This seems to be problematic because
    it's clearly copyrighted and I don't see any mention of permission
    having been obtained from the author or the publisher. Using the text
    for writing the paper may be all right, but putting it up on the web
    for everyone to download is not (afaik).

    > This file is in EUC-KR, but can easily be converted to Unicode using

       I read the novel (almost 10 years ago) and found a lot of Hangul
    syllables NOT covered by KS X 1001 (one of the two coded character
    sets that make up EUC-KR, the other being US-ASCII/ISO 646:KR). [1]
    The novel contains a large amount of faithful transcription of the
    Cholla (South-Western) dialect of Korean, and it's all but impossible
    to do that within the character repertoire of KS X 1001. So, I was
    curious as to what they did in arirang.txt (because iconv(3) didn't
    detect any invalid byte sequence when I used it to convert from
    EUC-KR to UTF-8). It turned out that, in the EUC-KR arirang.txt they
    put up at www.cs.fit.edu, every Hangul syllable outside KS X 1001 was
    replaced by either an ASCII space or the first Hangul compatibility
    Jamo of that syllable. They should have used UTF-8 from the
    beginning. It wouldn't have changed their result very significantly,
    but still would have given them slightly different numbers.
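
    To illustrate the repertoire problem, here is a minimal Python sketch
    that lists the precomposed Hangul syllables in a text that have no
    ordinary two-byte EUC-KR code. It is not what was done for arirang.txt
    (that check used iconv); it assumes Python's built-in `euc_kr` codec
    reflects the KS X 1001 repertoire. The function names and the sample
    syllable are hypothetical choices made for this example only.

```python
def in_ksx1001(ch):
    """Return True if ch has an ordinary 2-byte EUC-KR (KS X 1001) code."""
    try:
        encoded = ch.encode("euc_kr")
    except UnicodeEncodeError:
        return False
    # Assumption: a codec may emit a longer multi-byte sequence for
    # syllables outside KS X 1001 instead of raising an error, so anything
    # other than a plain 2-byte code is treated as "not covered".
    return len(encoded) == 2


def uncovered_syllables(text):
    """List distinct precomposed Hangul syllables without a KS X 1001 code."""
    found = []
    for ch in text:
        is_syllable = "\uac00" <= ch <= "\ud7a3"  # Hangul Syllables block
        if is_syllable and not in_ksx1001(ch) and ch not in found:
            found.append(ch)
    return found
```

    Running `uncovered_syllables()` over a Unicode copy of a text before
    converting it to EUC-KR would show which syllables are at risk of being
    replaced or lost.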

    > can do better.) The entropy of the jamos file is 4.925, yielding a
    > Huffman bit count of 31.8 million bits, almost 43% larger.

    > When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
    > than the jamos file by 55%, 17%, and 32% respectively.

    You wrote earlier the following. In terms of the number of Unicode
    characters, going to NFD increases the size almost by 100%.

    > 3,317,215 Unicode characters, over 96% Hangul syllables and Basic
    > Latin spaces, full stops, and CRLFs. When broken down into jamos
    > (i.e. converting from NFC to NFD), the character count increases to
    > 6,468,728.
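
    The character counts quoted above (3,317,215 in NFC against 6,468,728
    in NFD, a ratio of about 1.95) can be reproduced in kind with a few
    lines of Python. This is only a sketch using the standard `unicodedata`
    module; the helper name `nfc_nfd_counts` and the sample word are
    assumptions of this example, not part of the original measurements.

```python
import unicodedata


def nfc_nfd_counts(text):
    """Count characters and UTF-8 bytes of text in NFC and in NFD."""
    nfc = unicodedata.normalize("NFC", text)  # precomposed syllables
    nfd = unicodedata.normalize("NFD", text)  # conjoining jamos
    return {
        "nfc_chars": len(nfc),
        "nfd_chars": len(nfd),
        "nfc_utf8": len(nfc.encode("utf-8")),
        "nfd_utf8": len(nfd.encode("utf-8")),
    }


# Three syllables; they decompose into 2 + 2 + 3 = 7 jamos.
counts = nfc_nfd_counts("아리랑")
```

    Both precomposed syllables and conjoining jamos take three bytes each
    in UTF-8, so the UTF-8 byte counts grow in step with the character
    counts.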

    So, I was a bit confused by your 55% for a moment or two until I realized
    that the reference is the other way around (because you're talking about
    compression via normalization, which is different from the main reason
    I'm interested in the issue). So, NFD text (in UTF-8) is about twice
    as long as NFC text (in UTF-8). That's not as bad as a simple
    back-of-the-envelope calculation suggests. NFD text in SCSU and BOCU-1
    is _only_ 20% and 47% longer than NFC text in SCSU and BOCU-1. This is
    even better.
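
    The change of reference is simple arithmetic: if the syllables file is
    p% smaller than the jamos file, the jamos file is 1/(1 - p/100) - 1
    longer than the syllables file. A small Python sketch of that
    conversion, applied to the rounded percentages quoted above (the
    function name `longer_pct` is made up for this example):

```python
def longer_pct(smaller_pct):
    """If A is smaller_pct % smaller than B, return how many % longer B is than A."""
    return (1 / (1 - smaller_pct / 100) - 1) * 100


# Rounded "syllables file is smaller by" figures quoted in the thread.
for name, pct in [("UTF-8", 55), ("SCSU", 17), ("BOCU-1", 32)]:
    print(f"{name}: jamos file is about {longer_pct(pct):.0f}% longer")
```

    With these rounded inputs the result is roughly 122% for UTF-8 (a
    factor of about 2.2), 20% for SCSU and 47% for BOCU-1; figures computed
    from the unrounded file sizes may differ slightly.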

    > General-purpose algorithms tend to reduce the difference, but PKZip
    > (using deflate) compresses the syllables file to an output 9% smaller
    > than that of the jamos file. Using bzip2, the compressed syllables file
    > is 2% smaller.

      bzip2 is wonderful! With bzip2 narrowing the 'gulf' to ~ 2%
    and pkzip to ~ 11%, 'proponents' of using Hangul letters over Hangul
    syllables have another good argument as to why Hangul letters should
    be favored in representing Korean text. Thanks for the good news :-)
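
    For anyone who wants to repeat this kind of comparison on their own
    text, here is a rough Python sketch using the standard `zlib` (deflate)
    and `bz2` modules. It is not the procedure behind the numbers above:
    `zlib` output lacks the PKZip container, the helper name
    `compressed_sizes` is an assumption of this example, and a short
    repetitive sample says nothing about a full-length novel.

```python
import bz2
import unicodedata
import zlib


def compressed_sizes(text):
    """Return raw and compressed UTF-8 byte sizes of text in NFC and NFD."""
    sizes = {}
    for form in ("NFC", "NFD"):
        data = unicodedata.normalize(form, text).encode("utf-8")
        sizes[form] = {
            "raw": len(data),
            "deflate": len(zlib.compress(data, 9)),  # zlib-wrapped deflate
            "bzip2": len(bz2.compress(data, 9)),
        }
    return sizes


# Toy input only; real measurements need a real corpus.
sizes = compressed_sizes("아리랑 아리랑 아라리요\n" * 200)
```

    Comparing `sizes["NFD"]` against `sizes["NFC"]` for each compressor
    gives the same kind of ratio as discussed above.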

      Jungshik

    [1] Needless to say, when I read the novel, I didn't have the KS X 1001
    table by my side. However, it's easy for me to spot Hangul syllables
    not covered by KS X 1001. Besides, when I read the sequel to 'Arirang',
    Han-gang (Han-river) by the same author, which appeared daily on the
    Hangyoreh shinmun web site (http://www.hani.co.kr) a few years ago,
    Hangul syllables outside the KS X 1001 character repertoire were
    represented by sequences of Hangul Compatibility Jamos (U+3130 block)
    because the newspaper web site used (and still uses) EUC-KR. In every
    daily installment, there were at least several syllables represented
    that way.


