Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 30 2003 - 14:47:18 EST

    Jungshik Shin <jshin at mailaps dot org> wrote:

    > I finally downloaded the file and took a look at it. I was surprised
    > to find that the text is the entire content of volume 1 of a
    > famous Korean novel (Arirang) by a _living_ Korean writer, CHO Chongrae
    > (published in the early 1990s). This seems to be problematic because
    > it's clearly copyrighted and I don't see any mention of having
    > obtained permission from the author or the publisher. Using the text
    > for writing the paper may be all right, but putting it up on the web
    > for everyone to download is not (afaik).

    Ah. That explains why it wasn't linked from any HTML page (forcing me
    to paw through their directories to find it), although the data was
    apparently used in writing the paper.

    I would not blame Atkin and Stansifer one bit for any possible copyright
    violation. They probably yanked the file as soon as they discovered it
    was protected.

    Nobody remembered or copied down the URL when I gave it (twice), did
    they? Good.

    > So, I was curious as to what they did in arirang.txt (because iconv(3)
    > didn't detect any invalid byte sequence when I used it to convert to
    > UTF-8 from EUC-KR). It turned out that, in the EUC-KR arirang.txt they
    > put up at www.cs.fit.edu, they replaced every Hangul syllable outside
    > KS X 1001 with either an ASCII space or the first Hangul compatibility
    > Jamo of that syllable. They should have used UTF-8 from the
    > beginning. It wouldn't have changed their results very significantly,
    > but it still would have given them slightly different numbers.

    That was the whole theme of their paper: find a text on the Internet,
    encoded in a legacy encoding; re-encode it into a variety of Unicode
    CES's (and some bizarre TES's like the entire Unicode character name);
    then compress all the forms using gzip and bzip2 and compare the
    results. This means they compared the compressibility of (for example)
    the Arirang file as encoded in EUC-KR against the same file encoded in
    UTF-8, SCSU, and other Unicode encodings. Not surprisingly, the
    legacy-encoded file was often the smallest. My paper argues that this
    is not a fair comparison.
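
    The kind of comparison described above can be sketched in a few lines
    of Python. This is only an illustration under stated assumptions, not
    the procedure Atkin and Stansifer actually ran: the function name
    compressed_sizes and the sample string are made up for the example,
    and SCSU and BOCU-1 are left out because the Python standard library
    has no codecs for them.

```python
import bz2
import gzip


def compressed_sizes(text, encodings=("euc_kr", "utf_8", "utf_16_le")):
    """Return {encoding: (raw, gzip, bzip2)} byte counts for `text`."""
    sizes = {}
    for enc in encodings:
        raw = text.encode(enc)  # re-encode the same text
        sizes[enc] = (
            len(raw),
            len(gzip.compress(raw)),
            len(bz2.compress(raw)),
        )
    return sizes


# Hypothetical stand-in text (NOT the novel); every syllable is in KS X 1001,
# so it can be encoded in EUC-KR without loss.
sample = "아리랑 아리랑 아라리요. " * 200

for enc, (raw, gz, bz) in compressed_sizes(sample).items():
    print(f"{enc:10} raw={raw:6} gzip={gz:5} bzip2={bz:5}")
```

    On a short, highly repetitive sample like this the compressed sizes
    say little; the point is only the shape of the experiment: one text,
    several encodings, the same compressors applied to each.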

    >> When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
    >> than the jamos file by 55%, 17%, and 32% respectively.
    >
    > You wrote earlier the following. In terms of the number of Unicode
    > characters, going to NFD increases the size almost by 100%.
    >
    >> 3,317,215 Unicode characters, over 96% Hangul syllables and Basic
    >> Latin spaces, full stops, and CRLFs. When broken down into jamos
    >> (i.e. converting from NFC to NFD), the character count increases to
    >> 6,468,728.
    >
    > So, I was a bit confused by your 55% for a moment or two until I
    > realized that the reference is the other way around (because you're
    > talking about compression via normalization, which is different
    > from the main reason I'm interested in the issue). So, NFD text (in
    > UTF-8) is about twice as long as NFC text (in UTF-8). That's not as
    > bad as a simple back-of-the-envelope calculation suggests. NFD text in
    > SCSU and BOCU-1 is _only_ 20% and 47% longer than NFC text in SCSU
    > and BOCU-1. This is even better.
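
    The two ways of stating the difference ("syllables file smaller by
    X%" versus "jamos file longer by Y%") are related by simple
    arithmetic. The sketch below only restates the percentages quoted
    above; the helper name longer_by is made up for the example.

```python
def longer_by(smaller_by_percent):
    """Convert 'A is p% smaller than B' into 'B is q% longer than A'."""
    ratio = 1 / (1 - smaller_by_percent / 100)  # size of B relative to A
    return (ratio - 1) * 100


for name, pct in [("UTF-8", 55), ("SCSU", 17), ("BOCU-1", 32)]:
    print(f"{name}: syllables {pct}% smaller -> jamos {longer_by(pct):.0f}% longer")
```

    With the rounded inputs 55%, 17%, and 32%, this gives roughly 122%,
    20%, and 47%, so the UTF-8 jamos file comes out at a little over
    twice the size of the syllables file.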

    Sorry for the confusion. The missing link is that the number of Unicode
    characters in a text isn't necessarily the same as the number of code
    units needed to represent it. It is if you're using UTF-32, or in this
    BMP-only case, UTF-16.
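
    A small Python sketch of the distinction between characters and code
    units, assuming Python 3, where len() on a str counts code points.
    The function name unit_counts is made up for the example.

```python
def unit_counts(text):
    """Count code points and code units of `text` in three encoding forms."""
    return {
        "code_points": len(text),
        "utf8_units": len(text.encode("utf-8")),            # 8-bit units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 32-bit units
    }


# Two precomposed Hangul syllables, both in the BMP.
print(unit_counts("한글"))
```

    For this BMP-only input the UTF-16 and UTF-32 counts match the
    character count (2), while UTF-8 needs 3 code units per syllable (6).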

    A file containing nothing but Hangul syllables, converted to jamos
    (NFD), would expand to between 2 and 3 times the original number of
    Unicode characters (because each syllable expands to 2 or 3 jamos).
    Spaces and full stops and CRLFs reduce this.
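
    The expansion can be checked with Python's unicodedata module. This
    is a minimal sketch; the function name nfd_expansion and the sample
    strings are made up for the example.

```python
import unicodedata


def nfd_expansion(text):
    """Return (NFC length, NFD length, ratio) in Unicode characters."""
    nfc = unicodedata.normalize("NFC", text)
    nfd = unicodedata.normalize("NFD", text)
    return len(nfc), len(nfd), len(nfd) / len(nfc)


print(nfd_expansion("가"))      # syllable without a final consonant: 2 jamos
print(nfd_expansion("한"))      # syllable with a final consonant: 3 jamos
print(nfd_expansion("가 한."))  # the space and full stop do not expand
```

    The last call shows the diluting effect: 4 characters become 7, a
    ratio of 1.75, which is below the 2 to 3 range for pure Hangul.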

    > bzip2 is wonderful! With bzip2 narrowing the 'gulf' to ~2% and pkzip
    > to ~11%, 'proponents' of using Hangul letters over Hangul syllables
    > have another good argument as to why Hangul letters should be favored
    > in representing Korean text. Thanks for the good news :-)

    Yet another issue covered in my paper. I'm REALLY REALLY hoping the
    last reviewer gives me some feedback soon so I can release it. But I do
    think I'm going to have to add a small section on this normalization
    issue.

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


