Re: Compression through normalization

From: Jungshik Shin (jshin@mailaps.org)
Date: Sun Nov 30 2003 - 07:14:56 EST

Next message: Philippe Verdy: "RE: Brahmic list ? (was: Oriya: mba / mwa ?)"

Previous message: Michael Everson: "RE: Oriya: mba / mwa ?"
In reply to: Doug Ewell: "Re: Compression through normalization"
Next in thread: Doug Ewell: "Re: Compression through normalization"
Reply: Doug Ewell: "Re: Compression through normalization"

On Sat, 29 Nov 2003, Doug Ewell wrote:

> A longer and more realistic case can be seen in the sample Korean file
> at:
>
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt

I finally downloaded the file and took a look at it. I was surprised
to find that the text is the entire content of volume 1 of a famous
Korean novel (Arirang) by a _living_ Korean writer, CHO Chongrae (published
in the early 1990s). This seems to be problematic because it's clearly
copyrighted and I don't see any mention of permission having been obtained
from the author or the publisher. Using the text for writing the paper may
be all right, but putting it up on the web for everyone to download is
not (afaik).

> This file is in EUC-KR, but can easily be converted to Unicode using

I read the novel (almost 10 years ago) and found a lot of Hangul
syllables NOT covered by KS X 1001 (one of the two CCSs comprising EUC-KR,
along with US-ASCII/ISO 646:KR). [1] The novel contains a large amount of
faithful transcription of the Cholla (South-Western) dialect of Korean, and
it's all but impossible to do that within the character repertoire
of KS X 1001. So I was curious as to what they did in arirang.txt
(because iconv(3) didn't detect any invalid byte sequence when I used
it to convert from EUC-KR to UTF-8). It turned out that, in the EUC-KR
arirang.txt they put up at www.cs.fit.edu, every Hangul syllable outside
KS X 1001 was replaced by either an ASCII space or the first Hangul
compatibility Jamo of the syllable. They should have used UTF-8 from the
beginning. It wouldn't have changed their results very significantly,
but it still would have given them slightly different numbers.
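
A minimal sketch of the kind of check described above, assuming Python and
its built-in "euc_kr" codec rather than iconv(3). The helper name
count_compat_jamo and the sample string are made up for illustration; the
real arirang.txt is not used here. Note that compatibility Jamo can also
occur legitimately in Korean text, so a nonzero count is only a hint that
stand-ins may be present.

```python
def count_compat_jamo(text: str) -> int:
    """Count Hangul Compatibility Jamo letters (U+3131..U+318E) in text."""
    return sum(1 for ch in text if 0x3131 <= ord(ch) <= 0x318E)


# Hypothetical sample: three KS X 1001 syllables plus two lone compatibility
# Jamo of the kind that could stand in for syllables missing from KS X 1001.
sample_bytes = "아리랑 ㄸ ㅎ".encode("euc_kr")

# Decoding succeeds without error, because the stand-ins are themselves
# valid EUC-KR. This is why a plain conversion reports nothing unusual.
decoded = sample_bytes.decode("euc_kr")

print(count_compat_jamo(decoded), "compatibility Jamo in", len(decoded), "characters")
```

Running the same count over a decoded copy of a real file would give a
rough upper bound on how many syllables were replaced by a lone Jamo;
syllables replaced by an ASCII space cannot be recovered this way.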

> can do better.) The entropy of the jamos file is 4.925, yielding a
> Huffman bit count of 31.8 million bits, almost 43% larger.

> When encoded in UTF-8, SCSU, or BOCU-1, the syllables file is smaller
> than the jamos file by 55%, 17%, and 32% respectively.

You wrote the following earlier. In terms of the number of Unicode
characters, going to NFD increases the size by almost 100%.

> 3,317,215 Unicode characters, over 96% Hangul syllables and Basic
> Latin spaces, full stops, and CRLFs. When broken down into jamos
> (i.e. converting from NFC to NFD), the character count increases to
> 6,468,728.

So I was a bit confused by your 55% for a moment or two, until I realized
that the reference is the other way around (because you're talking about
compression via normalization, which is different from the main reason
I'm interested in the issue). So NFD text (in UTF-8) is about twice
as long as NFC text (in UTF-8). That's not as bad as a simple
back-of-the-envelope calculation suggests. NFD text in SCSU and BOCU-1 is
_only_ 20% and 47% longer, respectively, than NFC text in SCSU and BOCU-1.
This is even better.
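
The change of reference can be sketched as follows, assuming Python. If
file A is smaller than file B by a fraction p, then B is longer than A by
p / (1 - p). The function name longer_by and the three-syllable sample are
illustrative only; the percentages fed in are the ones quoted above.

```python
import unicodedata


def longer_by(smaller_by: float) -> float:
    """If A is smaller than B by fraction p, B is longer than A by p / (1 - p)."""
    return smaller_by / (1.0 - smaller_by)


# "Syllables file is smaller than jamos file by ..." figures quoted above.
for name, p in [("UTF-8", 0.55), ("SCSU", 0.17), ("BOCU-1", 0.32)]:
    print(f"{name}: NFD is about {longer_by(p):.0%} longer than NFC")

# Tiny illustration of why NFD grows: each precomposed syllable becomes
# two or three conjoining Jamo, and each of those takes 3 bytes in UTF-8.
nfc = unicodedata.normalize("NFC", "아리랑")
nfd = unicodedata.normalize("NFD", nfc)
print("characters:", len(nfc), "->", len(nfd))
print("UTF-8 bytes:", len(nfc.encode("utf-8")), "->", len(nfd.encode("utf-8")))
```

For the UTF-8 case the formula gives a bit more than a doubling, which is
consistent with "about twice as long".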

> General-purpose algorithms tend to reduce the difference, but PKZip
> (using deflate) compresses the syllables file to an output 9% smaller
> than that of the jamos file. Using bzip2, the compressed syllables file
> is 2% smaller.

bzip2 is wonderful! With bzip2 narrowing the 'gulf' to ~ 2%
and pkzip to ~ 10%, 'proponents' of using Hangul letters over Hangul
syllables have another good argument as to why Hangul letters should be favored
in representing Korean text. Thanks for the good news :-)
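
A comparison of this kind can be sketched as below, assuming Python's
standard zlib and bz2 modules. zlib uses deflate but is not PKZip itself,
so its container overhead differs slightly. The function name
compressed_sizes and the repeated sample line are hypothetical; numbers
from such a short, repetitive sample say nothing about the figures for the
real file.

```python
import bz2
import unicodedata
import zlib


def compressed_sizes(text: str) -> dict:
    """Return raw, deflate and bzip2 byte counts for NFC and NFD UTF-8 text."""
    sizes = {}
    for form in ("NFC", "NFD"):
        data = unicodedata.normalize(form, text).encode("utf-8")
        sizes[form] = {
            "raw": len(data),
            "deflate": len(zlib.compress(data, 9)),
            "bzip2": len(bz2.compress(data, 9)),
        }
    return sizes


# Hypothetical sample text, repeated so the compressors have something to do.
sample = "아리랑 아리랑 아라리요. " * 200
sizes = compressed_sizes(sample)

for method in ("raw", "deflate", "bzip2"):
    nfc, nfd = sizes["NFC"][method], sizes["NFD"][method]
    print(f"{method}: NFC {nfc} bytes, NFD {nfd} bytes, NFD/NFC = {nfd / nfc:.2f}")
```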

Jungshik

[1] Needless to say, when I read the novel, I didn't have the KS X 1001
table by my side. However, it's easy for me to spot Hangul syllables
not covered by KS X 1001. Besides, when I read the sequel to 'Arirang',
Han-gang (Han River) by the same author, which appeared daily on the Hangyoreh
shinmun web site (http://www.hani.co.kr) a few years ago, Hangul syllables
outside the KS X 1001 character repertoire were represented by sequences
of Hangul Compatibility Jamos (U+3130) because the newspaper web site used
(and still uses) EUC-KR. In every daily installment, there were at least
several syllables represented that way.


This archive was generated by hypermail 2.1.5 : Sun Nov 30 2003 - 07:51:45 EST