Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Nov 23 2003 - 01:53:10 EST

    Jungshik Shin <jshin at mailaps dot org> wrote:

    >> The file they used, called "arirang.txt," contains over 3.3 million
    >> Unicode characters and was apparently once part of their "Florida
    >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for
    >> reasons not known to me. I can supply it if you're interested.
    >
    > It'd be great if you could.

    Try
    http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
    first. If that doesn't work, I'll send you a copy. It's over 5
    megabytes, so I'd like to avoid that if possible.

    >> The statistics on this file are as follows:
    >>
    >> UTF-16 6,634,430 bytes
    >> UTF-8 7,637,601 bytes
    >> SCSU 6,414,319 bytes
    >> BOCU-1 5,897,258 bytes
    >> Legacy encoding (*) 5,477,432 bytes
    >> (*) KS C 5601, KS X 1001, or EUC-KR
    >
    > Sorry to pick on this (when I have to thank you). Even with the
    > coded character set vs. character encoding scheme distinction aside
    > (that is, if we just think in terms of character repertoire), KS C 5601/
    > KS X 1001 _alone_ cannot represent any Korean text unless you're
    > willing to live with double-width spaces, Latin letters, numbers and
    > punctuation. (Since you wrote that the file apparently has full stops
    > and spaces in ASCII, it does include characters outside KS X 1001.)
    > On the other hand, EUC-KR (KS X 1001 + ISO 646:KR/US-ASCII) can.
    > Actually, I suspect the legacy encoding used was Windows code page 949
    > (or JOHAB/Windows-1361?), because I can't imagine there is not a single
    > syllable (that is, outside the character repertoire of KS X 1001) out
    > of over 2 million syllables.

    Sorry, I should have noticed on Atkin and Stansifer's data page
    (http://www.cs.fit.edu/~ryan/compress/) that the file is in EUC-KR. All
    I knew was that I was able to import it into SC UniPad using the option
    marked "KS C 5601 / KS X 1001, EUC-KR (Korean)".

    >> I used my own SCSU encoder to achieve these results, but it really
    >> wouldn't matter which encoder was chosen -- Korean syllables can be
    >> encoded in SCSU *only* by using Unicode mode. It's not possible to
    >> set a window to the Korean syllable range.
    >
    > Now that you've told me you used NFC, isn't this situation similar
    > to Chinese text? How do BOCU and SCSU work for Chinese text?
    > Japanese text might do slightly better with Kana, but isn't likely
    > to be much better.

    Well, *I* didn't use NFC for anything. That's just how the file came to
    me. And yes, the situation is exactly the same for Chinese text, except
    I suppose that with 20,000-some basic Unihan characters, plus Extension
    A and B, plus the compatibility guys starting at U+F900, one might not
    realistically expect any better than 16 bits per character. OTOH, when
    dealing with 11,172 Hangul syllables interspersed with Basic Latin, I
    imagine there is some room for improvement over UTF-16.
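
    A rough back-of-the-envelope check of how much room there is (just a
    sketch in Python, not taken from any existing tool):

        # The modern Hangul syllable repertoire: 11,172 precomposed
        # syllables (U+AC00..U+D7A3) need about log2(11172) ~= 13.45 bits
        # each before any statistical modeling, while UTF-16 spends a
        # flat 16 bits.  That leaves roughly 2.5 bits of slack per
        # syllable, plus whatever real-text redundancy a smarter coder
        # could exploit.
        import math
        syllables = 0xD7A3 - 0xAC00 + 1     # 11,172 code points
        print(math.log2(syllables))         # ~13.45 bits per syllable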

    I'm intrigued by the improved performance of BOCU-1 on Korean text, and
    I'm now interested in finding a way to achieve even better compression
    of Hangul syllables, using a strategy *not* much more complex than SCSU
    or BOCU and *not* involving huge reordering tables. Your assistance,
    and anyone else's, would be welcome. Googling for "Korean compression"
    or "Hang[e]ul compression" turns up practically nothing, so there is a
    chance to break some new ground here.
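
    One direction I've been toying with, purely as a sketch (none of this
    is an existing format): every modern syllable factors arithmetically
    into lead/vowel/trail indices via the standard Hangul decomposition
    formula, so a Korean-specific coder could operate on those three small
    indices rather than on the raw 11,172-point syllable range. In Python:

        # Standard arithmetic decomposition of a precomposed Hangul
        # syllable (the Hangul algorithm in the Unicode Standard).
        SBASE, VCOUNT, TCOUNT = 0xAC00, 21, 28

        def decompose(ch):
            """Return (lead, vowel, trail) indices for a modern syllable."""
            s = ord(ch) - SBASE
            if not 0 <= s < 11172:
                raise ValueError("not a precomposed Hangul syllable")
            lead, rest = divmod(s, VCOUNT * TCOUNT)
            vowel, trail = divmod(rest, TCOUNT)
            return lead, vowel, trail           # 0..18, 0..20, 0..27

        print(decompose("\uD55C"))              # U+D55C HAN -> (18, 0, 4)

    The three indices are small (19 x 21 x 28 values) and far from
    uniformly distributed in real text, which is the kind of structure a
    simple byte-oriented scheme in the SCSU/BOCU family might be able to
    exploit; whether it could actually beat the BOCU-1 figure above is
    exactly the open question.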

    John Cowan <cowan at mercury dot ccil dot org> responded to Jungshik's
    comment about Kana:

    > The SCSU paper claims that Japanese does *much* better in SCSU than
    > UTF-16, thanks to the kana.

    The example in Section 9.3 would appear to substantiate that claim, as
    116 Unicode characters (= 232 bytes of UTF-16) are compressed to 178
    bytes of SCSU.

    Back to Jungshik:

    >> Only the large number of spaces and full
    >> stops in this file prevented SCSU from degenerating entirely to 2
    >> bytes per character.
    >
    > That's why I asked. What I'm curious about is how SCSU and BOCU
    > of NFD (and what I and Kent [2] think should have been NFD, with a
    > possible code point rearrangement of the Jamo block to facilitate a
    > smaller window size for SCSU) would compare with uncompressed UTF-16
    > of NFC (where SCSU/BOCU isn't much better than UTF-16). A back-of-
    > the-envelope calculation gives me 2.5 ~ 3 bytes per syllable (without
    > the code point rearrangement to put them within a 64-character-long
    > window [1]), so it's still worse than UTF-16. However, that's not as
    > bad as ~5 bytes (or more) per syllable without SCSU/BOCU-1. I have to
    > confess that I have only a very cursory understanding of SCSU/BOCU-1.

    When this file is broken down into jamos (NFD), SCSU regains its
    supremacy:

    UTF-8: 17,092,140 bytes
    BOCU-1: 8,728,553 bytes
    SCSU: 7,750,957 bytes
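
    The blow-up is easy to see: each precomposed syllable decomposes into
    two or three conjoining jamos, and each jamo (in the U+1100..U+11FF
    block) takes three bytes in UTF-8, just as the original syllable did.
    A quick sanity check in Python (assuming a local copy of the file,
    which per the data page above is in EUC-KR):

        import unicodedata

        # "arirang.txt" here is whatever local copy of the file you have.
        text = open("arirang.txt", encoding="euc-kr").read()
        nfd = unicodedata.normalize("NFD", text)
        print(len(text), len(nfd))       # characters before and after NFD
        print(len(nfd.encode("utf-8")))  # should land near the 17 MB figure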

    And you are correct that SCSU (and, for that matter, BOCU-1)
    performance would have been better if the jamos used in modern Korean
    had been arranged to fit in a 128-character window (64 would not have
    been necessary). As it is, SCSU does have to do some switching between
    the two windows. Of course, since each syllable expands to at least
    two jamos, no byte-per-character format applied to jamos could even
    match UTF-16 applied to syllables, i.e. 2 bytes per syllable.
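
    For reference, here is the arithmetic behind the "two windows" remark,
    as a small sketch (the ranges below are the modern conjoining jamos):

        # Modern conjoining jamos run from U+1100 (leading kiyeok) to
        # U+11C2 (trailing hieuh): 195 code points in all, which cannot
        # fit inside one 128-character SCSU dynamic window, hence the
        # switching between a window over the leading/vowel range and
        # one over the trailing range.
        lead  = range(0x1100, 0x1113)   # 19 leading consonants
        vowel = range(0x1161, 0x1176)   # 21 vowels
        trail = range(0x11A8, 0x11C3)   # 27 trailing consonants
        print(trail[-1] - lead[0] + 1)  # 195, which is > 128

    The 19 + 21 + 27 = 67 jamos actually used in modern Korean would have
    fit comfortably in a single 128-character window had they been given
    contiguous code points (though not in a 64-character one).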

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


