Re: Reuters Compression Scheme for Unicode (RCSU)

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 01 1997 - 15:27:50 EDT


Dear Mr. Williams,

> I was reviewing RCSU paper from the UIC-9 proceedings.
>(I did not attend UIC-10 and do not have those proceedings).
>I was wondering if anyone has any stats on UTF-8 for comparison
>purposes? How does LZW and RCSU do compared to UTF-8 in
>terms of speed?
>
> Does anyone have any data on the size of UTF-8 vs Unicode? I realize
>that UTF-8 will be 50% in size for characters in the 7-bit ASCII range
>and that Asian scripts with pure DBCS characters will be 150% in
>size for UTF-8. It appears RCSU paper has an idea of typical data,
>so how does that typical data measure up in size with UTF-8?
>I assume that RCSU authors have some idea of "typical data" and thus
>why they were able to conclude that UTF-8 was not good enough for
>their purposes.
>
> Thanks in advance.
>Randy

I don't have statistical data for actual texts in hand, but
I can take a crack at quantifying this.

Here are the significant classes of size reduction/size expansion
for UTF-8.

1. 7-bit ASCII (English only, no non-ASCII punctuation, e.g.
     typical email content on this list)

     Unicode --> UTF-8 50% size reduction (equal to RCSU)

2. Non-Latin scripts coded mostly at or below U+07FF (Greek,
     Cyrillic, Armenian, Hebrew, Arabic)

     Unicode --> UTF-8 no size change (roughly)

3. All other scripts (Indic, CJK, etc.)

     Unicode --> UTF-8 50% size expansion (roughly)

In cases 2 and 3, UTF-8 clearly behaves much worse than RCSU, which
provides significant compression except for Chinese and Korean.
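
The three classes follow from the UTF-8 byte widths: 1 byte up to
U+007F, 2 bytes up to U+07FF, and 3 bytes for the rest of the 16-bit
range. A minimal Python sketch, assuming "Unicode" here means the 16-bit
form (2 bytes per character, no byte order mark); the helper name and
the sample strings are illustrative only:

```python
def utf8_vs_16bit(text: str) -> float:
    """Return the UTF-8 size divided by the 16-bit Unicode size of `text`."""
    utf8_bytes = len(text.encode("utf-8"))
    # "utf-16-le" writes no byte order mark; each 16-bit character is 2 bytes.
    unicode_bytes = len(text.encode("utf-16-le"))
    return utf8_bytes / unicode_bytes


# Illustrative samples, one per class (escapes keep the source pure ASCII).
samples = {
    "ascii": "plain email text",
    "greek": "\u03b1\u03b2\u03b3\u03b4\u03b5",  # Greek small letters alpha..epsilon
    "cjk": "\u6f22\u5b57\u4eee\u540d",          # four CJK ideographs
}

ratios = {name: utf8_vs_16bit(sample) for name, sample in samples.items()}
for name, ratio in ratios.items():
    print(f"{name}: UTF-8 is {ratio:.2f}x the 16-bit size")
```

For running Greek or Cyrillic text the ratio lands a little below 1.0,
since spaces and ASCII punctuation still take one byte each, hence the
"roughly" in case 2.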

The only instance where you have to apply some statistics
regarding text frequency is the remaining case:

4. General Latin script (including accented characters, full punctuation,
     other symbols, etc.)

For this, maybe somebody will want to do the text analysis, but I
would ballpark it roughly as follows:

4a. English text (with occasional non-ASCII punctuation such as
     directed quotes, dashes, some accented characters, occasional
     symbols, etc.)

     Estimate: 2-3% non-ASCII characters.

     Unicode --> UTF-8 maybe 5-6% size expansion relative to the
                             pure-ASCII case (and comparable size to
                             what an RCSU compression would accomplish)

4b. Text for a typical European language with accented characters.

     Estimate: 5-10% non-ASCII characters (depending on language)

     Unicode --> UTF-8 maybe 11-22% size expansion relative to the
                             pure-ASCII case (and not as good as what
                             an RCSU compression would accomplish)
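
The arithmetic behind estimates of this kind can be sketched as follows.
This is an assumed model, not a measurement: every ASCII character takes
1 byte in UTF-8, every non-ASCII character takes one fixed width (2 bytes
for accented Latin letters up to U+07FF, 3 bytes for directed quotes and
dashes in the U+2000 block), and expansion is measured against one byte
per character. The function names are hypothetical.

```python
def utf8_bytes_per_char(p_non_ascii: float, non_ascii_width: int = 2) -> float:
    """Average UTF-8 bytes per character under the fixed-width assumption."""
    ascii_part = (1.0 - p_non_ascii) * 1          # ASCII: 1 byte each
    non_ascii_part = p_non_ascii * non_ascii_width  # assumed 2 or 3 bytes each
    return ascii_part + non_ascii_part


def expansion_over_ascii(p_non_ascii: float, non_ascii_width: int = 2) -> float:
    """Fractional growth relative to one byte per character."""
    return utf8_bytes_per_char(p_non_ascii, non_ascii_width) - 1.0


# Non-ASCII fractions taken from the 4a and 4b estimates.
for p in (0.02, 0.03, 0.05, 0.10):
    for width in (2, 3):
        growth = expansion_over_ascii(p, width)
        vs_16bit = utf8_bytes_per_char(p, width) / 2.0
        print(f"p={p:.2f} width={width}: "
              f"+{growth:.1%} over 1 byte/char, {vs_16bit:.1%} of 16-bit size")
```

Under this model the expansion equals the non-ASCII fraction times the
extra bytes per such character, so the result is sensitive to how many
of the non-ASCII characters are 3-byte punctuation rather than 2-byte
letters.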

Anybody want to quantify these guesses further?
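
One way to do that is to measure a text sample directly. A minimal
sketch, assuming the sample contains only characters from the 16-bit
range (so 16-bit Unicode is exactly 2 bytes per character); `utf8_stats`
and the sample string are hypothetical:

```python
def utf8_stats(text: str) -> dict:
    """Measure the non-ASCII fraction and UTF-8 size of `text`.

    Assumes every character fits in 16 bits, so the 16-bit Unicode size
    is taken as 2 bytes per character.
    """
    if not text:
        raise ValueError("text must not be empty")
    n_chars = len(text)
    n_non_ascii = sum(1 for ch in text if ord(ch) > 0x7F)
    utf8_bytes = len(text.encode("utf-8"))
    return {
        "non_ascii_fraction": n_non_ascii / n_chars,
        # Growth relative to one byte per character (the pure-ASCII case).
        "expansion_over_8bit": utf8_bytes / n_chars - 1.0,
        # UTF-8 size as a fraction of the 16-bit Unicode size.
        "ratio_to_16bit": utf8_bytes / (2 * n_chars),
    }


# Tiny illustrative sample; a real study would read a large corpus instead.
sample = "d\u00e9j\u00e0 vu"
stats = utf8_stats(sample)
for key, value in stats.items():
    print(f"{key}: {value:.3f}")
```

Running this over a few megabytes of text per language would replace
the estimated non-ASCII percentages with measured ones.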

--Ken Whistler


