Dear Mr. Williams,
> I was reviewing RCSU paper from the UIC-9 proceedings.
>(I did not attend UIC-10 and do not have those proceedings).
>I was wondering if anyone has any stats on UTF-8 for comparison
>purposes? How does LZW and RCSU do compared to UTF-8 in
>terms of speed?
> Does anyone have any data on the size of UTF-8 vs Unicode? I realize
>that UTF-8 will be 50% in size for characters in the 7-bit ASCII range
>and that Asian scripts with pure DBCS characters will be 150% in
>size for UTF-8. It appears RCSU paper has an idea of typical data,
>so how does that typical data measure up in size with UTF-8?
>I assume that RCSU authors have some idea of "typical data" and thus
>why they were able to conclude that UTF-8 was not good enough for
> Thanks in advance.
I don't have statistical data for actual texts in hand, but
I can take a crack at quantifying this.
Here are the significant classes of size reduction/size expansion:
1. 7-bit ASCII (English only, no non-ASCII punctuation, e.g.
typical email content on this list)
Unicode --> UTF-8 50% size reduction (equal to RCSU)
2. Non-Latin scripts coded mostly before U+07FF (Greek,
Cyrillic, Armenian, Hebrew, Arabic)
Unicode --> UTF-8 no size change (roughly)
3. All other scripts (Indic, CJK, etc.)
Unicode --> UTF-8 50% size expansion (roughly)
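The per-character arithmetic behind cases 1 to 3 can be checked with a minimal Python sketch. It assumes that "Unicode" here means the 16-bit form (2 bytes per character, encoded as UTF-16 without a byte order mark). The sample characters are illustrative picks, not drawn from any measured text.

```python
# Compare encoded sizes per character. "Unicode" is assumed to mean
# 2-byte UTF-16 with no byte order mark; sample characters are illustrative.
SAMPLES = {
    "1. 7-bit ASCII": "A",
    "2. Greek (below U+07FF)": "\u03b1",
    "3. CJK": "\u4e2d",
}


def size_change_percent(text):
    """Percent size change going from UTF-16 to UTF-8 (negative = smaller)."""
    utf16 = len(text.encode("utf-16-be"))
    utf8 = len(text.encode("utf-8"))
    return 100.0 * (utf8 - utf16) / utf16


for label, ch in SAMPLES.items():
    print(f"{label}: {size_change_percent(ch):+.0f}%")
```

Running the same function over a whole file's contents, rather than a single character, gives the overall figure for that text.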
In cases 2 and 3, UTF-8 clearly behaves much worse than RCSU, which
provides significant compression except for Chinese and Korean.
The only instance where you have to apply some statistics
regarding text frequency is the remaining case:
4. General Latin script (including accented characters, full punctuation,
other symbols, etc.)
For this, maybe somebody will want to do the text analysis, but I
would ballpark it roughly as follows:
4a. English text (with occasional non-ASCII punctuation such as
directed quotes, dashes, some accented characters, occasional
Estimate: 2-3% non-ASCII characters.
Unicode --> UTF-8 maybe 5-6% size expansion (and comparable
size to what an RCSU compression would
4b. Text for a typical European language with accented characters.
Estimate: 5-10% non-ASCII characters (depending on language)
Unicode --> UTF-8 maybe 11-22% size expansion (and not as good
as what an RCSU compression would accomplish)
Anybody want to quantify these guesses further?
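One way to start is to write the mixed-text arithmetic down explicitly. The sketch below is a simple model under stated assumptions, not measured data: it takes the fraction of non-ASCII characters and an assumed UTF-8 length for each of them (2 bytes for characters up to U+07FF such as accented Latin letters, 3 bytes for characters above that such as directed quotes). It reports the UTF-8 size relative to both 2-byte Unicode and a 1-byte-per-character baseline. The function name and parameters are hypothetical.

```python
def utf8_relative_size(non_ascii_fraction, bytes_per_non_ascii=2):
    """Model the UTF-8 size of mixed ASCII/non-ASCII text.

    Illustrative model only: every non-ASCII character is assumed to take
    the same number of UTF-8 bytes (2 or 3).
    Returns (ratio vs 2-byte Unicode, ratio vs 1 byte per character).
    """
    if not 0.0 <= non_ascii_fraction <= 1.0:
        raise ValueError("non_ascii_fraction must be between 0 and 1")
    if bytes_per_non_ascii not in (2, 3):
        raise ValueError("bytes_per_non_ascii must be 2 or 3")
    # ASCII characters take 1 byte each; the rest take the assumed width.
    utf8_bytes_per_char = (
        (1.0 - non_ascii_fraction) + non_ascii_fraction * bytes_per_non_ascii
    )
    return utf8_bytes_per_char / 2.0, utf8_bytes_per_char / 1.0


# Guessed non-ASCII fractions from cases 4a and 4b.
for fraction in (0.02, 0.03, 0.05, 0.10):
    for width in (2, 3):
        vs_unicode, vs_one_byte = utf8_relative_size(fraction, width)
        print(f"{fraction:.0%} non-ASCII at {width} bytes: "
              f"{vs_unicode:.3f} x Unicode, {vs_one_byte:.3f} x 1 byte/char")
```

Feeding in character counts from real texts in place of the guessed fractions would turn the ballpark figures into measurements.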