Re: Reuters Compression Scheme for Unicode (RCSU) [vs. UTF-8 vs. other compression schemes]

From: Alain LaBonté SCT (alb@riq.qc.ca)
Date: Thu Jul 03 1997 - 01:04:02 EDT


At 12:28 01/07/97 -0700, Kenneth Whistler (UNICODE.ORG) wrote:

[Randy] :
>> I was reviewing the RCSU paper from the UIC-9 proceedings.
>>(I did not attend UIC-10 and do not have those proceedings).
>>I was wondering if anyone has any stats on UTF-8 for comparison
>>purposes? How do LZW and RCSU do compared to UTF-8 in
>>terms of speed?
>>
>> Does anyone have any data on the size of UTF-8 vs Unicode? I realize
>>that UTF-8 will be 50% of Unicode's size for characters in the 7-bit ASCII range
>>[...]
>>It appears the RCSU paper has an idea of typical data,
>>so how does that typical data measure up in size against UTF-8?
>>I assume that the RCSU authors have some idea of "typical data", which is
>>why they were able to conclude that UTF-8 was not good enough for
>>their purposes.
>>
>> Thanks in advance.
>>Randy
>
>[...]

[Kenneth] :
>I don't have statistical data for actual texts in hand, but
>I can take a crack at quantifying this.
>
>[...]
>
>4b. Text for a typical European language with accented characters.
>
> Estimate: 5-10% non-ASCII characters (depending on language)
>
> Unicode --> UTF-8 maybe 11-22% size expansion (and not as good
> as what an RCSU compression would accomplish)
>
>Anybody want to quantify these guesses further?

[Alain] :
I have already compiled personal statistics on this *for French*. According
to a sample of texts on my machines, UTF-8 implies an increase in storage of
roughly 3% relative to Latin-1 (to be prudent, double this to 6% if you
wish; the variance is roughly of that order): that means about 47%
compression compared to 16-bit UNICODE (I don't know how it compares to the
Reuters Compression Scheme for UNICODE [RCSU], as I don't know that scheme).
I used to be against UTF-8 and thought it was not necessary to go to that
complexity: why not go directly to fixed 16-bit UNICODE, with UCS-4
extensions if necessary, or better still (so I thought, without any nuance,
at the time) to UCS-4, and leave compression to compression black boxes
(there will always be better ones than what was done yesterday!)?
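
Here is a minimal sketch, in Python, of the kind of byte counting behind
these figures (the sample sentence is an arbitrary stand-in for the texts I
actually measured, so the exact percentages will differ):

    # Count the bytes a short French sample occupies in Latin-1, UTF-8
    # and plain 16-bit UNICODE (UCS-2), then derive the expansion and
    # compression figures discussed above.
    sample = ("Chat échaudé craint l'eau froide ; "
              "l'enfer est pavé de bonnes intentions.")

    latin1 = sample.encode("latin-1")    # 1 byte per character
    utf8 = sample.encode("utf-8")        # 1 byte for ASCII, 2 for accents
    ucs2 = sample.encode("utf-16-be")    # fixed 2 bytes per character

    expansion = (len(utf8) - len(latin1)) / len(latin1) * 100
    compression = (1 - len(utf8) / len(ucs2)) * 100

    print("Latin-1: %d bytes" % len(latin1))
    print("UTF-8:   %d bytes (+%.1f%% vs Latin-1)" % (len(utf8), expansion))
    print("UCS-2:   %d bytes (UTF-8 saves %.1f%%)" % (len(ucs2), compression))

On a real corpus with 3-6% accented characters, the expansion over Latin-1
is simply that same 3-6% (each accented character costs one extra UTF-8
byte), and the saving relative to UCS-2 comes out near 47-48%.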

But the statistics I produced myself, honestly intending to eventually prove
that UTF-8 was bad (I had an intuitive prejudice against it), cooled me down
a lot, and I had to change my mind somewhat, or at least nuance it: UTF-8
could objectively make sense for transmission (for reasons other than
compression, such as perhaps solving problems with control characters and
esoteric commands and filing, I don't know), if not for processing, in
particular if it really spreads around the world with appropriate external
tags [hopefully MIME, over the Internet].

I fear bugs like the bubonic plague, though, if UTF-8 is *kept* for
processing or even for storage (mainly search bugs and inconsistencies, in
addition to presentation problems). In French we say: "Chat échaudé craint
l'eau froide" [a scalded cat fears even cold water]. I have been burnt [and
I continue to be, every day] by the horrifying QUOTED-PRINTABLE scheme [some
of my American friends rightly call it QUOTED UNREADABLE] and, worst of all,
by RFC 1522.
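
To see what I mean, here is a minimal sketch using Python's standard quopri
module (the sample sentence is arbitrary):

    import quopri

    # Quoted-Printable turns every non-ASCII byte into an =XX escape so
    # that 8-bit text survives 7-bit mail transports.
    text = "Chat échaudé craint l'eau froide".encode("latin-1")
    print(quopri.encodestring(text).decode("ascii"))
    # -> Chat =E9chaud=E9 craint l'eau froide
    # When a receiving agent fails to decode this, the reader sees the
    # raw =E9 escapes: hence "QUOTED UNREADABLE".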

Nevertheless, my latest conclusion is that UTF-8 is worth trying, although
it really puts English at an advantage (no change required for 7-bit ASCII,
which is not a sin per se, by the way) while putting an extra burden on
*all* other languages (which looks unfair at first sight, and that is at
least a venial sin! (; ). It is worth trying if it is the price to pay to
finally get truly universal email communications, and to get the
English-speaking world at large to switch to *8*-bit *oct*ets (it should be
obvious, shouldn't it? (; ), the minimal compromise we are asking of them,
and one that parties of good will already make gracefully.

I would not be ready to pay a cent for UTF-7, though: UTF-7 would just
contribute to spreading the great bubonic plague of *7*-bit *oct*ets (a
12.5% loss of capacity from the start, isn't it? why not use Baudot coding,
then? (; ), which cause perhaps 80% of the world's email communication
problems over the Internet, problems due, as we know, not to the packet
transfer protocol but simply to an unreasonable dogma in SMTP (fortunately
violated more and more; it should be eradicated).
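
A minimal sketch of the extra cost, using Python's built-in utf-7 codec
(byte counts depend on the text; this sample sentence is arbitrary):

    # Compare UTF-8 and UTF-7 byte counts for the same French sentence.
    text = "Chat échaudé craint l'eau froide"

    utf8 = text.encode("utf-8")
    utf7 = text.encode("utf-7")

    print(len(utf8), utf8)  # each accented letter costs 2 bytes
    print(len(utf7), utf7)  # each becomes +AOk- : 5 bytes, the base64 of
                            # its 16-bit code, just to stay within 7 bits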

The UTF-7 proposers had good intentions, I am absolutely sure, but as we
also say in French: "l'enfer est pavé de bonnes intentions" [hell is paved
with good intentions]. That said with all due respect for the proposers,
whose contribution was of course a factor of progress (which I respect; it
is by its mistakes that mankind learns).

Now, RCSU would perhaps benefit from being better known, but I fear it won't
be as universal as UTF-8, because if compression is the only issue there
will always be better schemes, and one should remain independent of those
means. (By the way, could somebody enlighten us about fractal compression,
apparently a panacea in the compression world? I am impatient to hear about
it. I heard a rumour that it achieves 100:1 compression even on random
binary data, without *any* data loss, which I find extremely hard to
believe, but I am always open; after all, the whole universe is a miracle in
itself, and I believe more and more that "the gods" created it in a
completely numerical form (; , we simply have not found all the keys yet,
even if fractals are a great leap!) That said, black boxes do a great job of
compression (state-of-the-art high-speed modems being a good case in point,
more and more).

Alain LaBonté
Iraklion, Ellas


