Re: Compression of Unicode Strings

From: Asmus Freytag (
Date: Mon Sep 18 1995 - 11:50:56 EDT

To summarize your mail:
>I am working on a proposal to support Unicode on
>digital cellular phones with the capability of
>transmitting short text messages (approximately 140 bytes).
>Currently GSM uses a 7-bit specialized character set to
>transmit these messages, which allows 160 characters to be
>transmitted in 140 bytes. My goal is to suggest a simple
>compression technique that will allow at least 160 Unicode
>characters to be compressed into 140 bytes. It must be
>something that can be implemented by machines that have very
>few computing resources. Typically a phone would be able to
>dedicate no more than 1K of RAM and approximately 10K of code
>space to such an algorithm.

The requirements stated are for a scheme that is able to represent
all of Unicode and at the same time not use more space than the
previous scheme (at the very least not for the same characters).
Ideally, it should also be as compact as competing encodings.

For this kind of situation (short strings, limited processing
power) regular compression algorithms do poorly, because they
require context and/or large tables. For this set of requirements
a simple 'sliding window' approach is preferable. Effectively,
this amounts to run-length encoding the top byte (or top n bits):
the high-order bits are sent only when they change. The decoder
is very simple and need only do offset calculations (a few lines
of C code). The encoder can improve results by deeper look-ahead
and by exploring alternate encodings, but it can be kept simple
as well, sacrificing some efficiency.
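
To make the idea concrete, here is a minimal sketch of such a
window scheme for 16-bit Unicode. The byte format is an assumption
invented for this illustration, not an existing standard: a byte
below 0x80 is an offset into the current 128-character window, and
a byte of 0x80 or 0x81 together with the byte that follows selects
a new window. The names window_decode and window_encode are
likewise hypothetical. The encoder is the simple kind, with no
look-ahead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte format (an assumption for illustration only):
 *   0x00..0x7F      one character = window base + byte value
 *   0x80 or 0x81    window switch: the low bit of this byte and the
 *                   next byte form a 9-bit window index; the new
 *                   base is index * 128
 * Decoding starts in window 0, so plain ASCII needs no switch. */
size_t window_decode(const uint8_t *in, size_t in_len,
                     uint16_t *out, size_t out_cap)
{
    uint16_t base = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_len; i++) {
        uint8_t b = in[i];
        if (b < 0x80) {
            if (n == out_cap)
                break;                      /* output buffer full */
            out[n++] = (uint16_t)(base + b);
        } else {
            if (i + 1 >= in_len)
                break;                      /* truncated window switch */
            base = (uint16_t)((((b & 0x01u) << 8) | in[i + 1]) << 7);
            i++;                            /* consume the index byte */
        }
    }
    return n;                               /* characters written */
}

/* Simple encoder without look-ahead: emit a window switch whenever
 * the top 9 bits of the next character differ from the current base. */
size_t window_encode(const uint16_t *in, size_t in_len,
                     uint8_t *out, size_t out_cap)
{
    uint16_t base = 0;
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < in_len; i++) {
        uint16_t want = (uint16_t)(in[i] & 0xFF80u);
        if (want != base) {
            if (out_cap - n < 2)
                break;                      /* no room for the switch */
            uint16_t idx = (uint16_t)(want >> 7);
            out[n++] = (uint8_t)(0x80u | (idx >> 8));
            out[n++] = (uint8_t)(idx & 0xFFu);
            base = want;
        }
        if (n == out_cap)
            break;                          /* no room for the offset */
        out[n++] = (uint8_t)(in[i] & 0x7Fu);
    }
    return n;                               /* bytes written */
}
```

Text that stays within one window costs one byte per character
plus two bytes for each window switch. Note that this sketch still
spends a full byte on every character, so by itself it does not
reach 160 characters in 140 bytes; that would need tighter packing
of the offsets, which is left out here.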

>I am trying to explain that a separate compression technique on
>top of Unicode is a better solution than the T.51 proposal.

There are a number of very simple reasons why your scheme is
preferable. The most important is that it would be able to handle
all the characters that are in Unicode and preserve their identity.
As Unicode is quickly becoming the character set of choice on the
desktop, it will make it a lot easier for you to tie PCs and phones
together. T.51 is not widely supported natively, and converting
between it and Unicode requires tables. The second reason is that
your onerous bandwidth limitations may not last forever, at which
point you may want to switch to native Unicode.

Some Unicode implementers have had to address this before. Unicode
should look into the possibility of making recommendations for this
case. Better to have fewer schemes than more. Also, better to have
a scheme that expresses ALL of Unicode (as the UTFs do) than to
use alternate encodings with different repertoires.

