Re: Compression of Unicode Strings

From: Misha Wolf (MISHA.WOLF@reuters.com)
Date: Mon Sep 18 1995 - 18:57:10 EDT


Tim,

Asmus copied me on his reply (see below) to your question (also see below).
The Reuter Compression Scheme for Unicode might suit your requirements. The
scheme uses:

- one octet per character for languages using an alphabetic script,

- two octets per character for languages using an ideographic script.

A quick summary of this scheme follows:

- By default, all the ISO 8859-1 (ie Latin-1) characters use 1 octet each -
   the same octet as in ISO 8859-1. For example:

      LATIN CAPITAL LETTER A = 41

      LATIN CAPITAL LETTER A WITH GRAVE = C0.

- If a different alphabet is being used, eg Cyrillic, then the octet range
   00-7F is used to encode the ASCII characters and the octet range 80-FF is
   used to encode the characters of the other alphabet (eg Cyrillic). This
   is familiar to users of ISO 2022 and ISO 8859-X.

- Mapping functions are provided to:

   - Single Shift to UCS-2 (ie 16-bit Unicode)

   - Locking Shift to UCS-2

   - Locking Shift back to the Single Octet Format outlined above.

The scheme works as follows:

- In Single Octet Format, the 256 octet values are split into two equally-
   sized windows, of 128 code points each. The first window always covers
   the code points 0000-007F (ie the BASIC LATIN, aka ASCII, block).

- The start of the second (sliding) window may be moved to any code point
   which is a multiple of 10 (hex), but defaults to code point 0080. It is
   this default position that causes the ISO 8859-1 characters to be encoded
   using the same single octet as in ISO 8859-1.

- Three mapping functions are provided. These are described below.

- The octet 0E is used to encode a Single Shift to UCS-2.

- The octet 0F is used to encode a Locking Shift to UCS-2.

- The octet pair Ennn is used to encode:

   - a Locking Shift to Single Octet Format, together with

   - an instruction to move the sliding window.

- The Private Use Area value Ennn is derived as follows:

   - Take the desired starting position of the sliding window.

   - Strip off the trailing "0".

   - Prefix with "E".

- For example, halfwidth Katakana starts at FF60. To set the sliding
   window to this position one uses the mapping function EFF6.

- To move the sliding window while in Single Octet Format, one prefixes the
   Ennn function with a Single Shift to UCS-2, for example:

      0E EFF6.
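
The rules above can be sketched as a small decoder. This is only an illustration, not the Reuters implementation: the names `window_command` and `decode` are hypothetical, octet pairs are assumed to be big-endian (as the example 0E EFF6 suggests), and the sketch assumes that an Ennn value reached through a Single Shift moves the window without leaving Single Octet Format.

```python
SINGLE_SHIFT = 0x0E    # next octet pair is one UCS-2 value
LOCKING_SHIFT = 0x0F   # stay in UCS-2 until an Ennn value appears
DEFAULT_WINDOW = 0x0080


def window_command(window_start):
    """Return the Ennn value that moves the sliding window to window_start."""
    if window_start % 0x10 or not 0 <= window_start <= 0xFFF0:
        raise ValueError("window start must be a multiple of 0x10 in 0000-FFF0")
    # Strip the trailing hex digit and prefix with E.
    return 0xE000 | (window_start >> 4)


def decode(data):
    """Decode a bytes object into a string (sketch, assumptions noted above)."""
    out = []
    window = DEFAULT_WINDOW
    ucs2_locked = False
    i = 0
    while i < len(data):
        octet = data[i]
        if not ucs2_locked and octet == LOCKING_SHIFT:
            ucs2_locked = True
            i += 1
            continue
        if ucs2_locked or octet == SINGLE_SHIFT:
            if not ucs2_locked:
                i += 1  # skip the 0E prefix
            if i + 2 > len(data):
                raise ValueError("truncated UCS-2 value")
            value = (data[i] << 8) | data[i + 1]  # assumed big-endian
            i += 2
            if 0xE000 <= value <= 0xEFFF:
                # Ennn: move the window and (re)enter Single Octet Format.
                window = (value & 0x0FFF) << 4
                ucs2_locked = False
            else:
                out.append(chr(value))
            continue
        # Single Octet Format: 00-7F is ASCII, 80-FF indexes the sliding window.
        out.append(chr(octet if octet < 0x80 else window + (octet - 0x80)))
        i += 1
    return "".join(out)
```

For instance, `decode(bytes([0x0E, 0xEF, 0xF6, 0xB1]))` moves the window to FF60 and then reads B1 as code point FF91. Note that the decoder needs only the current window start and one flag as state.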

Please let me know whether this compression scheme is of use to you.

Regards,
Misha (misha.wolf@reuters.com)

---

At the recent Unicode Conference, we had a participant from Nokia who was in favor of the same idea. Unfortunately I don't have his name, but I assume that you would be able to locate him through channels in your industry.

As for a compression scheme, contact Misha.wolf@reuters.com

They had similar 'space neutral' issues, and created a 'sliding window' form of compression which is able to represent all of Unicode. He may be willing to share this with you.

For short strings, regular compression schemes don't work too well, as there's not enough context. Sliding window schemes therefore use 1- or 2-byte selectors for 128-character windows, which can be positioned so as to produce the shortest encoding.

The decoder works in a very simple manner, just doing offset calculations and does *not* need large tables or work-space. The encoder may get better results if it uses more resources to optimize the encoding.
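
To illustrate the encoder side, here is a deliberately simple greedy sketch for the scheme Misha describes. It is an assumption-laden example, not anyone's actual encoder: the function name `encode` is hypothetical, it looks ahead only one character, it never uses the Locking Shift to UCS-2 (so ideographic text costs three octets per character here), it aligns new windows to 128-code-point boundaries, and it rejects code points E000-EFFF because those values are taken by the window commands.

```python
def encode(text, window=0x0080):
    """Greedy encoder sketch: return the octets for text (BMP only)."""
    out = bytearray()
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF or 0xE000 <= cp <= 0xEFFF:
            raise ValueError(f"U+{cp:04X} cannot be represented in this sketch")
        if cp < 0x80 and cp not in (0x0E, 0x0F):
            out.append(cp)                      # fixed ASCII window
        elif window <= cp < window + 0x80:
            out.append(0x80 + cp - window)      # current sliding window
        else:
            new_window = cp & 0xFF80            # assumed 128-aligned window
            nxt = ord(text[i + 1]) if i + 1 < len(text) else -1
            if cp >= 0x80 and new_window <= nxt < new_window + 0x80:
                # The next character shares the window, so moving it pays off:
                # emit 0E Ennn, then the character as a single octet.
                out += bytes([0x0E, 0xE0 | (new_window >> 12),
                              (new_window >> 4) & 0xFF])
                window = new_window
                out.append(0x80 + cp - window)
            else:
                # Isolated character: Single Shift to UCS-2 (big-endian assumed).
                out += bytes([0x0E, cp >> 8, cp & 0xFF])
    return bytes(out)
```

With this sketch, a run of Cyrillic text costs three octets to move the window and then one octet per character. An encoder with more lookahead could choose window positions and locking shifts more cleverly.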

A./

Could someone give me some help with compression of Unicode strings? I am working on a proposal to ETSI (European Telecommunications Standards Institute) to support Unicode on GSM (Global System for Mobile Communications) cellular phones. These phones are digital cellular phones with the capability of transmitting short text messages (approximately 140 bytes). Currently GSM uses a 7-bit specialized character set to transmit these messages. This has obviously proved inadequate, as the GSM system has been adopted by over 60 countries (and growing rapidly), including China, Hong Kong, India, Middle Eastern countries, all of Europe, Russia, and others, with limited deployment in Canada, Mexico, and the U.S.

Currently there is a proposal before ETSI to adopt T.51, which is based on ISO 2022. I am desperately trying to kill this proposal in favor of Unicode, but there is a lot of political inertia in ETSI. Probably the major concern about adopting Unicode is that the 16-bit representation of characters is too space-inefficient. I am trying to explain that a separate compression technique on top of Unicode is a better solution than the T.51 proposal.

This is where I would like help. The current 7-bit encoding allows 160 characters to be transmitted in 140 bytes. My goal is to suggest a simple compression technique that will allow at least 160 Unicode characters to be compressed into 140 bytes. This compression technique must be something that can be implemented by machines, e.g., handheld cellular phones, that have very few computing resources. Typically a phone would be able to dedicate no more than 1K of RAM and approximately 10K of code space to such an algorithm.
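
The capacity figures in this paragraph follow from simple bit arithmetic, sketched below. The helper name `capacity` is made up for illustration, and the calculation ignores any per-message header or shift overhead.

```python
PAYLOAD_BYTES = 140  # approximate short-message payload quoted above


def capacity(bits_per_char, payload_bytes=PAYLOAD_BYTES):
    """Number of fixed-width characters that fit in the payload."""
    return payload_bytes * 8 // bits_per_char


print(capacity(7))   # 160: the current 7-bit encoding
print(capacity(16))  # 70: uncompressed 16-bit Unicode
print(capacity(8))   # 140: one octet per character
```

So reaching 160 characters in 140 bytes requires an average of at most 7 bits per character.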

If anyone knows of a compression technique that would fit this bill I would much appreciate them letting me know about it.

Also if anyone has any cogent arguments that clearly demonstrate the benefits of Unicode over ISO 2022, I would much appreciate them forwarded to me as well.

Thanks for your time and help.

Regards,

Tim Garton

--

Tim Garton
Engineering Section Manager
Motorola Inc.
GSM Subscriber Engineering
Phone: (708) 523-7790
E-mail: timg@eurpd.csg.mot.com
Fax: (708) 523-2545
Pager: 9404


