RE: Compression - binary ordered

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Fri Jun 01 2001 - 12:26:20 EDT


Mark,

This sounds like a great idea. I was wondering, however, whether spaces in text from non-plane 0 character sets will cause problems with compression efficiency. Maybe you should consider a special case for spaces.

Maybe you could offset the displacement values to make room for special markers.

Encoded Offset    Actual Offset
      +2          +1
      +1           0
       0          Space character
      -1          Restart next character from offset 0 (Resync)
      -2          -1
      -3          -2

If nothing else it should give better Korean compression.
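To make the table concrete, here is a rough sketch of the remapping in Python. It works on symbolic integer offsets only, not on the byte-level encoding in Mark's draft. The names (shift, unshift, encode, decode), the starting base of 0, and the choice that a space leaves the previous-character state untouched are all assumptions made for illustration, not part of the proposal.

```python
SPACE_CODE = 0     # encoded value reserved for U+0020 (per the table)
RESYNC_CODE = -1   # encoded value reserved for "restart from offset 0"

def shift(actual):
    """Map an actual offset to an encoded offset, skipping 0 and -1."""
    return actual + 1 if actual >= 0 else actual - 1

def unshift(encoded):
    """Inverse of shift(); only valid for encoded values other than 0 and -1."""
    return encoded - 1 if encoded > 0 else encoded + 1

def encode(text):
    prev = 0                      # assumed initial base
    out = []
    for ch in text:
        if ch == " ":
            # Assumption: a space does not move the base, so the next
            # character is still measured from the last non-space one.
            out.append(SPACE_CODE)
            continue
        cp = ord(ch)
        out.append(shift(cp - prev))
        prev = cp
    return out

def decode(codes):
    prev = 0
    chars = []
    for code in codes:
        if code == SPACE_CODE:
            chars.append(" ")
        elif code == RESYNC_CODE:
            prev = 0              # next offset is measured from 0 again
        else:
            prev += unshift(code)
            chars.append(chr(prev))
    return "".join(chars)
```

With this arrangement a space between two Hangul syllables costs one small symbol, and the offset that follows it stays as small as it would be without the space, instead of jumping down to U+0020 and back up.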

The Resync could be used prior to a null character as a string terminator. Nulls not preceded by a resync are not termination nulls.
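As a small illustration of the terminator idea, the sketch below scans a list of encoded offsets for a Resync immediately followed by a null. It assumes the offset table above: after a Resync the null (U+0000) has actual offset 0, which is encoded as +1. The function name and the list-of-integers representation are hypothetical.

```python
RESYNC_CODE = -1        # "restart next character from offset 0"
NUL_AFTER_RESYNC = 1    # actual offset 0 from base 0, shifted to +1

def find_terminator(codes):
    """Return the index of the Resync that introduces the terminating
    null, or -1 if the sequence has no such terminator."""
    for i in range(1, len(codes)):
        if codes[i - 1] == RESYNC_CODE and codes[i] == NUL_AFTER_RESYNC:
            return i - 1
    return -1
```

A null that is not preceded by a Resync is never matched, so embedded nulls pass through as ordinary data.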

This scheme would require a slight modification to comparison routines. However, you should still be able to compare without full decoding. The Resync will cause a problem with compares, while the space insertion will only require minor adjustments.

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Mark Davis
Sent: Thursday, May 31, 2001 11:27 PM
To: Unicode
Cc: Unicore
Subject: Compression - binary ordered

As a by-product of our recent work on collation, we developed a method of
Unicode compression that is similar to SCSU, in that small alphabets are
about a byte per character and large alphabets are about two bytes per
character.

The main difference from SCSU is that this method preserves binary order. As
this is a hot topic right now, I thought it might be of interest. The latest
draft description is on http://oss.software.ibm.com/icu/develop/bocu.htm.
Comments are welcome.

Mark
—————

πάντων µέτρον ἄνθρωπος — Πρωταγόρας

[http://www.macchiato.com]


