Version |
1.4 | |
Author |
Mark Davis | |
Title |
Simple Unicode Text Compression Format |
The following presents a concrete proposal for a simple text compression scheme for Unicode, embodying ideas already presented by Misha Wolf, plus further refinements from Ken Whistler and myself. The goal is to have a simple, standard way of compressing Uniocde that can be standard across different platforms, much like UTF-8.
Note: this compresses Unicode text into 8-bit binary data. This data is a compression format, NOT an encoding format.
Must be very fast; shouldn't degrade performance significantly for compression OR decompression. Decompression is the most crucial; compression performance could be traded off for quality.
Should work for small strings as well as large. Tuned to Unicod e allocation and expected useage: longish runs of characters from a single script, with some intermixed symbols or punctuation. Doesn't have to work for random character strings, or be a general-purpose compression algorithm.
Should be simple code, easy to implement and maintain, fast to download.
Should leave large character sets (Han, Hangul) at about 2 bytes/char, and small character sets (English, French, Arabic) at about 1 byte/char. A Latin-1 sequence, in particular, just encodes as Latin-1.
Every Unicode character can be represented, and no characters are lost in compression/decompression.
The output could be further compressed by standard byte-based Huffman or LZW algorithms.
For performance reasons, we restrict ourselves to simple byte encodings, and not bit-twiddles. We can do a fast switch on certain "tag" byte values which are chosen to be in a small contiguous range, then have tight loops with simple exit criteria.
There are two modes: single-byte mode, and raw Unicode mode. Single byte mode treats bytes of less than 80 as ASCII (except for tags). Bytes over 80 are shifted according to a current "window" offset, to map onto 80 bytes somewhere else in Unicode. There are separate defined offsets that can be switched to, called window-0, window-1,etc. There are restrictions to the windows; we do not attempt to map Han or Hangul characters because their distribution characteristics do not allow frequent grouping within small blocks.
We choose the tags to be from 00 to 1F, skipping common controls like CR. NOTE: since this is a compression format, and not interworkable as-is with another character encoding, any infrequent characters could be chosen!
The tags within raw Unicode mode are chosen from the private use area , since they are not in frequent use.
There are 8 dynamic windows accessable from different controls. In addition, there are 8 static windows that are only accessable with SQn. These are used to quote common punctuation characters, ones that are not typically found in runs. Window 8 is used to quote a tag byte.
In the table below, hbyte is the high byte of a Unicode character (e.g. (unichar >> 8) & 0xFF), and lbyte is the low byte of a Unicode character (e.g. unichar & 0xFF)
Tag |
Arguments |
Function | ||
---|---|---|---|---|
SQU |
hbyte, lbyte |
Quote a Unicode character (e.g. single-shift Unicode)
| ||
SCU |
|
Change to Unicode mode. (e.g. locking-shift Unicode) | ||
SQn |
byte |
Quote a character from window n. (e.g. single-shift W0) | ||
SCn |
|
Change to window n. (e.g. locking-shift W0) | ||
SDn |
byte |
Define window n as OffsetTable[byte], and change to it. | ||
SKU |
|
<change to compatibility Unicode mode> |
Tag |
Arguments |
Function | ||
---|---|---|---|---|
UQU |
hbyte, lbyte |
Quote a Unicode character (e.g. single-shift) | ||
UC0 |
|
Change to Single-Mode, window n (e.g. locking-shift) | ||
UD0 |
byte |
Define window n as OffsetTable[byte], and change to it. |
This mode does not use tags, instead, any Unicode character T from E000 to EFFF terminates the mode. This termination character T also redefines Window 0, using an offset of ((T << 4) & 0xFFF0). (The mask is unnecessary on machines with 16bit unsigned shorts)
Surrogates can be encoded in Unicode mode. In single-mode, they can be quoted as two characters: SQU hbyte1 lbyte1 SQU hbyte2 lbyte2.
Issue : We could recognize SQU hbyte1 lbyte1 hbyte2 lbyte2 as a special case. This makes quoted surrogates slightly smaller, but does complicate the algorithm slightly.
Issue : There is no provision for setting a window to a surrogate range. We could use one new tag to do so, e.g. SSD x, y, where the top 3 bits of x are the window, and the remaining bits of x and y indicate a multiple of 0x80 to be added to 0x10000.
The format assumes certain default settings for the windows; these allow use of the windows without predefining them, saving a few bytes for common cases. The choice of offsets is not meant to put other languages at a disadvantage; they are to enable handling most languages by requiring at most the definition of one extra window, at a cost of a single byte.
You always start out in Window 0, so a string of pure Latin-1 graphic characters is converted with no change.
Window |
Major Area Covered |
Starting Offset |
---|---|---|
0 |
Latin-1 |
0x0080 |
1 |
Latin-1 Extended A |
0x0100 |
2 |
Hiragana |
0x3040 |
3 |
Katakana |
0x3080 |
4 |
Fullwidth Variants |
0xFF00 |
5 |
Halfwidth Kana |
0xFF60 |
6 |
Greek |
0x0370 |
7 |
Cyrillic |
0x0400 |
Window |
Major Area Covered |
Starting Offset |
---|---|---|
8 |
ASCII (for quoted tags |
0x0000 |
9 |
Latin-1 |
0x0080 |
10 |
Combining Diacriticals |
0x0300 |
11 |
General Punctuation |
0x2000 |
12 |
Currency Symbols |
0x2080 |
13 |
CJK Symbols & Punct |
0x3000 |
14 |
tbd |
tbd |
15 |
tbd |
tbd |
S- U-
xxx 00 n/a
-Q0 01 n/a
-Q1 02 n/a
-Q2 03 n/a
-Q3 04 n/a
-Q4 05 n/a
-Q5 06 n/a
-Q6 07 n/a
-Q7 08 n/a
xxx 09 n/a
xxx 0A n/a
-CU 0B n/a
xxx 0C n/a
xxx 0D n/a
-QU 0E F0
-KU 0F n/a
-C0 10 E0
-C1 11 E1
-C2 12 E2
-C3 13 E3
-C4 14 E4
-C5 15 E5
-C6 16 E6
-C7 17 E7
-D0 18 E8
-D1 19 E9
-D2 1A EA
-D3 1B EB
-D4 1C EC
-D5 1D ED
-D6 1E EE
-D7 1F EF
xxx 7F n/a
Any control characters marked "xxx" are treated like ASCII characters and passed through unchanged: 00 (NL), 09 (HT), 0A (LF), 0C, (FF), 0D (CR), 7F (DEL).
The OffsetTable is defined as follows (all numbers in hex):
Byte x |
OffsetTable[x] |
Coverage | |
---|---|---|---|
00 |
n/a |
<reserved for internal use> | |
01..63 |
x*80 |
half-blocks from U+0080 to U+3180 | |
64..A3 |
x*80+AE00 |
half-blocks from U+E000 to U+FF80 | |
A4..F9 |
<reserved> |
<reserved for future extension> | |
FA |
00C0 |
Latin1 letters + half of Extended A | |
FB |
0250 |
Standard Phonetic | |
FC |
0370 |
Greek | |
FD |
0530 |
Armenian | |
FE |
3040 |
Hiragana | |
FF |
FF60 |
Half-width Katakana |
This gets all the scripts that cross a half-block boundary, plus a particular useful segment of European characters. Some miscellaneous symbols and punctuation also cross, but they are unlikely to be (a) frequent, or (b) used in a solid run, and are so not included.