Draft Addition to the Unicode Standard

Version

1.4

Author

Mark Davis

Title

Simple Unicode Text Compression Format

The following presents a concrete proposal for a simple text compression scheme for Unicode, embodying ideas already presented by Misha Wolf, plus further refinements from Ken Whistler and myself. The goal is to have a simple, standard way of compressing Uniocde that can be standard across different platforms, much like UTF-8.

Note: this compresses Unicode text into 8-bit binary data. This data is a compression format, NOT an encoding format.

Design Goals

Fast

Must be very fast; shouldn't degrade performance significantly for compression OR decompression. Decompression is the most crucial; compression performance could be traded off for quality.

Targeted at short to medium-sized strings, real world text.

Should work for small strings as well as large. Tuned to Unicod e allocation and expected useage: longish runs of characters from a single script, with some intermixed symbols or punctuation. Doesn't have to work for random character strings, or be a general-purpose compression algorithm.

Simple

Should be simple code, easy to implement and maintain, fast to download.

Competitive with 2022

Should leave large character sets (Han, Hangul) at about 2 bytes/char, and small character sets (English, French, Arabic) at about 1 byte/char. A Latin-1 sequence, in particular, just encodes as Latin-1.

Complete, Non-Lossy

Every Unicode character can be represented, and no characters are lost in compression/decompression.

Layerable

The output could be further compressed by standard byte-based Huffman or LZW algorithms.

Proposal

For performance reasons, we restrict ourselves to simple byte encodings, and not bit-twiddles. We can do a fast switch on certain "tag" byte values which are chosen to be in a small contiguous range, then have tight loops with simple exit criteria.

There are two modes: single-byte mode, and raw Unicode mode. Single byte mode treats bytes of less than 80 as ASCII (except for tags). Bytes over 80 are shifted according to a current "window" offset, to map onto 80 bytes somewhere else in Unicode. There are separate defined offsets that can be switched to, called window-0, window-1,etc. There are restrictions to the windows; we do not attempt to map Han or Hangul characters because their distribution characteristics do not allow frequent grouping within small blocks.

We choose the tags to be from 00 to 1F, skipping common controls like CR. NOTE: since this is a compression format, and not interworkable as-is with another character encoding, any infrequent characters could be chosen!

The tags within raw Unicode mode are chosen from the private use area , since they are not in frequent use.

There are 8 dynamic windows accessable from different controls. In addition, there are 8 static windows that are only accessable with SQn. These are used to quote common punctuation characters, ones that are not typically found in runs. Window 8 is used to quote a tag byte.

In the table below, hbyte is the high byte of a Unicode character (e.g. (unichar >> 8) & 0xFF), and lbyte is the low byte of a Unicode character (e.g. unichar & 0xFF)

Single-Mode Tags

Tag

Arguments

Function

SQU

hbyte, lbyte

Quote a Unicode character (e.g. single-shift Unicode)
Used to quote tag bytes.

SCU

Change to Unicode mode. (e.g. locking-shift Unicode)

SQn

byte

Quote a character from window n. (e.g. single-shift W0)

SCn

Change to window n. (e.g. locking-shift W0)

SDn

byte

Define window n as OffsetTable[byte], and change to it.

SKU

<change to compatibility Unicode mode>

Unicode-Mode Tags

Tag

Arguments

Function

UQU

hbyte, lbyte

Quote a Unicode character (e.g. single-shift)
Used to quote tag bytes.

UC0

Change to Single-Mode, window n (e.g. locking-shift)
Similar to SCn

UD0

byte

Define window n as OffsetTable[byte], and change to it.

Compatibility Unicode-Mode

This mode does not use tags, instead, any Unicode character T from E000 to EFFF terminates the mode. This termination character T also redefines Window 0, using an offset of ((T << 4) & 0xFFF0). (The mask is unnecessary on machines with 16bit unsigned shorts)

Surrogates

Surrogates can be encoded in Unicode mode. In single-mode, they can be quoted as two characters: SQU hbyte1 lbyte1 SQU hbyte2 lbyte2.

Issue : We could recognize SQU hbyte1 lbyte1 hbyte2 lbyte2 as a special case. This makes quoted surrogates slightly smaller, but does complicate the algorithm slightly.

Issue : There is no provision for setting a window to a surrogate range. We could use one new tag to do so, e.g. SSD x, y, where the top 3 bits of x are the window, and the remaining bits of x and y indicate a multiple of 0x80 to be added to 0x10000.

Default Window Settings (Draft)

The format assumes certain default settings for the windows; these allow use of the windows without predefining them, saving a few bytes for common cases. The choice of offsets is not meant to put other languages at a disadvantage; they are to enable handling most languages by requiring at most the definition of one extra window, at a cost of a single byte.

You always start out in Window 0, so a string of pure Latin-1 graphic characters is converted with no change.

Dynamic Windows

Window

Major Area Covered

Starting Offset

0

Latin-1

0x0080

1

Latin-1 Extended A

0x0100

2

Hiragana

0x3040

3

Katakana

0x3080

4

Fullwidth Variants

0xFF00

5

Halfwidth Kana

0xFF60

6

Greek

0x0370

7

Cyrillic

0x0400

Static Windows

Window

Major Area Covered

Starting Offset

8

ASCII (for quoted tags

0x0000

9

Latin-1

0x0080

10

Combining Diacriticals

0x0300

11

General Punctuation

0x2000

12

Currency Symbols

0x2080

13

CJK Symbols & Punct

0x3000

14

tbd

tbd

15

tbd

tbd

Tag Allocation (Draft)

    S- U-
xxx 00 n/a
-Q0 01 n/a
-Q1 02 n/a
-Q2 03 n/a
-Q3 04 n/a
-Q4 05 n/a
-Q5 06 n/a
-Q6 07 n/a
-Q7 08 n/a
xxx 09 n/a
xxx 0A n/a
-CU 0B n/a
xxx 0C n/a
xxx 0D n/a
-QU 0E F0
-KU 0F n/a
-C0 10 E0
-C1 11 E1
-C2 12 E2
-C3 13 E3
-C4 14 E4
-C5 15 E5
-C6 16 E6
-C7 17 E7
-D0 18 E8
-D1 19 E9
-D2 1A EA
-D3 1B EB
-D4 1C EC
-D5 1D ED
-D6 1E EE
-D7 1F EF
xxx 7F n/a

Any control characters marked "xxx" are treated like ASCII characters and passed through unchanged: 00 (NL), 09 (HT), 0A (LF), 0C, (FF), 0D (CR), 7F (DEL).

Offset Table (Draft)

The OffsetTable is defined as follows (all numbers in hex):

Byte x

OffsetTable[x]

Coverage

00

n/a

<reserved for internal use>

01..63

x*80

half-blocks from U+0080 to U+3180

64..A3

x*80+AE00

half-blocks from U+E000 to U+FF80

A4..F9

<reserved>

<reserved for future extension>

FA

00C0

Latin1 letters + half of Extended A

FB

0250

Standard Phonetic

FC

0370

Greek

FD

0530

Armenian

FE

3040

Hiragana

FF

FF60

Half-width Katakana

This gets all the scripts that cross a half-block boundary, plus a particular useful segment of European characters. Some miscellaneous symbols and punctuation also cross, but they are unlikely to be (a) frequent, or (b) used in a solid run, and are so not included.