Unicode Technical Report #6
A Standard Compression Scheme for Unicode

Revision 1.0
Authors Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, and Asmus Freytag
Date 30 May 1997
This Version http://www.unicode.org/unicode/reports/tr6-10.html
Previous Version http://www.unicode.org/unicode/proposals/compress-16.html
Latest Version http://www.unicode.org/unicode/reports/tr6.html

This report presents the specifications of a compression scheme for Unicode and a sample implementation. The design phase is completed with the publication of this technical report, but we still welcome information regarding implementation experience or corrigenda.

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative, or as normative specification. Please mail corrigenda and other comments to unicore@unicode.org.


Compression Scheme for Unicode

Compressing Unicode text for transmission or storage, as mentioned in section 5.2 of The Unicode Standard, Version 2.0, is often useful. The traditional general-purpose data compression schemes (for example Huffman or LZW) are effective, but for best results they require considerable context. In the course of implementing Unicode, it became apparent that there is a need for a compression scheme that is efficient even for short strings. The compression scheme proposed here compresses Unicode text into a sequence of bytes by taking advantage of the characteristics of Unicode text. The resulting compressed sequence can be used on its own or as further input to a general-purpose file or disk-block based compression scheme. Combining the two achieves even better compression than either method alone.

Scope

The Standard Compression Scheme for Unicode will:

It does not attempt to avoid the use of control bytes (including NUL) in the compressed stream.

The compression scheme is mainly intended for use with short to medium length Unicode strings. The resulting compressed format is intended for storage or transmission in bandwidth-limited environments. It can be used stand-alone or as input to traditional general-purpose data compression schemes. It is not intended as a processing format or as a general-purpose interchange format.

Description

Strings in languages using small alphabets contain runs of characters that are coded close together in Unicode. These runs are typically interrupted only by punctuation characters, which are themselves coded in proximity to each other in Unicode (usually in the Basic Latin range).

The basic concept of the compression scheme is to set up a so-called dynamically positioned window, which is a region of 128 consecutive characters in Unicode. This window can be positioned to contain the alphabetic characters in question. Each character that fits this window is represented as a byte between 0x80 and 0xFF in the compressed data stream. A byte from 0x20 to 0x7F (as well as CR, LF, and TAB) always means a character from the Basic Latin range (or a control character).

Runs of characters from a selected window which are intermixed only with characters from the range U+0020--U+007F can be compressed without requiring tag bytes beyond the initial setup of the window.

Tag bytes are bytes in the range 0x00 to 0x1f (except CR, LF, TAB) that are used as commands to select, define and position windows, or to escape to an uncompressed stream of Unicode text. Strings from languages using large alphabets use this uncompressed mode.

There are scripts for which the characters ordinarily show larger fluctuation in code values than can be contained in a dynamically positioned window. For these areas of the Unicode code space, windows cannot be set. Instead, an escape to uncompressed Unicode can be used.

Encoders and Decoders

There is more than one possible encoding for a given Unicode string, and it is possible to trade off speed of encoding against the compression achieved.

It is possible to write a simple encoder for this scheme which uses a subset of the allowed tags. For example it could use only SCU, SD0, UQU and UC0 and still achieve respectable compression with typical text.
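
As a rough illustration of how small a conformant encoder can be, the following Python sketch never changes windows at all. It uses an even smaller tag subset than the one named above (only SQ0 and SQU, with tag values taken from Table X-6) plus the default position of window 0. The function name encode_minimal is hypothetical, and the sketch trades compression for simplicity: every character outside U+0000..U+00FF costs three bytes.

```python
def encode_minimal(text: str) -> bytes:
    """Encode text using only pass-through bytes, default window 0, SQ0 and SQU."""
    SQ0, SQU = 0x01, 0x0E                     # tag values from Table X-6
    pass_controls = {0x00, 0x09, 0x0A, 0x0D}  # NUL, HT, LF, CR pass through
    out = bytearray()
    data = text.encode("utf-16-be")           # 16-bit code units, MSB first
    for i in range(0, len(data), 2):
        hi, lo = data[i], data[i + 1]
        unit = (hi << 8) | lo
        if 0x20 <= unit <= 0x7F or unit in pass_controls:
            out.append(unit)                  # Basic Latin and pass-through controls
        elif unit < 0x20:
            out += bytes([SQ0, unit])         # control that collides with a tag byte
        elif unit <= 0xFF:
            out.append(unit)                  # window 0 at U+0080: byte equals code unit
        else:
            out += bytes([SQU, hi, lo])       # anything else is quoted as raw Unicode
    return bytes(out)
```

Because window 0 is never redefined, text restricted to U+0020..U+00FF comes out as one byte per character. Surrogate pairs are emitted as two separately quoted code units.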

Definitions

All terms not defined here shall be as defined in the Unicode Standard.
 
Single Byte Mode - a mode where each character is represented in compressed form as a single byte.
 
Unicode Mode - a mode where each character is represented by a two byte sequence, corresponding to the Unicode character value with the most significant byte first.
 
Window - a range of 128 consecutive Unicode character values.
 
Locking Shift - a permanent shift to a new active window.
 
Non-locking Shift - a non-locking shift selects a window only for the immediately following character, before returning to the active window.
 
Dynamically positioned Window - a window with a position that can be selected starting at a multiple of 128 or at one of several predefined locations. Dynamically positioned windows can be accessed by locking or non-locking shifts.
They are only used in single byte mode with bytes in the range 0x80 to 0xFF.
 
Static Window - a window with fixed position which can be accessed by non-locking shift only. They are used in single byte mode with bytes in the range 0x00 to 0x7F.
 
Tag byte - any of the predefined single byte values that select compression functions in this scheme.
 
Index byte - a byte that is used as an index into the offset table (e.g. to select a window offset).

Conformance

Decoders are required to accept and interpret the full range of tags and arguments defined here. The action of a conformant decoder on illegal or reserved input is undefined.

Conformant Encoders must not emit illegal or reserved combinations of bytes. Encoders are not required to utilize (or be able to utilize) all the features of this compression scheme. Encoders must be able to encode strings containing any valid sequence of Unicode characters. The action of a conformant encoder on malformed input is undefined.

Encoders and decoders must always start in the initial state defined below.

Compression

The Unicode Compression Scheme compresses text by defining a set of windows into the Unicode code space and interpreting byte values relative to the position of the window currently in force. Thus characters from languages that use a small alphabet can be encoded with one byte per character. By switching to Unicode mode, non-alphabetic scripts can be encoded with two bytes per character.

The compression scheme is capable of compressing strings containing any Unicode character. Some control character and private use character values overlap with the tag byte values. They can still be encoded, though at a cost of an additional byte per character.

There are two compression modes: single byte mode and Unicode mode.

(In the following text all byte values are given in hex.)

Single byte mode

Compressed text in single byte mode consists of a tag byte followed by zero, one, or two argument bytes followed by one or more text bytes. Single byte mode is in effect from initialization until the end of input or until an SCU tag.

An SCU tag indicates that all following bytes are interpreted in Unicode mode.

An SQU tag indicates that the following two bytes are interpreted as a Unicode character, most significant byte first.

In single byte mode, bytes between 00 and 1F are used as tags. The tags used in single byte mode are shown in Table X-1; their corresponding byte values are given in Table X-6.

Table X-1. Tags for use in Single-byte Mode

Name  Meaning               Arguments     Function

SQU   Quote Unicode         hbyte, lbyte  Quote Unicode character = (hbyte << 8) + lbyte.
                                          Used for isolated characters that do not fit in any of the current windows.
SCU   Change to Unicode                   Change to Unicode mode (locking shift).
                                          Used for runs of characters not part of a small alphabet.
SQn   Quote from Window n   byte          Non-locking shift to window n.
                                          If the byte is in the range 00 to 7F, use static window n.
                                          If the byte is in the range 80 to FF, use dynamically positioned window n.
SCn   Change to Window n                  Change to window n (locking shift).
                                          If the byte is in the range 20 to 7F, or CR, LF, HT, use static window 0.
                                          If the byte is in the range 80 to FF, use dynamically positioned window n.
SDn   Define Window n       byte          Define window position n as OffsetTable[byte], and change to window n.
SDX   Define Extended       hbyte, lbyte  Define window n in the expansion space and change to it.
                                          n = top 3 bits of hbyte.
                                          Window base = 10000 + (80 * remaining 13 bits of hbyte and lbyte).

Unicode Mode

In Unicode mode, each character is encoded by two bytes in the standard way with the most significant byte first. This mode has its own set of reserved byte values which are used as tags, as shown in Table X-2. Their corresponding byte values are given in Table X-7. Once selected by SCU, Unicode mode is in effect until the end of input, or until any tag that selects an active window.

Quoting in Unicode mode

Note that in Unicode mode all tags are single bytes. Therefore all bytes which are not tag bytes are the most significant bytes (MSB) of a Unicode character. Each reserved tag value collides with 256 Unicode characters.

A quoting mechanism is defined for Unicode mode to enable a character to be encoded whose first byte would collide with a tag value. The two bytes following a UQU tag are taken as a Unicode character.

The tag values used in Unicode mode are chosen so that they correspond to the most significant bytes of Unicode character values from the private use area, since private use characters are not in frequent use.
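
The quoting rule can be sketched from the encoder's side as follows. This is an illustrative Python fragment, not part of the scheme: the function name is hypothetical, and the tag range E0 to F2 and the UQU value F0 are taken from Table X-7.

```python
UQU = 0xF0  # "Quote Unicode" tag in Unicode mode (Table X-7)


def encode_unit_in_unicode_mode(unit: int) -> bytes:
    """Return the bytes for one 16-bit code unit while in Unicode mode."""
    hi, lo = unit >> 8, unit & 0xFF
    if 0xE0 <= hi <= 0xF2:
        # The most significant byte would be read as a tag (or as the
        # reserved value F2), so the code unit has to be quoted.
        return bytes([UQU, hi, lo])
    return bytes([hi, lo])
```

Only code units in the range U+E000..U+F2FF pay the extra byte; all others are written as two bytes, most significant byte first.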

Table X-2. Tags for use in Unicode mode

Name  Meaning               Arguments     Function

UQU   Quote Unicode         hbyte, lbyte  Quote a Unicode character.
                                          Used to quote tag bytes.
UCn   Change to Window n                  Change to single byte mode, window n (locking shift).
                                          If the byte is in the range 20 to 7F, or CR, LF, HT, use static window 0.
                                          If the byte is in the range 80 to FF, use dynamically positioned window n.
UDn   Define Window n       byte          Define window position n as OffsetTable[byte], and change to window n.
UDX   Define Extended       hbyte, lbyte  Define window n in the expansion space and change to it.
                                          n = top 3 bits of hbyte.
                                          Window base = 10000 + (80 * remaining 13 bits of hbyte and lbyte).

Windows

Windows are always 128 code positions in length. There are two kinds of windows, static (or fixed position) windows and dynamically positioned windows.

Dynamically Positioned Windows

There are 8 dynamically positioned windows that are used when compressing alphabetic text. Locking shift tags in the byte stream are used to select an active window, and other tags are used to redefine the position of any window. At initialization, the dynamically positioned windows are in their default positions given in Table X-5.

Locking-Shifts (Dynamically positioned windows only)

An SCn tag (or UCn tag in Unicode mode) is used for a locking shift to dynamically positioned window n. Following such a tag, bytes in the range 80 to FF represent characters in the active dynamically positioned window. Therefore any byte xx between 80 and FF encodes the Unicode character

Unicode character = DynamicOffset[n] + (xx - 80)

The values for the starting offsets of dynamically positioned windows can change. Their initial values are specified in Table X-5.

Bytes in the range 20 to 7F always represent the corresponding character from the Basic Latin block (U+0020 to U+007F). In addition, CR, LF and HT represent U+000D, U+000A and U+0009 respectively.
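
The mapping of a text byte in single byte mode can be sketched in Python as below. The function name and the choice to raise an exception on tag bytes are illustrative assumptions; NUL is treated as a pass-through byte in line with Table X-6.

```python
def decode_text_byte(byte: int, active_offset: int) -> int:
    """Map one non-tag byte to a Unicode value, given the active window's offset."""
    if byte >= 0x80:
        # Upper half: position within the active dynamically positioned window.
        return active_offset + (byte - 0x80)
    if byte >= 0x20 or byte in (0x00, 0x09, 0x0A, 0x0D):
        # Lower half: Basic Latin, or NUL, HT, LF, CR passed straight through.
        return byte
    raise ValueError(f"{byte:02X} is a tag byte, not a text byte")
```

With a window positioned at U+0400, for example, decode_text_byte(0x9C, 0x0400) yields 0x041C, while decode_text_byte(0x41, 0x0400) still yields 0x0041.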

Window Positioning

An SDn tag (or UDn tag) followed by an index byte repositions window n and makes it the active window.

In order to keep the encoding compact, the positions of the dynamically positioned windows are not set directly but defined via a lookup table. Each window definition tag in the byte stream is followed by one byte that is used as an index into this table. The set of legal positions is defined by the Window Offset Table given in Table X-3.

The first part of the Window Offset Table defines half blocks covering the alphabetic scripts, symbols and the private use area.

The individual entries from F9 onwards cover the scripts that cross a half-block boundary, plus one useful segment of European characters. Some collections of miscellaneous symbols and punctuation would also cross half-block boundaries, but these characters are likely to occur rarely, or in isolation. Therefore no special offsets for them are included here.

Table X-3. Window Offset Table

Byte x  OffsetTable[x]  Comment

00      reserved        reserved for internal use
01..67  x*80            half-blocks from U+0080 to U+3380
68..A7  x*80+AC00       half-blocks from U+E000 to U+FF80
A8..F8  reserved        reserved for future use
F9      00C0            Latin1 letters + half of Extended-A
FA      0250            IPA Extensions
FB      0370            Greek
FC      0530            Armenian
FD      3040            Hiragana
FE      30A0            Katakana
FF      FF60            Halfwidth Katakana
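
The table lookup can be written as a small Python function. This is a sketch: the names are hypothetical, and the constant added in the second range is chosen so that index 68 lands on U+E000 and index A7 on U+FF80, as the comment column states.

```python
SPECIAL_OFFSETS = {
    0xF9: 0x00C0,  # Latin1 letters + half of Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_offset(index: int) -> int:
    """Return OffsetTable[index], or raise ValueError for reserved indices."""
    if 0x01 <= index <= 0x67:
        return index * 0x80              # half-blocks U+0080..U+3380
    if 0x68 <= index <= 0xA7:
        return index * 0x80 + 0xAC00     # half-blocks U+E000..U+FF80
    if index in SPECIAL_OFFSETS:
        return SPECIAL_OFFSETS[index]
    raise ValueError(f"reserved window index {index:02X}")
```

Raising an exception for reserved indices is a design choice of this sketch; the scheme itself leaves decoder behavior on reserved input undefined.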

Extended Windows

An SDX tag (or UDX tag in Unicode mode) followed by two argument bytes (hbyte and lbyte) defines window n in the expansion space and makes it the active window. The window index n is given by the top 3 bits of hbyte. The window offset is calculated from the remaining thirteen bits of hbyte and lbyte as follows:

offset = 10000 + (80 * ((hbyte & 1F) * 100 + lbyte))

where & is the bitwise AND operator and all values are in hexadecimal notation. After an extended window is defined, each subsequent byte in the range 80 to FF represents a surrogate pair.

The following diagram shows how the bits in the two bytes following the SDX (or UDX) and a subsequent data byte map onto the bits in the resulting surrogate pair.

     hbyte         lbyte          data
    nnnwwwww      zzzzzyyy      1xxxxxxx

     high-surrogate     low-surrogate
    110110wwwwwzzzzz   110111yyyxxxxxxx
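
The same computation, sketched in Python. The function names are hypothetical. The mask keeps the five low bits of hbyte, which together with the eight bits of lbyte give the thirteen bits described above.

```python
def define_extended(hbyte: int, lbyte: int) -> tuple:
    """Return (window index n, window offset) for the SDX/UDX argument bytes."""
    n = hbyte >> 5                                   # top 3 bits of hbyte
    thirteen_bits = ((hbyte & 0x1F) << 8) | lbyte    # remaining 5 + 8 bits
    return n, 0x10000 + 0x80 * thirteen_bits


def to_surrogate_pair(value: int) -> tuple:
    """Split a value in the range 10000..10FFFF into (high, low) surrogates."""
    v = value - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)
```

With both argument bytes set to FF, define_extended returns window 7 at offset 10FF80, and a following data byte FF selects 10FF80 + 7F = 10FFFF, that is the surrogate pair DBFF DFFF.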

Non-locking Shifts And Static Windows

An SQn tag switches temporarily to a different window for just one character. The byte following the tag is interpreted relative to the window n, and then the window reverts to the previous value. This is called a non-locking shift.

If the byte following the SQn is in the range 80 to FF, dynamically positioned window n is used.

Static Windows

There are 8 static windows which are used only in conjunction with non-locking shifts. If the byte following an SQn tag is in the range 00 to 7F, static window n is used. Therefore byte xx between 00 and 7F encodes the Unicode character

Unicode character = StartingOffset[n] + xx

The positions of static windows are as given in Table X-4 and cannot be changed. They cover character ranges which contain characters that tend to occur in isolation and therefore are suitable for access via non-locking shifts.

Table X-4. Static Window Positions

Window  Starting Offset  Major Area Covered

0       0000             (for quoting of tags used in single-byte mode)
1       0080             Latin-1 Supplement
2       0100             Latin Extended-A
3       0300             Combining Diacritical Marks
4       2000             General Punctuation
5       2080             Currency Symbols
6       2100             Letterlike Symbols and Number Forms
7       3000             CJK Symbols & Punctuation
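
A non-locking shift can be sketched in Python as follows. The function name is hypothetical, and the caller is assumed to pass in the current positions of the eight dynamically positioned windows.

```python
STATIC_OFFSETS = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]


def decode_quoted(n: int, byte: int, dynamic_offsets: list) -> int:
    """Decode the single byte that follows an SQn tag."""
    if byte < 0x80:
        return STATIC_OFFSETS[n] + byte            # static window n
    return dynamic_offsets[n] + (byte - 0x80)      # dynamically positioned window n
```

The active window is not touched, so the byte after the quoted character is again interpreted relative to whichever window was active before the SQn tag.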

Use of SQ0

SQ0 is used specifically to quote characters that would otherwise collide with tag bytes. It may not be used with bytes in the range 20 to 7F. These values shall not be used by encoders. Decoders are not required to detect them as errors.

Note that this restriction applies only to SQ0, which maps to ASCII. SQ1 to SQ7 may be followed by any byte value.

As in the general case of SQn, a following byte value in the range 80 to FF indicates use of dynamically positioned window 0.

Initial State

The initial state of encoder and decoder is as follows:

  1. single byte mode
  2. locking-shift
  3. window 0 as the active window
  4. all windows in their default positions

Note: It is expected that encoder and decoder are reinitialized at the beginning of each string.

Initial Window Settings

Encoder and Decoder are initialized with certain default settings for the windows. These allow use of the windows without predefining them, saving a few bytes for common cases.

Encoder and Decoder always start with dynamically-positioned window 0 active, so a string of characters that consists entirely of characters from the range U+0020--U+00FF plus CR, LF, TAB is effectively converted to ISO 8859-1.

Default positions are assigned based on the following criteria:

The choice of offsets is intended to enable handling most languages by requiring at most the definition of one extra window, at the cost of a single byte.

The default settings of the dynamically positioned windows are shown in Table X-5. The static window positions are fixed and are shown above in Table X-4.

Table X-5. Default Positions for Dynamically Positioned Windows

Window  Starting Offset  Major Area Covered

0       0080             Latin-1 Supplement
1       00C0             (combined partial Latin-1 Supplement/Latin Extended-A)
2       0400             Cyrillic
3       0600             Arabic
4       0900             Devanagari
5       3040             Hiragana
6       30A0             Katakana
7       FF00             Fullwidth ASCII
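
One possible way to hold the initial state in code is sketched below in Python. The class and field names are hypothetical. Creating a fresh object for each string corresponds to the expectation that encoder and decoder are reinitialized at the beginning of each string.

```python
from dataclasses import dataclass, field

# Default positions from Table X-5, indexed by window number.
DEFAULT_DYNAMIC_OFFSETS = (0x0080, 0x00C0, 0x0400, 0x0600,
                           0x0900, 0x3040, 0x30A0, 0xFF00)


@dataclass
class ScsuState:
    unicode_mode: bool = False    # start in single byte mode
    active_window: int = 0        # window 0 is the active window
    # Each state gets its own mutable copy of the default positions.
    dynamic_offsets: list = field(
        default_factory=lambda: list(DEFAULT_DYNAMIC_OFFSETS))
```

Keeping the defaults in an immutable tuple and copying them per state avoids one string's window definitions leaking into the next.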

Notes

Surrogate Pairs

A surrogate pair can be encoded in any of these ways:

  1. in Unicode mode, as a surrogate pair.
  2. in Single byte mode, as a surrogate pair, with each value quoted: SQU hbyte1 lbyte1 SQU hbyte2 lbyte2.
  3. in Single byte mode, as a single byte, by setting a dynamically positioned window to the appropriate position using an SDX or UDX tag.
  4. or any otherwise legal combination of 1.) and 2.).

It is not possible to set a window to the surrogate range, such that one byte would represent one half of a surrogate pair.

Private Use Area

A character in the Private Use Area can be encoded in any of these ways:

Tag Allocation

The tag byte values used in single byte mode are shown in Table X-6. In this table, "pass" means that the byte value (xx) represents the Unicode value U+00xx.

Table X-6. Single Mode Tag Values

Name       Value    Comment

pass       00       NUL
SQ0 - SQ7  01 - 08
pass       09       HT
pass       0A       LF
SDX        0B
reserved   0C       reserved for future use
pass       0D       CR
SQU        0E
SCU        0F
SC0 - SC7  10 - 17
SD0 - SD7  18 - 1F
pass       20 - 7F
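
Table X-6 translates directly into a byte classifier. The Python sketch below uses a hypothetical function name and returns the tag names as strings. Bytes from 80 to FF are not tags and are reported as "window", since they select a character in the active dynamically positioned window.

```python
def classify_single_mode(byte: int) -> str:
    """Name the role of a byte encountered in single byte mode."""
    if byte >= 0x80:
        return "window"                       # text byte in the active window
    if byte >= 0x20 or byte in (0x00, 0x09, 0x0A, 0x0D):
        return "pass"                         # represents U+00xx directly
    if 0x01 <= byte <= 0x08:
        return f"SQ{byte - 0x01}"
    if 0x10 <= byte <= 0x17:
        return f"SC{byte - 0x10}"
    if 0x18 <= byte <= 0x1F:
        return f"SD{byte - 0x18}"
    # Only 0B, 0C, 0E and 0F remain at this point.
    return {0x0B: "SDX", 0x0C: "reserved", 0x0E: "SQU", 0x0F: "SCU"}[byte]
```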

The tag byte values used in Unicode mode are shown in Table X-7. In this table MSB means that the byte value is used as the most significant byte of a two byte sequence representing a Unicode value. There are no restrictions on the values of the byte immediately following an MSB.

Table X-7. Unicode Mode Tag Values

Name       Value    Comment

MSB        00 - DF  Start of a Unicode character
UC0 - UC7  E0 - E7
UD0 - UD7  E8 - EF
UQU        F0
UDX        F1
reserved   F2       reserved for future use
MSB        F3 - FF  Start of a Unicode character
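
Tables X-3 to X-7 together are enough to sketch a complete decoder. The Python code below is an illustration under stated assumptions, not the sample implementation that accompanies this report: all names are hypothetical, reserved input raises ValueError (the scheme leaves this behavior undefined), and truncated input surfaces as StopIteration.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}


def offset_for(index: int) -> int:
    """Window Offset Table lookup (Table X-3)."""
    if 0x01 <= index <= 0x67:
        return index * 0x80
    if 0x68 <= index <= 0xA7:
        return index * 0x80 + 0xAC00
    if index in SPECIAL:
        return SPECIAL[index]
    raise ValueError(f"reserved window index {index:02X}")


def decode(data: bytes) -> str:
    dynamic = list(DEFAULT_DYNAMIC)    # initial state: default window positions,
    active, unicode_mode = 0, False    # window 0 active, single byte mode
    units = []                         # decoded 16-bit code units
    it = iter(data)

    def emit(value: int) -> None:
        # Values from extended windows lie above FFFF and become surrogate pairs.
        if value >= 0x10000:
            value -= 0x10000
            units.extend((0xD800 | (value >> 10), 0xDC00 | (value & 0x3FF)))
        else:
            units.append(value)

    def define_extended() -> int:
        hbyte, lbyte = next(it), next(it)
        n = hbyte >> 5
        dynamic[n] = 0x10000 + 0x80 * (((hbyte & 0x1F) << 8) | lbyte)
        return n

    for b in it:
        if unicode_mode:
            if 0xE0 <= b <= 0xE7:                  # UCn
                active, unicode_mode = b - 0xE0, False
            elif 0xE8 <= b <= 0xEF:                # UDn
                active, unicode_mode = b - 0xE8, False
                dynamic[active] = offset_for(next(it))
            elif b == 0xF0:                        # UQU
                hbyte, lbyte = next(it), next(it)
                units.append((hbyte << 8) | lbyte)
            elif b == 0xF1:                        # UDX
                active, unicode_mode = define_extended(), False
            elif b == 0xF2:
                raise ValueError("reserved tag F2")
            else:                                  # MSB of a plain code unit
                units.append((b << 8) | next(it))
        elif b >= 0x80:                            # byte in the active window
            emit(dynamic[active] + (b - 0x80))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            units.append(b)                        # pass
        elif 0x01 <= b <= 0x08:                    # SQn
            n, arg = b - 0x01, next(it)
            emit(STATIC[n] + arg if arg < 0x80 else dynamic[n] + (arg - 0x80))
        elif b == 0x0B:                            # SDX
            active = define_extended()
        elif b == 0x0E:                            # SQU
            hbyte, lbyte = next(it), next(it)
            units.append((hbyte << 8) | lbyte)
        elif b == 0x0F:                            # SCU
            unicode_mode = True
        elif 0x10 <= b <= 0x17:                    # SCn
            active = b - 0x10
        elif 0x18 <= b <= 0x1F:                    # SDn
            active = b - 0x18
            dynamic[active] = offset_for(next(it))
        else:
            raise ValueError("reserved tag 0C")
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be", "surrogatepass")
```

For instance, decode(bytes.fromhex("D6 6C 20 66 6C 69 65 DF 74")) returns the nine-character string "Öl fließt": every byte is either Basic Latin or falls in window 0 at its default position, so no tag is needed.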

Examples (informative)

German

German can be written using only Basic Latin and the Latin-1 supplement, so all characters above 0x0080 use the default position of dynamically positioned window 0.

Unicode values (9 characters):

00D6 006C 0020 0066 006C 0069 0065 00DF 0074

Compressed (9 bytes):

D6 6C 20 66 6C 69 65 DF 74

Russian

Russian can use the default position of window 2. The first byte of the compressed data is the tag SC2.

Unicode values (6 characters):

041C 043E 0441 043A 0432 0430

Compressed (7 bytes):

12 9C BE C1 BA B2 B0

All Features

The following sample compressed string contains all the features of the compression scheme, but limited to only one instance of the eight SQn and the seventeen SCn/UCn, SDn/UDn, and SDX/UDX pairs. (As appropriate for a contrived example, the string actually expands.)

Unicode values (10 characters)

0041 00DF 0401 015F 00DF DFDF 01DF F000 DBFF DFFF

Compressed (24 bytes)

41 DF 11 DF 03 01 03 DF 19 01 DF 0E DF DF 0F 01 DF F0 F0 00 F1 FF FF FF


Possible Private Extensions (informative)

During the design and review phase of the compression scheme, extensions were repeatedly suggested to handle the two following situations. Since these extensions were not accepted as part of the compression scheme itself, it was felt useful to document them here. This will show how to solve certain problems by adding higher level protocols.

1. Avoiding (most) control byte values

After encoding, replace any control byte by DLE (0x10) followed by the same byte + 0x40. NUL becomes DLE followed by '@' (0x40). DLE is replaced by DLE followed by U+0050. Before decoding, perform the opposite transformation.
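
A minimal Python sketch of this transformation follows. It assumes that "control byte" means any byte from 00 to 1F (so DEL and bytes above 7F are left alone), and the function names are hypothetical.

```python
DLE = 0x10


def escape_controls(data: bytes) -> bytes:
    """Apply after encoding: replace each byte below 20 by DLE, byte + 40."""
    out = bytearray()
    for b in data:
        if b < 0x20:
            out += bytes([DLE, b + 0x40])   # NUL -> DLE '@', DLE -> DLE 'P'
        else:
            out.append(b)
    return bytes(out)


def unescape_controls(data: bytes) -> bytes:
    """Apply before decoding: undo escape_controls."""
    out = bytearray()
    it = iter(data)
    for b in it:
        # The byte following a DLE carries the original value plus 40.
        out.append(next(it) - 0x40 if b == DLE else b)
    return bytes(out)
```

After escaping, DLE is the only control byte value left in the stream, which is why the heading says "(most)".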

2. Handling runs of the same character

Before encoding, replace any run of 4 or more identical Unicode characters by '@' (U+0040), followed by the character to repeat, followed by a 16-bit count (packed into one Unicode character). For example, a run of 33 hyphens becomes '@' '-' '!', since '!' is U+0021 (decimal 33). The @ sign itself is replaced by @@U+0001. After decoding, perform the reverse operation.
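
The run-length step might look as follows in Python. This is a sketch under several assumptions: it works on Python code points rather than 16-bit code units, it splits runs longer than FFFF, and it escapes every run of '@' regardless of length so that the decoder can always treat '@' as the start of a triple. The function names are hypothetical.

```python
def rle_encode(text: str) -> str:
    """Apply before encoding: collapse runs into '@', character, count."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        while i + run < len(text) and text[i + run] == ch and run < 0xFFFF:
            run += 1
        if run >= 4 or ch == "@":
            out.append("@" + ch + chr(run))   # count packed into one character
        else:
            out.append(ch * run)              # short runs are copied unchanged
        i += run
    return "".join(out)


def rle_decode(text: str) -> str:
    """Apply after decoding: expand each '@', character, count triple."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "@":
            out.append(text[i + 1] * ord(text[i + 2]))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Note that a count can take any value up to FFFF, including control character and surrogate values, so the output of this step is not necessarily well-formed text on its own.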


Acknowledgments

The authors would like to thank Dr. Laura Wideburg for assistance in copy editing.

Authors

The original concept of a standard compression scheme for Unicode was implemented at Reuters and proposed by Misha Wolf and Charles Wicksteed. Extensions and refinements were proposed by Mark Davis, Ken Whistler and Martin Duerst. The final text for the Technical Report and the sample implementations were created by Asmus Freytag.

Changes from previous revisions:

Added a pointer to the sample implementation.

Copyright

Copyright 1996-1997 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

The accompanying sample implementation is provided as is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.