Proposed Update Unicode Technical Standard #6
L2/04-020

A Standard Compression Scheme for Unicode

Version	3.6
Authors	Misha Wolf, Ken Whistler, Charles Wicksteed, Mark Davis, Asmus Freytag, and Markus Scherer
Date	2004-01-02
This Version	http://www.unicode.org/reports/tr6/tr6-3.6.html (http://www.mindspring.com/~markus.scherer/unicode/tr6/tr6-3.6-20040102.html)
Previous Version	http://www.unicode.org/reports/tr6/tr6-3.5.html
Latest Version	http://www.unicode.org/reports/tr6/
Base Unicode Version	Unicode 2.0.0

Summary

This report presents the specifications of a compression scheme for Unicode and a sample implementation.

Status

This document is a proposed update of a previously approved Unicode Technical Standard. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in [References]. The latest version of the Unicode Standard is found on [Unicode]. A list of current Unicode Technical Reports is found on [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Scope

2 Description

2.1 Compression Scheme for Unicode

2.2 Encoders and Decoders

5 Compression

5.2 Unicode Mode

5.2.1 Quoting in Unicode Mode

6 Windows

6.1 Dynamically Positioned Windows

6.1.1 Locking Shifts (Dynamically Positioned Windows Only)

6.1.2 Window Positioning

6.1.3 Extended Windows

6.2 Non-locking Shifts and Static Windows

6.2.1 Static Windows

6.2.2 Use of SQ0

7 Initial State

7.1 Initial Window Settings

8 Notes

8.1 Surrogate Pairs

8.2 Private Use Area

8.3 Tag Allocation

8.4 Signature Byte Sequence for SCSU

8.5 Worst Case Behavior for SCSU

9 Examples

10 Possible Private Extensions

10.1 Avoiding Control Byte Values

10.2 Handling Runs of the Same Characters

1 Scope

The Standard Compression Scheme for Unicode will:

express all code points in Unicode
approximate the storage size of traditional character sets
work well for short strings
provide transparency for characters in the range U+0020..U+00FF, as well as CR, LF and TAB
support very simple decoders
support simple as well as sophisticated encoders

It does not attempt to avoid the use of control bytes (including NUL) in the compressed stream, and does not attempt to preserve binary ordering of strings.

The compression scheme is mainly intended for use with short to medium length Unicode strings. The resulting compressed format is intended for storage or transmission in bandwidth limited environments. It can be used stand-alone or as input to traditional general purpose data compression schemes. It is not intended as a processing format or as a general purpose interchange format.

2 Description

The following description is stated as an encoding of a sequence of Unicode characters as a compressed stream of bytes. It is therefore independent of, for example, whether the uncompressed data is encoded as UTF-8, UTF-16 or UTF-32 (aka UCS-4 in ISO 10646). If the compressed data consists of the same sequence of bytes, it represents the same sequence of characters. The reverse is not true: there are multiple ways of compressing any character sequence.

While the description uses the term character throughout, no limitation to assigned characters is implied; in other words, SCSU is, strictly speaking, defined in terms of code points.

2.1 Compression Scheme for Unicode

Compressing Unicode text for transmission or storage, as mentioned in section 5.2 of The Unicode Standard, Version 2.0, is often useful. The traditional general purpose data compression schemes (for example Huffman or LZW) are effective, but for best results they require considerable context. In the course of implementing Unicode, it became apparent that there is a need for a compression scheme that is efficient even for short strings. The compression scheme proposed here compresses Unicode text into a sequence of bytes by taking advantage of the characteristics of Unicode text. The resulting compressed sequence can be used on its own or as further input to a general purpose file or disk-block based compression scheme. The latter achieves even better compression than either method alone.

Strings in languages using small alphabets contain runs of characters that are coded close together in Unicode. These runs are typically interrupted only by punctuation characters, which are themselves coded in proximity to each other in Unicode (usually in the Basic Latin range).

The basic concept of the compression scheme is to set up a so-called dynamically positioned window, which is a region of 128 consecutive characters in Unicode. This window can be positioned to contain the alphabetic characters in question. Each character that fits this window is represented as a byte between 0x80 and 0xFF in the compressed data stream, while any character from the Basic Latin range (as well as CR, LF, and TAB) is represented by a byte in the range 0x20 to 0x7F (or by 0x0D, 0x0A or 0x09, respectively).

Runs of characters from a selected window which are intermixed only with characters from the range U+0020..U+007F can be compressed without requiring tag bytes beyond the initial setup of the window.

Tag bytes are bytes in the range 0x00 to 0x1F (except CR, LF, TAB) that are used as commands to select, define and position windows, or to escape to an uncompressed stream of Unicode text. Strings from languages using large alphabets use this uncompressed mode.

There are scripts for which the characters ordinarily show larger fluctuation in code values than can be contained in a dynamically positioned window. For these areas of the Unicode code space, windows cannot be set. Instead, an escape to uncompressed Unicode can be used.

2.2 Encoders and Decoders

There is more than one possible encoding for a given Unicode string, and it is possible to trade off speed of encoding against the compression achieved.

It is possible to write a simple encoder for this scheme which uses a subset of the allowed tags. For example it could use only SCU, SD0, UQU and UC0 and still achieve respectable compression with typical text. See Section 8.7 Minimal Encoder for further discussion and sample code.

Encoders should follow the recommendations in Section 8.6 XML Suitability so that they can be used to encode XML, HTML and similar document formats.

2.3 Limitations

SCSU does not attempt to avoid the use of control bytes (including NUL) in the compressed stream. It is sometimes possible to escape control characters in the manner of Section 10.1, but this requires an additional agreement between sender and receiver.

SCSU also does not attempt to preserve the binary ordering of strings, and is not MIME compatible, which limits its attractiveness as a processing format, particularly in databases, or as a general purpose interchange format, respectively. If these features are required, a different compression scheme, such as [BOCU], could be employed.

3 Definitions

All terms not defined here shall be as defined in the Unicode Standard or in the online [Glossary].

Single Byte Mode - a mode where each character is represented in compressed form as a single byte.

Unicode Mode - a mode where each character is represented by big-endian UTF-16.

Window - a range of 128 consecutive Unicode character values.

Locking Shift - a permanent shift to a new active window.

Non-locking Shift - a non-locking shift selects a window only for the immediately following character, before returning to the active window.

Dynamically positioned Window - a window with a position that can be selected starting at a multiple of 128 or at one of several predefined locations. Dynamically positioned windows can be accessed by locking or non-locking shifts. They are only used in single byte mode with bytes in the range 0x80 to 0xFF.

Static Window - a window with fixed position which can be accessed by non-locking shift only. They are used in single byte mode with bytes in the range 0x00 to 0x7F.

Tag byte - any of the predefined single byte values that select compression functions in this scheme.

Index byte - a byte that is used as an index into the offset table (e.g. to select a window offset).

Supplementary code space - the code space accessed by surrogate pairs in UTF-16.

4 Conformance

Decoders are required to accept and interpret the full range of tags and arguments defined here. The action of a conformant decoder on illegal or reserved input is undefined.

Conformant Encoders must not emit illegal or reserved combinations of bytes. Encoders are not required to utilize (or be able to utilize) all the features of this compression scheme. Encoders must be able to encode strings containing any valid sequence of Unicode characters. The action of a conformant encoder on malformed input is undefined.

Encoders and decoders must always start in the initial state defined below. Encoders must remain in Single Byte Mode at least until the first code point is encountered that is not U+0000 (NUL), U+0009 (HT), U+000A (LF), U+000D (CR), or U+0020..U+00FF (Latin-1), or an initial U+FEFF. See Section 8.4 Signature Byte Sequence for SCSU and section 8.6 XML Suitability.

5 Compression

The Unicode Compression Scheme compresses text by defining a set of windows into the Unicode code space and interpreting byte values relative to the position of the window currently in force. Thus characters from languages that use a small alphabet can be encoded with one byte per character. By switching to Unicode mode, non-alphabetic scripts can be encoded with two bytes per character on the BMP or four bytes per supplementary character.

The compression scheme is capable of compressing strings containing any Unicode character. Some control character and private use character values overlap with the tag byte values. They can still be encoded, though at a cost of an additional byte per character.

There are two compression modes:

single byte mode, where each byte represents one character and is interpreted according to the current window setting.
Unicode mode, where each character is represented as big-endian UTF-16.

(In the following text all byte values are given in hex.)

5.1 Single Byte Mode

Compressed text in single byte mode consists of a tag byte followed by zero, one, or two argument bytes followed by one or more text bytes. Single byte mode is in effect from initialization until the end of input or until an SCU tag. An SCU tag indicates that all following bytes are interpreted in Unicode mode as big-endian UTF-16. An SQU tag indicates that the following two bytes are interpreted as a sixteen bit Unicode BMP character, most significant byte first.

In single byte mode, bytes between 00 and 1F are used as tags. The tags used in single byte mode are shown in Table 1; their corresponding byte values are given in Table 6.

Table 1. Tags for use in Single-byte Mode

Name	Meaning	Arguments	Function
SQU	Quote Unicode	hbyte, lbyte	Quote Unicode character = (hbyte << 8) + lbyte. Used for isolated characters from the BMP that do not fit in any of the current windows.
SCU	Change to Unicode		Change to UTF-16 mode (locking shift). Used for runs of characters not part of a small alphabet.
SQn	Quote from Window n	byte	Non-locking shift to window n. If the byte is in the range 00 to 7F, use static window n. If the byte is in the range 80 to FF, use dynamically positioned window n.
SCn	Change to Window n		Change to window n (locking shift). Use static window 0 for all following bytes that are in the range 20 to 7F, or CR, LF, HT. Use dynamically positioned window n for all following bytes that are in the range 80 to FF.
SDn	Define Window n	byte	Define window position n as OffsetTable[byte], and change to window n.
SDX	Define Extended	hbyte, lbyte	Define window n in the supplementary code space and change to it. n = top 3 bits of hbyte. Window base = 10000 + (80 * remaining 13 bits of hbyte and lbyte).

5.2 Unicode Mode

In Unicode mode, each character is encoded by two or four bytes as big-endian UTF-16, i.e. with the most significant byte first. This mode has its own set of reserved byte values which are used as tags, as shown in Table 2. Their corresponding byte values are given in Table 6. Once selected by SCU, Unicode mode is in effect until the end of input, or until any tag that selects an active window.

5.2.1 Quoting in Unicode mode

Note that in Unicode mode all tags are single bytes. Therefore all bytes which are not tag bytes are the most significant bytes (MSB) of a Unicode character. Each reserved tag value collides with 256 Unicode characters. A quoting mechanism is defined for Unicode mode to enable a character to be encoded whose first byte would collide with a tag value. The two bytes following a UQU tag are taken as a Unicode character on the BMP. The tag values used in Unicode mode are chosen so that they correspond to the most significant bytes of Unicode character values from the private use area, since private use characters are not in frequent use.

Table 2. Tags for use in Unicode mode

Name	Meaning	Arguments	Function
UQU	Quote Unicode	hbyte, lbyte	Quote a Unicode BMP character. Used to quote tag bytes.
UCn	Change to Window n		Change to single mode, window n (locking shift). Use static window 0 for all following bytes that are in the range 20 to 7F, or CR, LF, HT. Use dynamically positioned window n for all following bytes that are in the range 80 to FF.
UDn	Define Window n	byte	Define window position n as OffsetTable[byte], and change to window n.
UDX	Define Extended	hbyte, lbyte	Define window n in the supplementary code space and change to it. n = top 3 bits of hbyte. Window base = 10000 + (80 * remaining 13 bits of hbyte and lbyte).

6 Windows

Windows are always 128 code positions in length. There are two kinds of windows, static (or fixed position) windows and dynamically positioned windows.

6.1 Dynamically Positioned Windows

There are 8 dynamically positioned windows that are used when compressing alphabetic text. Locking shift tags in the byte stream are used to select an active window, and other tags are used to redefine the position of any window. At initialization, the dynamically positioned windows are in their default positions given in Table 5.

6.1.1 Locking Shifts (Dynamically positioned windows only)

An SCn tag (or UCn tag in Unicode mode) is used for a locking shift to dynamically positioned window n. Following such a tag, bytes in the range 80 to FF represent characters in the active dynamically positioned window. Therefore any byte xx between 80 and FF encodes the Unicode character

Unicode character = DynamicOffset[n] + (xx - 80)

The values for the starting offsets of dynamically positioned windows can change. Their initial values are specified in Table 5. Bytes in the range 20 to 7F always represent the corresponding character from the Basic Latin block (U+0020 to U+007F). In addition, LF, CR and HT represent U+000A, U+000D and U+0009 respectively.

6.1.2 Window Positioning

An SDn tag (or UDn tag) followed by an index byte repositions window n and makes it the active window. In order to keep the encoding compact, the positions of the dynamically positioned windows are not set directly but defined via a lookup table. Each window definition tag in the byte stream is followed by one byte that is used as an index into this table. The set of legal positions is defined by the Window Offset Table given in Table 3.

The first part of the Window Offset Table defines half blocks covering the alphabetic scripts, symbols and the private use area. The individual entries from F9 onwards cover the scripts that cross a half-block boundary, plus one useful segment of European characters. Some collections of miscellaneous symbols and punctuation would also cross half-block boundaries, but these characters are likely to occur rarely, or in isolation. Therefore no special offsets for them are included here.

Table 3. Window Offset Table

Byte x	OffsetTable[x]	Comment
00	reserved	reserved for internal use
01..67	x*80	half-blocks from U+0080 to U+3380
68..A7	x*80+AC00	half-blocks from U+E000 to U+FF80
A8..F8	reserved	reserved for future use
F9	00C0	Latin1 letters + half of Extended-A
FA	0250	IPA Extensions
FB	0370	Greek
FC	0530	Armenian
FD	3040	Hiragana
FE	30A0	Katakana
FF	FF60	Halfwidth Katakana
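
As an informal illustration, Table 3 can be expressed as a small lookup function. This is a sketch only: the name `window_offset` and the choice to raise an error for reserved index bytes are assumptions made here, since the action of a decoder on reserved input is undefined.

```python
# Fixed entries F9..FF of the Window Offset Table (Table 3).
SPECIAL_OFFSETS = {
    0xF9: 0x00C0,  # Latin1 letters + half of Extended-A
    0xFA: 0x0250,  # IPA Extensions
    0xFB: 0x0370,  # Greek
    0xFC: 0x0530,  # Armenian
    0xFD: 0x3040,  # Hiragana
    0xFE: 0x30A0,  # Katakana
    0xFF: 0xFF60,  # Halfwidth Katakana
}


def window_offset(x: int) -> int:
    """Return OffsetTable[x] for an index byte x (hypothetical helper)."""
    if 0x01 <= x <= 0x67:
        return x * 0x80            # half-blocks U+0080..U+3380
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00   # half-blocks U+E000..U+FF80
    if x in SPECIAL_OFFSETS:
        return SPECIAL_OFFSETS[x]
    # 00 and A8..F8 are reserved; raising is a choice of this sketch.
    raise ValueError(f"reserved index byte {x:#04x}")
```

For instance, the index byte A5 yields the window starting at U+FE80, which is the window used by the SDn forms listed in Section 8.4.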

6.1.3 Extended Windows

An SDX tag (or UDX tag in Unicode mode) followed by two argument bytes (hbyte and lbyte) defines window n in the supplementary code space and makes it the active window. The window index n is given by the top 3 bits of hbyte. The window offset is calculated from the remaining thirteen bits of hbyte and lbyte as follows:

offset = 10000 + (80 * ((hbyte & 1F) * 100 + lbyte))

where & is the bitwise AND operator and all values are in hexadecimal notation. After an extended window is defined each subsequent byte in the range 80 to FF represents a character from the supplementary code space.

For example, when decoding SCSU into UTF-16, the bits in the two argument bytes following the SDX (or UDX) and a subsequent data byte map onto the bits in the resulting surrogate pair as shown in the following diagram.

     hbyte         lbyte          data
    nnnwwwww      zzzzzyyy      1xxxxxxx

     high-surrogate     low-surrogate
    110110wwwwwzzzzz   110111yyyxxxxxxx
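
The offset formula and the bit mapping above can be checked with a short sketch. The function names `decode_sdx` and `to_surrogates` are illustrative and not part of this specification.

```python
def decode_sdx(hbyte: int, lbyte: int) -> tuple:
    """Split the SDX/UDX arguments into (window index n, window offset)."""
    n = hbyte >> 5                                   # top 3 bits of hbyte
    offset = 0x10000 + 0x80 * (((hbyte & 0x1F) << 8) | lbyte)
    return n, offset


def to_surrogates(code_point: int) -> tuple:
    """Convert a supplementary code point to a UTF-16 surrogate pair."""
    v = code_point - 0x10000
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


# SDX with hbyte = 1F and lbyte = FF defines window 0 at 10FF80.
n, offset = decode_sdx(0x1F, 0xFF)
# A following data byte FF then selects U+10FFFF.
code_point = offset + (0xFF - 0x80)
```

With these inputs, `to_surrogates(code_point)` gives the pair DBFF DFFF, which is the surrogate pair that appears in the sample of Section 9.4.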

6.2 Non-locking Shifts and Static Windows

An SQn tag switches temporarily to a different window for just one character. The byte following the tag is interpreted relative to the window n, and then the window reverts to the previous value. This is called a non-locking shift. If the byte following the SQn is in the range 80 to FF, dynamically positioned window n is used.

6.2.1 Static Windows

There are 8 static windows, seven of which are used only in conjunction with non-locking shifts. If any data byte following an SQn tag is in the range 00 to 7F, static window n is used. Therefore byte xx between 00 and 7F encodes the Unicode character

Unicode character = StartingOffset[n] + xx

The positions of static windows are as given in Table 4 and cannot be changed. They cover character ranges which contain characters that tend to occur in isolation and therefore are suitable for access via non-locking shifts. Static window 0 is also used when bytes following an SCn or UCn are in the range 20 to 7F.

Table 4. Static Window Positions

Window	Starting Offset	Major Area Covered
0	0000	(for quoting of tags used in single-byte mode)
1	0080	Latin-1 Supplement
2	0100	Latin Extended-A
3	0300	Combining Diacritical Marks
4	2000	General Punctuation
5	2080	Currency Symbols
6	2100	Letterlike Symbols and Number Forms
7	3000	CJK Symbols & Punctuation
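
The interpretation of the byte following an SQn tag can be sketched as follows. The function name `decode_quoted_byte` is hypothetical, and the dynamic window positions are passed in by the caller; the list shown holds the default positions from Table 5.

```python
# Static window positions (Table 4); these never change.
STATIC_OFFSETS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]

# Default dynamic window positions (Table 5); a real decoder would
# update its copy whenever an SDn, UDn, SDX or UDX tag is processed.
DEFAULT_DYNAMIC_OFFSETS = [0x0080, 0x00C0, 0x0400, 0x0600,
                           0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_quoted_byte(n: int, xx: int, dynamic_offsets: list) -> int:
    """Return the code point for data byte xx following an SQn tag."""
    if not 0 <= n <= 7:
        raise ValueError("window index must be in 0..7")
    if xx < 0x80:
        # Lower half: static window n.
        return STATIC_OFFSETS[n] + xx
    # Upper half: dynamically positioned window n.
    return dynamic_offsets[n] + (xx - 0x80)
```

For example, the byte pair 08 00 that opens the Japanese sample in Section 9.3 is SQ7 followed by 00, which selects U+3000 from static window 7.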

6.2.2 Use of SQ0

SQ0 is used specifically to quote characters that would otherwise collide with tag bytes. It may not be used with bytes in the range 20 to 7F. These values shall not be used by encoders. Decoders are not required to detect them as errors. Note that this restriction applies only to SQ0, which maps to ASCII. SQ1 to SQ7 may be followed by any byte value.

As in the general case of SQn, a following byte value in the range 80 to FF indicates use of dynamically positioned window 0.

7 Initial State

The initial state of encoder and decoder is as follows:

single byte mode
locking shift
window 0 as the active window
all windows in their default positions

Note: For APIs or data streams mixing text and data it is expected that encoder and decoder are reinitialized at the beginning of each string, or compressible chunk of text data.

7.1 Initial Window Settings

Encoder and Decoder are initialized with certain default settings for the windows. These allow use of the windows without predefining them, saving a few bytes for common cases. Encoder and Decoder always start with dynamically-positioned window 0 active, so a string of characters that consists entirely of characters from the range U+0020..U+00FF plus CR, LF, TAB is effectively converted to ISO 8859-1.

Default positions are assigned based on the following criteria:

Dynamically positioned windows: Frequently occurring ranges of characters which commonly appear in runs containing characters in the selected range or intermixed with characters in the range U+0020..U+007F.
Static windows: ranges of characters which commonly occur in isolation.

The choice of offsets is intended to enable handling most languages by requiring at most the definition of one extra window, at the cost of a single byte. The default settings of the dynamically positioned windows are shown in Table 5. The static window positions are fixed and are shown above in Table 4.

Table 5. Default Positions for Dynamically Positioned Windows

Window	Starting Offset	Major Area Covered
0	0080	Latin-1 Supplement
1	00C0	(combined partial Latin-1 Supplement/Latin Extended-A)
2	0400	Cyrillic
3	0600	Arabic
4	0900	Devanagari
5	3040	Hiragana
6	30A0	Katakana
7	FF00	Fullwidth ASCII

8 Notes

8.1 Surrogate Pairs

A supplementary character, i.e. a character corresponding to a surrogate pair in UTF-16, can be encoded in any of these ways:

in Unicode mode, as a surrogate pair.
in Single byte mode, as a surrogate pair, with each value quoted: SQU hbyte1 lbyte1 SQU hbyte2 lbyte2.
any otherwise legal combination of the above
or in Single byte mode, as a single byte, by setting a dynamically positioned window to the appropriate position using an SDX or UDX tag.

It is not possible to set a window to the surrogate range, such that one byte would represent one half of a surrogate pair. However, it is not required that the encoding for both halves of a surrogate pair use the same method.

Note: All conformant decoders that output UTF-8 or UTF-32 must be prepared to convert surrogate pairs to characters, even for the case SQU hbyte1 lbyte1 SQU hbyte2 lbyte2.
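
A decoder that produces UTF-32 might first collect the 16-bit values it decodes (for example from two consecutive SQU quotes) and then join adjacent surrogates. The following is a minimal sketch of that joining step; the name `combine_surrogates` is illustrative, and passing unpaired surrogates through unchanged is a choice of this sketch, not a requirement of this document.

```python
def combine_surrogates(units: list) -> list:
    """Turn a list of UTF-16 code units into a list of code points."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Join the pair into one supplementary code point.
            low = units[i + 1]
            out.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out
```

Applied to the values quoted by the byte sequence 0E DB FF 0E DF FF, that is DBFF and DFFF, the function yields the single code point 10FFFF.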

8.2 Private Use Area

A character in the Private Use Area on the BMP can be encoded in any of these ways:

in Unicode mode, by quoting with UQU.
in Unicode mode, if above F2FF, with no quoting.
in Single byte mode, by quoting with SQU.
in Single byte mode, as a single byte, by setting a dynamically positioned window to the required position in the Private Use Area using an SDn or UDn tag.

8.3 Tag Allocation

The tag byte values used in single mode are shown in Table 6. In this table, "pass" means that the byte value (xx) represents the Unicode code point U+00xx.

Table 6. Single Mode Tag Values

Name	Value	Comment
pass	00	NUL
SQ0 - SQ7	01 - 08
pass	09	HT
pass	0A	LF
SDX	0B
reserved	0C	reserved for future use
pass	0D	CR
SQU	0E
SCU	0F
SC0 - SC7	10 - 17
SD0 - SD7	18 - 1F
pass	20 - 7F
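
As an informal aid, Table 6 can be restated as a classification function. The name `single_byte_tag` and the label "window" for bytes 80 to FF (which are not tags and therefore do not appear in the table) are assumptions of this sketch.

```python
def single_byte_tag(b: int) -> str:
    """Name the role of byte b when read in single byte mode."""
    if 0x01 <= b <= 0x08:
        return f"SQ{b - 0x01}"
    if b == 0x0B:
        return "SDX"
    if b == 0x0C:
        return "reserved"
    if b == 0x0E:
        return "SQU"
    if b == 0x0F:
        return "SCU"
    if 0x10 <= b <= 0x17:
        return f"SC{b - 0x10}"
    if 0x18 <= b <= 0x1F:
        return f"SD{b - 0x18}"
    if b >= 0x80:
        return "window"   # data byte in the active dynamic window
    return "pass"         # 00, 09, 0A, 0D and 20..7F
```

For example, the first byte of the Russian sample in Section 9.2, 12, is classified as SC2.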

The tag byte values used in Unicode mode are shown in Table 7. In this table MSB means that the byte value is used as the most significant byte of a two byte sequence representing a Unicode code point on the BMP. There are no restrictions on the values of the byte immediately following an MSB.

Table 7. Unicode Mode Tag Values

Name	Value	Comment
MSB	00 - DF	Start of a Unicode character
UC0 - UC7	E0 - E7
UD0 - UD7	E8 - EF
UQU	F0
UDX	F1
reserved	F2	reserved for future use
MSB	F3 - FF	Start of a Unicode character
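
Table 7 can be restated in the same informal way, together with the resulting test for whether a 16-bit value must be quoted with UQU in Unicode mode. Both function names are hypothetical.

```python
def unicode_mode_tag(b: int) -> str:
    """Name the role of byte b when read at a character boundary
    in Unicode mode."""
    if 0xE0 <= b <= 0xE7:
        return f"UC{b - 0xE0}"
    if 0xE8 <= b <= 0xEF:
        return f"UD{b - 0xE8}"
    if b == 0xF0:
        return "UQU"
    if b == 0xF1:
        return "UDX"
    if b == 0xF2:
        return "reserved"
    return "MSB"   # 00..DF and F3..FF start a Unicode character


def needs_uqu(code_unit: int) -> bool:
    """True if the most significant byte of a 16-bit value collides
    with a Unicode mode tag value, so the value must be quoted."""
    return 0xE0 <= (code_unit >> 8) <= 0xF2
```

This agrees with Section 8.2: private use characters above F2FF can be written in Unicode mode without quoting, while those from E000 to F2FF cannot.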

8.4 Signature Byte Sequence for SCSU (informative)

Where data streams are not tagged externally, it is useful to provide a signature at the beginning of the stream. For UTF-16, UTF-32 and UTF-8, this is done by the use of U+FEFF, a value chosen to not only allow identification of the text as Unicode, but also to distinguish little-endian from big-endian forms of UTF-16 and UTF-32. For more information on the general use of signatures, see The Unicode Standard, Version 3.0, Section 13.6.

Unlike the standard encoding forms, SCSU does not have a single representation for U+FEFF. Depending on the implementation of an SCSU encoder, and depending on the following text, a leading U+FEFF character could be encoded as one of these initial byte sequences (hexadecimal, not showing following text):

Bytes	Commands	Comment
0E FE FF	SQU FE FF	Single-byte mode Quote Unicode. Recommended.
0F FE FF	SCU FE FF	Single-byte mode Change to Unicode
18 A5 FF	SD0 A5 FF	Single-byte mode Define dynamic window 0 to 0xFE80
19 A5 FF	SD1 A5 FF	Single-byte mode Define dynamic window 1 to 0xFE80
1A A5 FF	SD2 A5 FF	Single-byte mode Define dynamic window 2 to 0xFE80
1B A5 FF	SD3 A5 FF	Single-byte mode Define dynamic window 3 to 0xFE80
1C A5 FF	SD4 A5 FF	Single-byte mode Define dynamic window 4 to 0xFE80
1D A5 FF	SD5 A5 FF	Single-byte mode Define dynamic window 5 to 0xFE80
1E A5 FF	SD6 A5 FF	Single-byte mode Define dynamic window 6 to 0xFE80
1F A5 FF	SD7 A5 FF	Single-byte mode Define dynamic window 7 to 0xFE80

It is recommended to use only the byte sequence <0E FE FF> for an initial U+FEFF character (0E is the "SQU" tag). This convention will assist receiving processes that use initial byte sequences to identify a data file or stream as being encoded in SCSU. Every SCSU encoder should write this particular initial byte sequence if a U+FEFF is encountered as the first character in the stream. Any further occurrences of this character may be encoded in the most compact way possible with SCSU.

Note: The recommended sequence is the only one that does not affect the state of the encoder or decoder, and may be safely stripped by a receiver even before initiating a decoder.

A process reading text from a file or stream could interpret the initial bytes <0E FE FF> as a signature for SCSU and assume the file or stream to be encoded with SCSU. The process or SCSU decoder may or may not strip the initial U+FEFF character from the resulting text. Any other encoding of an initial U+FEFF character, and any encoding of a U+FEFF after the initial character, are normally interpreted as a ZWNBSP.

Note: If the input text starts with a U+FEFF that is to be interpreted as a ZWNBSP, then an encoder or sending process may prepend the text with another U+FEFF which may be safely recognized as an SCSU signature and stripped by a receiving process. Otherwise, the initial ZWNBSP could itself be misinterpreted as a signature and stripped by a receiving process. This is equivalent to sending and receiving text in UTF-16 or UTF-32.

A signature should not be used where a protocol specification, database design, or out-of-band information or similar specifies the encoding.
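
A receiving process could test for the recommended signature along the following lines. This is a sketch under the assumption that the process chooses to strip the signature; the name `strip_scsu_signature` is illustrative.

```python
from typing import Tuple

# SQU FE FF: the recommended encoding of an initial U+FEFF.
SCSU_SIGNATURE = bytes([0x0E, 0xFE, 0xFF])


def strip_scsu_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (signature_found, remaining_bytes).

    Only the recommended form is recognized. Because SQU does not
    change the decoder state, the remaining bytes can be handed to a
    freshly initialized decoder.
    """
    if data.startswith(SCSU_SIGNATURE):
        return True, data[len(SCSU_SIGNATURE):]
    return False, data
```

The other encodings of an initial U+FEFF listed above, such as 0F FE FF, are deliberately left untouched, since they alter the decoder state and are normally interpreted as a ZWNBSP.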

8.5 Worst Case Behavior (informative)

By using SCU + (input string in UTF-16) almost all Unicode strings can be represented with the same number of bytes as their UTF-16 encoding + 1 byte. The exceptions are strings containing those private use characters for which the MSB collides with the tag byte values. These characters must be quoted with SQU or UQU, requiring 3 bytes instead of 2 bytes per character. Therefore, an absolute upper bound of required SCSU length is 3 bytes per UTF-16 code unit. (See also section 5.2.1). This upper bound is reached only for strings of n characters containing at least n-1 private use characters subject to the quoting requirement.
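
The SCU + UTF-16 representation described above can be sketched as follows. The function name `encode_scu_utf16` is hypothetical, and the sketch exists only to illustrate the size bound: it switches to Unicode mode immediately, so it does not honor the requirement in Section 4 that encoders remain in Single Byte Mode for initial Latin-1 text. Well-formed input is assumed.

```python
SCU = 0x0F   # single byte mode tag: change to Unicode mode
UQU = 0xF0   # Unicode mode tag: quote the following two bytes


def encode_scu_utf16(text: str) -> bytes:
    """Encode text as SCU followed by big-endian UTF-16, quoting any
    code unit whose most significant byte collides with a tag."""
    out = bytearray([SCU])
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        hi, lo = utf16[i], utf16[i + 1]
        if 0xE0 <= hi <= 0xF2:
            out.append(UQU)          # 3 bytes for this code unit
        out += bytes((hi, lo))
    return bytes(out)
```

The output length is 1 byte plus 2 bytes per code unit, with 1 extra byte for each code unit in the range E000 to F2FF.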

Since the characters requiring SQU or UQU are in the BMP, an SCSU encoded string is never required to be longer than four bytes per character. In other words, it is never longer than its UTF-32 encoding. For supplementary characters there is no need for a 1 byte overhead, since any supplementary character can be represented using four bytes in SCSU by using SDX. (See also section 6.1.3).

A Unicode string consisting entirely of certain control characters will take up twice as much space when encoded in SCSU as when encoded in UTF-8, since each control character must be individually quoted with SQ0. (See also section 5.1).

All of these upper bounds can be exceeded, if an encoder deliberately chooses a particularly inefficient representation, such as using SQU or UQU to quote each surrogate separately for characters in the supplementary code space (see also section 8.1), or inserting redundant tags.

Typical compression of average text is markedly better than the worst case behavior and tends to be better than the shorter of the UTF-8 or UTF-16 encoding of the given character string.

8.6 XML Suitability (informative)

SCSU can be used for XML or HTML or similar documents if attention is paid to the in-document encoding declaration. The process emitting the document should place the encoding declaration at the earliest possible place, before any non-Latin-1 characters. Such documents can be parsed properly up to and including the encoding declaration, because many document parsers initially assume ASCII-compatible encodings. (See also Section F of XML 1.0.)

An SCSU encoder is XML-Suitable if it encodes all initial Latin-1 text (code points U+0000, U+0009, U+000A, U+000D, U+0020..U+00FF) in the shortest possible form. That is, it uses Single Byte Mode without SQ0, SC0 or any other commands. This encodes initial Latin-1 text with the same bytes as with ISO 8859-1. Note that it would be unusual for an SCSU encoder to not encode initial Latin-1 text in the shortest form, so most existing SCSU encoders are XML-Suitable.

If there were an initial U+FEFF indicating a Unicode encoding signature, it would be encoded with SQU (see Section 8.4 Signature Byte Sequence for SCSU). However, many HTML and XML parsers do not recognize Unicode encoding signatures other than for UTF-16, so such a signature should not be used with XML and HTML documents.

8.7 Minimal Encoder (informative)

While it is straightforward to write an SCSU decoder, writing an encoder may seem complicated because there are many ways to encode the same text. The choices that are made for an implementation affect the achievable compression ratio.

However, it is quite simple to write a minimal SCSU encoder that still produces valid and reasonable, even XML-suitable, output. The scsumini.c sample C code demonstrates this; its encoder function consists of about 75 lines of C code and uses only one integer state variable (for single-byte vs. Unicode mode and the current window). It uses most SCSU commands, including quoting from and switching to all pre-defined windows, but does not define dynamic windows and does not use any look-ahead.

This kind of encoder is sufficient for small amounts of text (like web form data sent back to a server) and generally for text with mostly Latin/Cyrillic/Arabic/Devanagari/Japanese characters and CJK ideographs.

Even an encoder with good compression performance is relatively easy to write. Most of the choices are obvious. For example: Use all static and dynamic windows; use the current window if possible; use a static window if a matching character is found; switch to Unicode mode for uncompressible text; switch to an already-defined window if a matching character is found; quote a standalone character; define a new window for a string of compressible characters. (This is essentially what the [ICU] SCSU converter does, with a one-character look-ahead.)

For optimal compression, an encoder would have to look ahead several characters and probably compare multiple alternatives for sections of the text. The compression of normal text may improve only by a relatively small percentage compared to the strategy outlined in the previous paragraph.

9 Examples (informative)

9.1 German

German can be written using only Basic Latin and the Latin-1 supplement, so all characters above 0x0080 use the default position of dynamically positioned window 0.

Unicode characters (9 characters):

00D6 006C 0020 0066 006C 0069 0065 00DF 0074

Compressed (9 bytes):

D6 6C 20 66 6C 69 65 DF 74
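As a worked check of this example, the following sketch maps each character either to itself (below 0x80) or into dynamic window 0 at its default offset 0x0080. The variable names are illustrative only. The shortcut applies only to text without control characters that collide with SCSU tags.

```python
text = "\u00d6l flie\u00dft"  # the nine characters of the example
WINDOW0_OFFSET = 0x0080       # default position of dynamic window 0

# Bytes 0x80..0xFF address the current window: byte = 0x80 + (code point - offset).
scsu = bytes(
    ord(ch) if ord(ch) < 0x80 else 0x80 + (ord(ch) - WINDOW0_OFFSET)
    for ch in text
)
print(scsu.hex(" ").upper())  # prints "D6 6C 20 66 6C 69 65 DF 74"
```

Because window 0 starts at U+0080, the result for such text is byte-for-byte the same as its Latin-1 encoding.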

9.2 Russian

Russian can use the default position of window 2. The first byte of the compressed data is the tag SC2.

Unicode characters (6 characters):

041C 043E 0441 043A 0432 0430

Compressed (7 bytes):

12 9C BE C1 BA B2 B0
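The arithmetic behind this example can be checked with a short sketch. It assumes the default position 0x0400 of dynamic window 2 and handles nothing beyond the leading SC2 tag and window bytes; the variable names are illustrative.

```python
SC2 = 0x12               # single-byte mode tag: change to dynamic window 2
WINDOW2_OFFSET = 0x0400  # default position of dynamic window 2

compressed = bytes.fromhex("12 9C BE C1 BA B2 B0")
if compressed[0] != SC2:
    raise ValueError("expected the SC2 tag first")

# Each following byte b in 0x80..0xFF stands for offset + (b - 0x80).
code_points = [WINDOW2_OFFSET + (b - 0x80) for b in compressed[1:]]
print(" ".join(f"{cp:04X}" for cp in code_points))
# prints "041C 043E 0441 043A 0432 0430"
```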

9.3 Japanese

Japanese text almost always benefits from the multiple predefined windows in SCSU.

Unicode characters (116 characters):

3000 266a 30ea 30f3 30b4 53ef 611b 3044 3084 53ef 611b 3044 3084 30ea 30f3 30b4 3002 534a 4e16 7d00 3082 524d 306b 6d41 884c 3057 305f 300c 30ea 30f3 30b4 306e 6b4c 300d 304c 3074 3063 305f 308a 3059 308b 304b 3082 3057 308c 306a 3044 3002 7c73 30a2 30c3 30d7 30eb 30b3 30f3 30d4 30e5 30fc 30bf 793e 306e 30d1 30bd 30b3 30f3 300c 30de 30c3 30af ff08 30de 30c3 30ad 30f3 30c8 30c3 30b7 30e5 ff09 300d 3092 3001 3053 3088 306a 304f 611b 3059 308b 4eba 305f 3061 306e 3053 3068 3060 3002 300c 30a2 30c3 30d7 30eb 4fe1 8005 300d 306a 3093 3066 8a00 3044 65b9 307e 3067 3042 308b 3002

Compressed (178 bytes):

08 00 1b 4c ea 16 ca d3 94 0f 53 ef 61 1b e5 84 c4 0f 53 ef 61 1b e5 84 c4 16 ca d3 94 08 02 0f 53 4a 4e 16 7d 00 30 82 52 4d 30 6b 6d 41 88 4c e5 97 9f 08 0c 16 ca d3 94 15 ae 0e 6b 4c 08 0d 8c b4 a3 9f ca 99 cb 8b c2 97 cc aa 84 08 02 0e 7c 73 e2 16 a3 b7 cb 93 d3 b4 c5 dc 9f 0e 79 3e 06 ae b1 9d 93 d3 08 0c be a3 8f 08 88 be a3 8d d3 a8 a3 97 c5 17 89 08 0d 15 d2 08 01 93 c8 aa 8f 0e 61 1b 99 cb 0e 4e ba 9f a1 ae 93 a8 a0 08 02 08 0c e2 16 a3 b7 cb 0f 4f e1 80 05 ec 60 8d ea 06 d3 e6 0f 8a 00 30 44 65 b9 e4 fe e7 c2 06 cb 82

9.4 All Features

The following sample compressed string contains all the features of the compression scheme, but limited to only representative instances of the eight SQn and the seventeen SCn/UCn, SDn/UDn, and SDX/UDX pairs. The text is repeated to demonstrate how the same substring can yield different compressed strings.

UTF-16 code units (20 code units, 18 characters):

0041 00df 0401 015f 00df 01df f000 dbff dfff 000d 000a 0041 00df 0401 015f 00df 01df f000 dbff dfff

Compressed (35 bytes):

41 df 12 81 03 5f 10 df 1b 03 df 1c 88 80 0b bf ff ff 0d 0a 41 10 df 12 81 03 5f 10 df 13 df 14 80 15 ff
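To make the byte sequence above easier to follow, here is a compact decoder sketch. It is not the sample implementation that accompanies this report. The function and table names are hypothetical, the window tables are transcribed from Tables 3, 4, and 5, and truncated input is not handled gracefully. It returns UTF-16 code units so that the result can be compared directly with the listing above.

```python
STATIC = [0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000]    # Table 4
DEFAULTS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]  # Table 5
SPECIAL = {0xF9: 0x00C0, 0xFA: 0x0250, 0xFB: 0x0370, 0xFC: 0x0530,
           0xFD: 0x3040, 0xFE: 0x30A0, 0xFF: 0xFF60}  # fixed entries of Table 3


def window_offset(x):
    """Map the byte following SDn/UDn to a window offset (Table 3)."""
    if 0x01 <= x <= 0x67:
        return x * 0x80
    if 0x68 <= x <= 0xA7:
        return x * 0x80 + 0xAC00
    if x in SPECIAL:
        return SPECIAL[x]
    raise ValueError(f"reserved window offset byte {x:#04x}")


def extended_window(hi, lo):
    """Split the two bytes after SDX/UDX into (window number, offset)."""
    return hi >> 5, 0x10000 + 0x80 * ((hi & 0x1F) << 8 | lo)


def to_utf16(cp):
    """Return the UTF-16 code unit(s) for one code point."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]


def decode_scsu(data):
    """Decode SCSU bytes into a list of UTF-16 code units."""
    dynamic, current, unicode_mode, out = list(DEFAULTS), 0, False, []
    it = iter(data)
    for b in it:
        if unicode_mode:
            if 0xE0 <= b <= 0xE7:      # UCn: single-byte mode, select window n
                current, unicode_mode = b - 0xE0, False
            elif 0xE8 <= b <= 0xEF:    # UDn: define window n, single-byte mode
                current, unicode_mode = b - 0xE8, False
                dynamic[current] = window_offset(next(it))
            elif b == 0xF0:            # UQU: quote one code unit
                out.append(next(it) << 8 | next(it))
            elif b == 0xF1:            # UDX: define extended window, single-byte mode
                current, offset = extended_window(next(it), next(it))
                dynamic[current], unicode_mode = offset, False
            elif b == 0xF2:
                raise ValueError("reserved tag 0xF2 in Unicode mode")
            else:                      # big-endian UTF-16 code unit
                out.append(b << 8 | next(it))
        elif b >= 0x80:                # character in the current dynamic window
            out += to_utf16(dynamic[current] + b - 0x80)
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(b)              # ASCII and untagged controls
        elif 0x01 <= b <= 0x08:        # SQn: quote from static or dynamic window n
            q = next(it)
            base = STATIC[b - 1] if q < 0x80 else dynamic[b - 1] - 0x80
            out += to_utf16(base + q)
        elif b == 0x0B:                # SDX: define extended window
            current, offset = extended_window(next(it), next(it))
            dynamic[current] = offset
        elif b == 0x0E:                # SQU: quote one code unit
            out.append(next(it) << 8 | next(it))
        elif b == 0x0F:                # SCU: switch to Unicode mode
            unicode_mode = True
        elif 0x10 <= b <= 0x17:        # SCn: select window n
            current = b - 0x10
        elif 0x18 <= b <= 0x1F:        # SDn: define and select window n
            current = b - 0x18
            dynamic[current] = window_offset(next(it))
        else:
            raise ValueError(f"reserved tag {b:#04x} in single-byte mode")
    return out
```

Applied to the 35 compressed bytes above, the sketch yields the 20 UTF-16 code units listed. For instance, the bytes 1c 88 define window 4 at 0x88 * 0x80 + 0xAC00 = 0xF000, and 0b bf ff define window 5 at 0x10000 + 0x80 * 0x1FFF = 0x10FF80, so that the following byte ff addresses U+10FFFF.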

10 Possible Private Extensions (informative)

During the design and review phase of the compression scheme, extensions were repeatedly suggested to handle the following two situations. Although these extensions were not accepted as part of the compression scheme itself, it was felt useful to document them here. While they do not form part of SCSU, they are examples of how certain problems could be solved by adding higher level protocols, for use by consenting parties.

10.1 Avoiding Control Byte Values

With a simple re-mapping, the SCSU encoded data stream can be made free of most control byte values so that it can be passed where ASCII text is expected. This re-mapping is not as costly as more general schemes for converting binary data to text and leaves the text parts of compressed Latin-1 text fully readable.

After encoding, replace each control byte with DLE (0x10) followed by the original byte value plus 0x40. For example, NUL (0x00) becomes DLE followed by '@' (0x40), and DLE itself becomes DLE followed by 'P' (0x50). Before decoding, perform the reverse transformation.
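The re-mapping can be sketched as follows. The text does not spell out the exact set of control bytes, so this sketch assumes the C0 range 0x00 to 0x1F; the function names are hypothetical.

```python
DLE = 0x10


def escape_controls(data: bytes) -> bytes:
    """Replace each byte below 0x20 (assumed control range) by DLE, byte + 0x40."""
    out = bytearray()
    for b in data:
        if b < 0x20:
            out += bytes([DLE, b + 0x40])
        else:
            out.append(b)
    return bytes(out)


def unescape_controls(data: bytes) -> bytes:
    """Reverse escape_controls: DLE, x becomes x - 0x40."""
    out = bytearray()
    it = iter(data)
    for b in it:
        out.append(next(it) - 0x40 if b == DLE else b)
    return bytes(out)
```

With this sketch, the compressed Russian example 12 9C BE C1 BA B2 B0 becomes 10 52 9C BE C1 BA B2 B0: the SC2 tag (0x12) turns into DLE followed by 'R' (0x52), and the only control byte left in the stream is DLE itself.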

10.2 Handling Runs of the Same Character

Longer runs of the same character allow additional compression. Since such runs are not common in the general case, run handling was omitted from the standard algorithm. For situations where sender and receiver can agree on the additional specification and where runs are common, the following method is suggested.

Before encoding, replace any run of 4 or more identical Unicode characters by '@' (U+0040), followed by the character to repeat, followed by a 16-bit count (packed into one Unicode character). For example, the sequence of 33 hyphens --------------------------------- becomes '@' '-' '!' (0x40, 0x2D, 0x21). Any occurrence of the '@' sign by itself is replaced by '@' '@' U+0001, that is, a run of length one. After decoding, perform the reverse operation.
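A minimal sketch of this run-length layer, applied to the text before SCSU encoding and after SCSU decoding, might look as follows. The function names are hypothetical. As an added assumption, the sketch caps the count at 0xD7FF so that the count character never falls into the surrogate range; longer runs are split.

```python
MARK = "@"
MAX_COUNT = 0xD7FF  # assumption: keep the count character below the surrogates


def rle_encode(text: str) -> str:
    """Replace runs of 4 or more identical characters, and every '@', by a triple."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        while i + run < len(text) and text[i + run] == ch and run < MAX_COUNT:
            run += 1
        if run >= 4 or ch == MARK:
            out.append(MARK + ch + chr(run))  # '@', character, count
        else:
            out.append(ch * run)              # short runs are copied unchanged
        i += run
    return "".join(out)


def rle_decode(text: str) -> str:
    """Expand every '@', character, count triple back into a run."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == MARK:
            out.append(text[i + 1] * ord(text[i + 2]))
            i += 3
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Because every '@' in the input is escaped, including runs of '@' shorter than four, the decoder can treat each '@' it sees as the start of a triple.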

References

[BOCU]	BOCU-1: MIME-Compatible Unicode Compression http://www.unicode.org/notes/tn6/ Binary Ordered Compression for Unicode (BOCU)
[FAQ]	Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues.
[Feedback]	Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html
[Glossary]	Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents.
[ICU]	International Components for Unicode http://oss.software.ibm.com/icu/
[Reports]	Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode]	The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1.
[Versions]	Versions of the Unicode Standard http://www.unicode.org/standard/versions/ For details on the precise contents of each version of the Unicode Standard, and how to cite them.

Acknowledgements

The authors would like to thank Dr. Laura Wideburg for assistance in copy editing. Thanks to David Pope, Doug Ewell and Roman Czyborra for bug reports. Markus Scherer proposed the signature sequence for SCSU. David Starner suggested a section on worst-case behavior.

Authors

The original concept of a standard compression scheme for Unicode was implemented at Reuters and proposed by Misha Wolf and Charles Wicksteed. Extensions and refinements were proposed by Mark Davis, Ken Whistler and Martin Duerst. The final text for the Technical Report and the sample implementations were created by Asmus Freytag. The Technical Report is now maintained by Markus Scherer.

Revisions

Note: none of the fixes imply a change to the specification.

Modifications

The following summarizes modifications from the previous version of this document.

3.6

Added 8.7 Minimal Encoder and the scsumini.c sample code.

3.5

Added recommendation to remain in Single Byte Mode for initial Latin-1 text, and an informative section about the resulting XML suitability.

1.0 - 3.4

1. Russian uses SC2 instead of SC7 as claimed in the examples.

2. The 'All Features' example has been corrected.

3. A new Japanese example has been added.

4. Changed Table 3 from

68..A7 x*80+AE00 half-blocks from U+E000 to U+FF80

to

68..A7 x*80+AC00 half-blocks from U+E000 to U+FF80

to match the correct value used in the sample code.

5. Corrected 1FFF to 1F in the offset calculation equation for defining extended windows.

6. Corrected a few minor typographical errors [6/5/99].

7. Corrected the dynamic offset for Window 1 in the sample code to 0x00C0 to match Table 5 of the specification (updated the internal version number of SCSU.java to 005 and commented the changed source line).

8. Changed methods in the expander from private to protected to support a minor update of the driver program. (Updated internal version number to 005 in Expand.java and added a comment).

9. Minor improvements to the driver program. (Updated internal version number to 005 in CompressMain.java)

10. Editorial reformatting. [11/12/99]

11. Added the section on use of signature and changed version to 3.1 (The sample programs have not been updated to implement this recommendation).

12. Fixed HTML validation error. [3/11/00]

13. Added an informative section on worst-case behavior [10/31/01].

14. Changed references to 'expansion space' to 'supplementary coding space', to be more in line with terminology introduced in Unicode 3.1.

15. Clarified that the "Unicode" data in Unicode Mode is UTF-16BE. This clarification is necessary since later versions of the Unicode Standard add UTF-8 and UTF-32 on an equal basis.

16. Clarified that SCSU is an encoding of a sequence of code points, independent of the encoding form. This makes no change to the specification, since nothing in the original wording required the uncompressed data to be in UTF-16.

17. Clarified that SQU and UQU may only be applied to characters on the BMP, which are represented by two bytes in SCSU.

18. In 6.2.1, corrected

Static window 0 is also used when bytes following an SCn or UCn are in the range 80 to FF.

to

Static window 0 is also used when bytes following an SCn or UCn are in the range 20 to 7F.

19. Corrected the example in section 10.2.

20. Changed styles and template.

21. Added section 2.3 to discuss limitations of SCSU. Added references. [05/08/02]

22. Changed "Unicode Values" to "code points" and made similar clarifications throughout.

Added restriction to remain in Single Byte Mode for initial Latin-1 text, and an informative section about the resulting XML suitability.

Copyright © 1999-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
