[Unicode]  Technical Reports
 

Working Draft Unicode Technical Standard #99

L2/04-016

BOCU-1:
MIME-Compatible Unicode Compression

Version 2
Authors Markus W. Scherer, Mark Davis
Date 2004-01-01
This Version http://www.unicode.org/reports/tr99/tr99-2.html (http://www.mindspring.com/~markus.scherer/unicode/tr-bocu/tr-bocu-20040123.html)
Previous Version http://www.unicode.org/notes/tn6/tn6-1.html
Latest Version http://www.unicode.org/reports/tr99
Base Unicode Version Unicode 2.0.0


Summary

This document describes a Unicode compression scheme that is MIME-compatible (directly usable for emails) and preserves binary order (for databases and sorted lists). BOCU-1 is a MIME-compatible application of the Binary Ordered Compression for Unicode [BOCU] base algorithm.

Status

This document is a proposed draft Unicode Technical Standard (from Unicode Technical Note #6). Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS. Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in the References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

Contents

  1. Introduction
    1. Generic BOCU
    2. Customizations for BOCU-1
  2. Features of BOCU-1
    1. Compression
    2. Binary Order
    3. MIME Compatibility
    4. Details
  3. Signature Byte Sequence
  4. Encoding Algorithm
    1. Difference Encoding
    2. Decoding
  5. Sample C Code
  6. BOCU-1 Performance
    1. SCSU vs. BOCU-1
    2. Test Setup
  7. References
  8. Acknowledgments
  9. Modifications

1 Introduction

The most popular encoding for Unicode text data exchange is UTF-8. It is relatively simple and widely applicable: MIME/email/HTML/XML, in-process use in some systems, etc. However, UTF-8 uses more bytes per code point for non-Latin scripts than language-specific codepages.

In markets where scripts other than Latin are used, the high bytes/code point ratio of UTF-8 (and of UTF-16 for scripts other than Latin, CJK ideographs, and Korean Hangul) has been criticized and cited as a reason not to use Unicode for the exchange of documents in some languages. Larger document sizes are also a problem in low-bandwidth environments such as wireless networks for handheld systems.

The Standard Compression Scheme for Unicode [SCSU] was created as a Unicode compression scheme with a byte/code point ratio similar to language-specific codepages. Although it fulfills the criteria for an IANA charset and is registered as one, it has not been widely adopted. [SCSU] is not suitable for MIME “text” media types, i.e., it cannot be used directly in emails and similar protocols, and it requires a relatively complicated encoder design for good performance (see [SCSU] for details about small vs. full encoders).

BOCU-1 combines the wide applicability of UTF-8 with the compactness of [SCSU]. It is useful for short strings and maintains code point order.

1.1 Generic BOCU

Binary Ordered Compression for Unicode [BOCU] employs a kind of difference encoding tailored to Unicode text. Each code point in a sequence is encoded as its difference from the previous code point: subtracting the previous code point from the current one yields a signed delta value in the range -10FFFF..+10FFFF. The delta is then encoded in a series of bytes; small differences are encoded in a small number of bytes, and larger differences in successively more bytes.

Compression is achieved because normal Unicode text contains sequences of code points from the same script block. Small alphabets are allocated in blocks of 128 code points, so that the difference between two code points from such a block can be encoded with a single byte. Large script blocks like CJK Unihan are covered with 2-byte-encoded differences.

In order to minimize the absolute values of differences and thus improve the compression, the previous code point is adjusted to the middle of a script block. This covers a 128-block with differences of -64..+63 around the center code point, instead of requiring differences of -127..+127 without the adjustment. A special range check for the larger CJK Unihan block ensures that it is covered with differences of about ±10500, which can be encoded with 2 bytes each.
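
As an illustration of this adjustment (not part of the specification), the following minimal sketch computes the adjusted previous code point for a character from a small alphabet and shows that another character from the same 128-block yields a difference well inside the single-byte range; the two Cyrillic code points are chosen only as examples.

#include <stdio.h>

int main(void) {
    /* Two Cyrillic letters from the same 128-block U+0400..U+047F
       (example code points, chosen for illustration only). */
    int c1=0x0434, c2=0x043e;

    /* Adjust the previous code point to the middle of c1's 128-block,
       as described above. */
    int prev=(c1&~0x7f)+0x40;       /* 0x0440 */

    /* The difference to any code point from the same block is in -64..+63,
       so it fits in a single byte. */
    int diff=c2-prev;               /* -2 here */
    printf("prev=U+%04X diff=%d\n", prev, diff);
    return 0;
}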

BOCU preserves binary code point order by allocating lead bytes for encoded differences such that increasing lead byte values correspond to increasing difference values. Because signed differences are encoded, the center of the lead byte range is used for a zero difference, and lead bytes for positive and negative differences are symmetrically arranged around it, with the lead byte values farther from the middle for larger absolute differences.

1.2 Customizations for BOCU-1

BOCU-1 differs from the generic algorithm mainly by using a different set of byte value ranges to achieve MIME compatibility. It encodes U+0000..U+0020 directly with byte values 00..20 and does not use the ASCII byte values for common C0 control codes (especially CR and LF) for anything else but encoding those control codes. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.

As a result, only the code points above U+0020 are encoded with multi-byte sequences for their differences, and the lead byte range for these sequences starts at 0x21. Some byte values below 0x20 corresponding to rarely-used C0 control codes are also used as trail bytes in multi-byte sequences.

BOCU-1 uses FF as a special reset byte; see 2.4 Details.

2 Features of BOCU-1

2.1 Compression

The byte/code point ratio is 1 for runs of code points from the same block of 0x80 code points (and for Hiragana), and 2 for runs of CJK Unihan code points, as with [SCSU]. All characters up to U+2DD4B, which includes almost all assigned Unicode characters, will always be encoded with at most 3 bytes each. The startup overhead is very low (similar to [SCSU]), which makes it useful for very short strings like individual names. The maximum number of bytes per code point is four.

With BOCU-1, texts in languages with Latin-based scripts take about the same amount of space as with UTF-8, while texts in all other languages take about 25%-60% less space. Compared to UTF-16, texts in all languages with small character repertoires take approximately half as much space in BOCU-1. For large character sets, i.e., Chinese/Japanese/Korean, texts are about the same size for UTF-16 and BOCU-1. 

2.2 Binary Order

The lexical order of BOCU-1 bytes is the same as the code point order of the original text — like UTF-8 but unlike [SCSU] — which allows the compression of large, sorted lists of strings. This makes BOCU-1 suitable for use in databases to reduce storage requirements as described above.

2.3 MIME Compatibility

The C0 control codes NUL, CR and LF and nine others are encoded with the same byte values as in US-ASCII, and those byte values are used only for the encoding of these twelve C0 control codes. This makes BOCU-1 suitable for MIME “text” media types, directly usable in emails, and generally “friendly” for ASCII-based tools. The SUB control and its byte value 1A are included in this set to avoid problems on DOS/Windows/OS/2 systems, where 1A is interpreted as an end-of-file marker in text mode.

2.4 Details

BOCU-1 is byte-oriented and stateful. Its inter-character state consists of a single integer. It is deterministic, i.e., the same complete input text is always encoded the same way by all encoders (unlike with [SCSU]). Like [SCSU] it can be classified as a Character Encoding Scheme (CES) or as a Transfer Encoding Syntax (TES). It is a “charset” according to IANA, and it is suitable for MIME “text” media types.

The state is reset at each C0 control (U+0000..U+001F, includes CR, LF, TAB). CRLF-separated lines do not affect each other’s encoding. Together with BOCU-1 being deterministic, this allows line-based file comparisons (diff) and makes BOCU-1 usable with RCS, CVS and other source-code control systems (unlike [SCSU]). This also allows some limited random access.

Byte values for single-byte codes and lead bytes overlap with trail byte values. So, unlike with UTF-8, character boundaries cannot be determined by random access, except by backing up to a reset point. Byte values 7F..9F (DEL and C1 control codes) are used as lead and trail bytes. US-ASCII characters (code points U+0021..U+007E) are not encoded with the same bytes as in US-ASCII. Therefore, the charset must be specified with a signature byte sequence or in a higher-level protocol.

As a single/lead byte, the byte value FF is used as a special “reset-state-only” command; it does not encode any code point. FF is also a normal trail byte. Having a “reset only” command allows simple concatenation of BOCU-1 byte streams (all other BOCU-1 byte sequences append some code point). Note that using FF to reset the state breaks the binary ordering and the deterministic encoding, so the use of FF resets is discouraged. Concatenating byte streams without an FF reset requires scanning back to a C0 control whose byte value is not used for trail bytes (the last known reset to the initial state), decoding from there to the end of the first stream, encoding the first non-U+0020 code point of the second stream according to the current state, and then appending the rest of the second stream. The same procedure can be used to remove an FF reset command.
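
For illustration (not part of the specification), the simple concatenation enabled by the reset command amounts to joining two independently encoded byte streams with a single FF byte between them. The helper below is hypothetical, not part of the sample code, and, as noted above, the result is no longer binary-ordered or deterministically encoded.

#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): concatenate two BOCU-1 byte
   streams by inserting the FF reset command between them, so that the
   second stream is decoded starting from the initial state.
   dst must have room for len1+len2+1 bytes; returns the total length. */
size_t bocu1ConcatWithReset(uint8_t *dst,
                            const uint8_t *s1, size_t len1,
                            const uint8_t *s2, size_t len2) {
    memcpy(dst, s1, len1);
    dst[len1]=(uint8_t)0xff;        /* reset-state-only command */
    memcpy(dst+len1+1, s2, len2);
    return len1+1+len2;
}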

3 Signature Byte Sequence

An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
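
A decoder or a higher-level protocol that wants to detect this signature can simply check the first three bytes of the stream. The following minimal sketch is an illustration only; the function name and the sample buffer are made up and are not part of the sample code.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Return 1 if the buffer starts with the BOCU-1 signature
   (an initial U+FEFF, encoded as FB EE 28), 0 otherwise. */
static int hasBocu1Signature(const uint8_t *s, size_t length) {
    return length>=3 && s[0]==0xfb && s[1]==0xee && s[2]==0x28;
}

int main(void) {
    static const uint8_t sample[]={ 0xfb, 0xee, 0x28, 0x90 };  /* made-up buffer */
    printf("signature: %s\n",
           hasBocu1Signature(sample, sizeof(sample)) ? "yes" : "no");
    return 0;
}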

4 Encoding Algorithm

The basic algorithm is as described in Binary Ordered Compression for Unicode [BOCU].

BOCU-1 differs from the generic algorithm by using a different set of byte value ranges and by encoding U+0000..U+0020 directly with byte values 00..20. In addition, the space character U+0020 does not affect the state. This is to avoid large difference values at the beginning and end of each word.

BOCU-1 encodes exactly the Unicode code points U+0000..U+10FFFF. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in 16-bit Unicode strings). Proper input of supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded as single code points, not as pairs of surrogate code points.

Partial pseudo-code for a per-code point encoding function is as follows:

// inter-code point state variable to be passed into the encode() function
// previous code point, adjusted according to the code below
static int prev=0x40;   // initial setting in the middle of ASCII

// pass the "adjusted previous code point" state variable
// and the next Unicode code point;
// outputs the bytes for the code point and sets prev accordingly
encode(int &prev, int c) {
    if(c<=0x20) {       // space or C0 control
        output (byte)c; // output the character in its ASCII encoding
        if(c!=0x20) {   // do not change the state for a space
            prev=0x40;
        }
    } else {
        int diff=c-prev;
        // encode diff in 1..4 bytes and output them

        // adjust prev
        // usually, set it to the middle of a 128-block of code points
        // which works well with the assignment pattern of scripts in Unicode;
        // use custom adjustments for Hiragana, which is not 128-aligned,
        // and for CJK Unihan and Hangul syllables, which have much larger blocks
        if(0x3040<=c && c<=0x309f) {    // c is Hiragana
            prev=0x3070;                // middle of Hiragana
        } else if(0x4e00<=c && c<=0x9fa5) {  // c is CJK Unihan
            prev=0x7711;                // middle of CJK Unihan
                                        // (0x4e00+half of reach of 2-byte difference)
        } else if(0xac00<=c && c<=0xd7a3) {  // c is Hangul
            prev=0xc1d1;                // middle of Hangul=(0xd7a3+0xac00)/2
        } else {
            prev=(c&~0x7f)+0x40;        // middle of a 128-block
        }
    }
}

4.1 Difference Encoding

Any code point above U+0020 (space) is encoded with a multi-byte sequence representing the difference between the code point value and the prev state variable (see the diff variable in the pseudo-code above). The sequence is 1 to 4 bytes long; it is longer for larger absolute values of the difference. Consecutive code points from the same 128-block are encoded with single bytes each.

Most byte values are used both as lead/single bytes and as trail bytes, which allows large difference values to be encoded with few bytes but makes BOCU-1 unsuitable for text processing (as opposed to storage and data transfer). One exception: BOCU-1-encoded text can be efficiently compared in Unicode code point order, because that order is preserved in the lexical order of BOCU-1 byte streams (as long as the FF reset byte is not used).

The following table shows the use of each byte value. All but 13 byte values are used as trail bytes and are assigned “digit” values in ascending order.

Table 1: Byte value assignments

Byte value(s) | Lead/single byte use                                      | Trail byte “digit” values
FF            | reset prev to 0x40, does not encode any code point        | 14..F2 (20..242) = byte value - 0D (13)
FE            | lead byte for 4-byte sequences for positive differences   | = byte value - 0D
FB..FD        | lead bytes for 3-byte sequences for positive differences  | = byte value - 0D
D0..FA        | lead bytes for 2-byte sequences for positive differences  | = byte value - 0D
50..CF        | single bytes for differences -0x40..+0x3F                 | = byte value - 0D
25..4F        | lead bytes for 2-byte sequences for negative differences  | = byte value - 0D
22..24        | lead bytes for 3-byte sequences for negative differences  | = byte value - 0D
21            | lead byte for 4-byte sequences for negative differences   | = byte value - 0D
20            | single byte for space (U+0020), does not change the state | - (not a trail byte)
1C..1F        | single bytes for U+001C..U+001F (C0 controls)             | 10..13 (16..19)
1A..1B        | single bytes for U+001A..U+001B (C0 controls)             | - (not trail bytes)
10..19        | single bytes for U+0010..U+0019 (C0 controls)             | 06..0F (6..15)
07..0F        | single bytes for U+0007..U+000F (C0 controls)             | - (not trail bytes)
01..06        | single bytes for U+0001..U+0006 (C0 controls)             | 00..05 (0..5)
00            | single byte for U+0000 (NUL)                              | - (not a trail byte)

Differences of -0x40..+0x3F (-64..+63) are encoded with single bytes 0x50..0xCF by adding 0x90.

Differences outside of this range are encoded in a way similar to decimal or hexadecimal numbers, but with a number base of 243 (=256-13) instead of 10 or 16. Lead bytes (“lead digits”) are chosen such that byte sequences for lower difference values sort lexically before those for higher difference values.

Encode them with the following steps:

  1. Determine values for the encoding of the difference value according to table 2 below:
    1. The number of trail “digit” bytes.
    2. The base lead byte value.
    3. A difference offset: For a positive difference value, the lowest difference value that can be encoded with the same number of trail “digits”; for a negative difference value, the lowest difference value that can be encoded with one fewer trail “digit”.
  2. Subtract the difference offset (step 1.3) from the current difference value. This ensures that each difference value has only one possible encoding (avoiding a shortest-form problem) and extends the ranges of difference values encodable with a given number of bytes.
  3. For each trail “digit” from last to first:
    1. Calculate the quotient and remainder (t) of dividing the difference value (diff) by 243 (the number of possible trail byte values).
      • quotient=diff/243
      • t=diff%243 (modulo, division remainder)
      • Make sure that the modulo/division operations always yield non-negative modulo values and that the modulo and quotient values together compute back to the original difference value according to diff=(quotient*243)+t; adjust if necessary by adding 243 to t and subtracting 1 from the quotient.
    2. The trail “digit” value is the division remainder t. Encode it with a trail byte according to table 1 above.
    3. Continue with the quotient (which will always be negative if the original difference was negative): diff=quotient
  4. Encode the lead byte by adding the base lead byte value from step 1 to the final diff value (the last quotient) from step 3.
  5. Output the lead and trail bytes.

Table 2. Lookup and sample values for the difference encoding.

Diff. value (hex) | Diff. value (dec) | Final bytes  | Number of trail bytes | Base lead byte value | Diff. offset (hex) | Diff. offset (dec)
10FFFF            |   1114111         | FE 19 B4 94  | 3                     | FE                   | 2DD0C              | 187660
2DD0C             |    187660         | FE 01 01 01  | 3                     | FE                   | 2DD0C              | 187660
2DD0B             |    187659         | FD FF FF     | 2                     | FB                   | 2911               | 10513
2911              |     10513         | FB 01 01     | 2                     | FB                   | 2911               | 10513
2910              |     10512         | FA FF        | 1                     | D0                   | 40                 | 64
40                |        64         | D0 01        | 1                     | D0                   | 40                 | 64
3F                |        63         | CF           | 0                     | 90                   | 0                  | 0
1                 |         1         | 91           | 0                     | 90                   | 0                  | 0
0                 |         0         | 90           | 0                     | 90                   | 0                  | 0
-1                |        -1         | 8F           | 0                     | 90                   | 0                  | 0
-40               |       -64         | 50           | 0                     | 90                   | 0                  | 0
-41               |       -65         | 4F FF        | 1                     | 50                   | -40                | -64
-2911             |    -10513         | 25 01        | 1                     | 50                   | -40                | -64
-2912             |    -10514         | 24 FF FF     | 2                     | 25                   | -2911              | -10513
-2DD0C            |   -187660         | 22 01 01     | 2                     | 25                   | -2911              | -10513
-2DD0D            |   -187661         | 21 FF FF FF  | 3                     | 22                   | -2DD0C             | -187660
-10FFFF           |  -1114111         | 21 F0 58 79  | 3                     | 22                   | -2DD0C             | -187660
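
To make the encoding steps concrete, here is a minimal, non-normative sketch of the difference encoder in C; the sample C code in section 5 remains the full specification. It maps trail “digits” to trail bytes according to Table 1 and applies the base lead byte values and difference offsets from Table 2; the calls in main() reproduce several of the sample byte sequences from Table 2. The function names encodeDiff and trailByte are illustrative only.

#include <stdio.h>
#include <stdint.h>

/* Map a trail "digit" (0..242) to its BOCU-1 trail byte (Table 1). */
static uint8_t trailByte(int32_t digit) {
    if(digit<6)  { return (uint8_t)(digit+0x01); }   /* digits 00..05 -> bytes 01..06 */
    if(digit<16) { return (uint8_t)(digit+0x0a); }   /* digits 06..0F -> bytes 10..19 */
    if(digit<20) { return (uint8_t)(digit+0x0c); }   /* digits 10..13 -> bytes 1C..1F */
    return (uint8_t)(digit+0x0d);                    /* digits 14..F2 -> bytes 21..FF */
}

/* Encode one difference value (-0x10ffff..0x10ffff) into p[0..3];
   returns the number of bytes written (1..4). Illustration only. */
static int encodeDiff(int32_t diff, uint8_t *p) {
    int32_t lead, count, i;

    /* Step 1: number of trail bytes, base lead byte value, difference offset. */
    if(diff>=-0x40 && diff<=0x3f) {
        p[0]=(uint8_t)(0x90+diff);                   /* single-byte difference */
        return 1;
    } else if(diff>=0x40 && diff<=0x2910) {
        lead=0xd0; count=1; diff-=0x40;
    } else if(diff>=0x2911 && diff<=0x2dd0b) {
        lead=0xfb; count=2; diff-=0x2911;
    } else if(diff>=0x2dd0c) {
        lead=0xfe; count=3; diff-=0x2dd0c;
    } else if(diff>=-0x2911) {                       /* -0x2911..-0x41 */
        lead=0x50; count=1; diff+=0x40;
    } else if(diff>=-0x2dd0c) {                      /* -0x2dd0c..-0x2912 */
        lead=0x25; count=2; diff+=0x2911;
    } else {                                         /* -0x10ffff..-0x2dd0d */
        lead=0x22; count=3; diff+=0x2dd0c;
    }

    /* Step 3: emit trail "digits" from last to first, base 243,
       always keeping the remainder non-negative. */
    for(i=count; i>0; --i) {
        int32_t q=diff/243, t=diff%243;
        if(t<0) { t+=243; --q; }                     /* force 0 <= t <= 242 */
        p[i]=trailByte(t);
        diff=q;                                      /* continue with the quotient */
    }

    /* Step 4: lead byte = base lead byte value + final quotient. */
    p[0]=(uint8_t)(lead+diff);
    return (int)(count+1);
}

int main(void) {
    static const int32_t samples[]={ 0x10ffff, 0x40, -0x41, -0x10ffff };
    uint8_t bytes[4];
    int i, j, n;

    for(j=0; j<4; ++j) {
        n=encodeDiff(samples[j], bytes);
        printf("diff %8ld ->", (long)samples[j]);
        for(i=0; i<n; ++i) { printf(" %02X", bytes[i]); }
        printf("\n");       /* expect FE 19 B4 94,  D0 01,  4F FF,  21 F0 58 79 */
    }
    return 0;
}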

4.2 Decoding

A BOCU-1 byte stream is decoded (converted back to Unicode text) by reversing the encoding algorithm. Briefly: the decoder maintains the same prev state as the encoder; byte values 00..20 in lead position are output directly as the code points U+0000..U+0020 (resetting the state for C0 controls, leaving it unchanged for space); the byte FF only resets the state; and for every other lead byte the decoder determines the number of trail bytes, the base lead byte value, and the difference offset (Table 2), reassembles the difference from the trail bytes’ “digit” values (Table 1), adds it to prev to obtain the code point, and then adjusts prev exactly as the encoder does.

While there is no illegal lead/single byte, there are illegal trail bytes (see Table 1). Multi-byte sequences that result in code points outside of the Unicode range U+0000..U+10FFFF are illegal. Recovery from illegal input is not specified.
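
For symmetry, a minimal, non-normative sketch of the corresponding difference decoder is shown below; again, the sample C code in section 5 is the full specification. It inverts Table 1 (trail byte to “digit”) and Table 2 (lead byte to trail count, base value, and offset); error handling is reduced to returning 0 for byte values that are not valid in the given position. The function names decodeDiff and trailDigit are illustrative only.

#include <stdio.h>
#include <stdint.h>

/* Map a trail byte back to its "digit" value (Table 1);
   return -1 if the byte value is not used as a trail byte. */
static int32_t trailDigit(uint8_t b) {
    if(b>=0x21)            { return b-0x0d; }        /* bytes 21..FF -> 14..F2 */
    if(b>=0x1c && b<=0x1f) { return b-0x0c; }        /* bytes 1C..1F -> 10..13 */
    if(b>=0x10 && b<=0x19) { return b-0x0a; }        /* bytes 10..19 -> 06..0F */
    if(b>=0x01 && b<=0x06) { return b-0x01; }        /* bytes 01..06 -> 00..05 */
    return -1;                                       /* not a trail byte */
}

/* Decode one difference from p[] (lead byte plus trail bytes); returns the
   number of bytes consumed, or 0 for an illegal byte. Illustration only;
   the single bytes 00..20 and the FF reset command are handled elsewhere. */
static int decodeDiff(const uint8_t *p, int32_t *diff) {
    int32_t lead=p[0], value, count, offset, i, d;

    /* Invert Table 2: lead byte -> initial quotient, trail count, offset. */
    if(lead>=0x50 && lead<=0xcf) {                   /* single-byte difference */
        *diff=lead-0x90;
        return 1;
    } else if(lead>=0xd0 && lead<=0xfa) {
        value=lead-0xd0; count=1; offset=0x40;
    } else if(lead>=0xfb && lead<=0xfd) {
        value=lead-0xfb; count=2; offset=0x2911;
    } else if(lead==0xfe) {
        value=0;         count=3; offset=0x2dd0c;
    } else if(lead>=0x25 && lead<=0x4f) {
        value=lead-0x50; count=1; offset=-0x40;
    } else if(lead>=0x22 && lead<=0x24) {
        value=lead-0x25; count=2; offset=-0x2911;
    } else if(lead==0x21) {
        value=-1;        count=3; offset=-0x2dd0c;
    } else {
        return 0;                                    /* not a multi-byte lead byte */
    }

    /* Reassemble the difference from the trail "digits", base 243. */
    for(i=1; i<=count; ++i) {
        d=trailDigit(p[i]);
        if(d<0) { return 0; }                        /* illegal trail byte */
        value=value*243+d;
    }
    *diff=value+offset;
    return (int)(count+1);
}

int main(void) {
    static const uint8_t sample[]={ 0xfe, 0x19, 0xb4, 0x94 };  /* from Table 2 */
    int32_t diff;
    if(decodeDiff(sample, &diff)>0) {
        printf("diff=%lX\n", (long)diff);            /* expect 10FFFF */
    }
    return 0;
}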

5 Sample C Code

The sample C code serves as the full specification of BOCU-1. Every conformant encoder and decoder must generate equivalent output and detect any illegal input code points and illegal input byte sequences. Recovery from illegal input is not specified. Single surrogates are encoded if present in the input (e.g., unmatched single surrogate code units in UTF-16). Proper input of supplementary code points (e.g., matched surrogate pairs in UTF-16) must be encoded as single code points, not as pairs of surrogate code points.

This code uses International Components for Unicode [ICU] standard headers and the one implementation file icu/source/common/utf_impl.c. (It is not necessary to link the entire [ICU] common library.) This is for convenience in the surrounding test functions and not necessary for the core BOCU-1 functions. These headers and implementation file provide the following:

This code is under the X license (ICU version) [ICULicense].

Files:

A complete, compiled sample executable for Windows® from this source code is available for download. Aside from basic implementation and consistency tests, this also provides file conversion between UTF-8 and BOCU-1. Use a command-line argument of “?” or “-h” for usage.

6 BOCU-1 Performance

This is a performance comparison between BOCU-1, UTF-8, and [SCSU] when implemented as [ICU] converters, which convert to and from UTF-16. The time is for roundtrip conversions from the internal UTF-16 form to the encoding and back. Values are relative to UTF-8.

Languages                   | Size of text (BOCU-1) | Time to convert to/from UTF-16 (BOCU-1) | Size of text (SCSU) | Time to convert to/from UTF-16 (SCSU)
English/French              | 100%                  | 160..170%                               | 100%                | 125%
Greek/Russian/Arabic/Hebrew | 60%                   | 65..70%                                 | 55%                 | 70%
Hindi                       | 40%                   | 45%                                     | 40%                 | 45%
Thai (see below)            | 45%                   | 60%                                     | 40%                 | 55%
Japanese                    | 60%                   | 150%                                    | 55%                 | 110%
Korean                      | 75%                   | 155%                                    | 85%                 | 70%
Chinese                     | 75%                   | 165%                                    | 70%                 | 65%

(Smaller percentages are better. Percentages are rounded to the nearest 5%.)

The compression gain is smaller for web pages because of the large amount of ASCII in the HTML markup. The performance difference tends to be smaller for smaller buffers. When text is transmitted between machines (emails, web pages), the transmission time may swamp the conversion time; smaller text then transmits faster.

6.1 SCSU vs. BOCU-1

6.2 Test Setup

The texts are the “What is Unicode” pages from www.unicode.org, except for Thai. Note that english.html contains non-ASCII characters in the index sidebar. The Thai text, th18057.txt, has a different structure: it is a Thai word list from [ICU]’s test data, with one Thai word on each line. Compared to the other texts, it contains only a few characters between CRLF line terminators.

This comparison uses full-fledged [ICU] converters for UTF-8, [SCSU] and BOCU-1. “Full-fledged [ICU] converter” means that the conversion goes through the [ICU] conversion API, which is designed for external encodings and is used, for example, by XML parsers and web browsers.

The [ICU] converter code for [SCSU] and BOCU-1 that is tested here is part of [ICU] 2.2. The [SCSU] converter was optimized slightly (conversion function variants without offset handling). The BOCU-1 converter is optimized compared to the reference code in the design document. The test machine is a Pentium 2 laptop.

References

[BOCU] Binary Ordered Compression for Unicode
http://oss.software.ibm.com/icu/docs/papers/binary_ordered_compression_for_unicode.html
Describes the base algorithm for BOCU-1.
[ICU] International Components for Unicode
http://oss.software.ibm.com/icu/
Open-source C/C++ and Java libraries for Unicode and I18N support.
[ICULicense] The X license, modified for ICU
http://oss.software.ibm.com/cvs/icu/~checkout~/icu/license.html
[SCSU] Unicode Technical Standard #6: A Standard Compression Scheme for Unicode
http://www.unicode.org/reports/tr6/
[Versions] Versions of the Unicode Standard
http://www.unicode.org/standard/versions/
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.
 

Acknowledgments

Thanks to Doug Ewell for valuable feedback on this document.

Modifications

The following summarizes modifications from the previous version of this document.

2 Changed into a Unicode Technical Standard; added a formal description. The sample C code has minor fixes for 16-bit compilers.
1 Initial version

Copyright © 2002-2004 Unicode, Inc., Markus W. Scherer, Mark Davis. All Rights Reserved. The Unicode Consortium and authors make no expressed or implied warranty of any kind, and assume no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.