Unicode Technical Report #26

Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)

Authors/Editors Rick McGowan, Toby Phipps (tphipps@peoplesoft.com)
Date 2011-12-09
This Version http://www.unicode.org/reports/tr26/tr26-4.html
Previous Version http://www.unicode.org/reports/tr26/tr26-3.html
Latest Version http://www.unicode.org/reports/tr26/
Latest Proposed Update http://www.unicode.org/reports/tr26/proposed.html
Version Revision 4


Summary

This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU) that is intended for internal use within systems processing Unicode. It provides an ASCII-compatible 8-bit encoding that is similar to UTF-8 but preserves UTF-16 binary collation. It is neither intended nor recommended as an encoding for open information exchange. The Unicode Consortium does not encourage the use of CESU-8, but does recognize the existence of data in this encoding and supplies this technical report to clearly define the format and to distinguish it from UTF-8. This encoding does not replace or amend the definition of UTF-8.

Status

This is a draft document which may be updated, replaced, or superseded by other documents at any time. Publication does not imply endorsement by the Unicode Consortium. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.

A Unicode Technical Report (UTR) contains informative material. Conformance to the Unicode Standard does not imply conformance to any UTR. Other specifications, however, are free to make normative references to a UTR.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].


1 Introduction

CESU-8 defines an encoding scheme for Unicode identical to UTF-8 except for its representation of supplementary characters. In CESU-8, supplementary characters are represented as six-byte sequences resulting from the transformation of each UTF-16 surrogate code unit into an eight-bit form similar to the UTF-8 transformation, but without first converting the input surrogate pairs to a scalar value.

CESU-8 is useful in 8-bit processing environments where binary collation with UTF-16 is required. It is designed and recommended for use only within products requiring this UTF-16 binary collation equivalence. It is not intended nor recommended for open interchange. 
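
The collation property can be illustrated with a minimal Python sketch. It is not part of this report: the helper names `utf16_units` and `cesu8` are hypothetical, and `cesu8` leans on Python's `surrogatepass` error handler as a shortcut for encoding each UTF-16 code unit separately. The sketch compares a BMP character above the surrogate range (U+FF5E) with a supplementary character (U+10000).

```python
def utf16_units(text: str) -> list[int]:
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def cesu8(text: str) -> bytes:
    """Hypothetical helper: encode each UTF-16 code unit on its own.

    Surrogate code units are passed through the UTF-8 bit distribution
    by Python's "surrogatepass" error handler, giving three bytes each.
    """
    return b"".join(
        chr(unit).encode("utf-8", "surrogatepass") for unit in utf16_units(text)
    )


bmp = "\uff5e"                # BMP character above the surrogate range
supplementary = "\U00010000"  # first supplementary character
samples = [bmp, supplementary]

by_utf16 = sorted(samples, key=utf16_units)
by_cesu8 = sorted(samples, key=cesu8)
by_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))

# CESU-8 byte order agrees with UTF-16 code unit order; UTF-8 byte order does not.
assert by_cesu8 == by_utf16 == [supplementary, bmp]
assert by_utf8 == [bmp, supplementary]
```

In UTF-16 the supplementary character starts with the code unit D800, which sorts below FF5E. Its CESU-8 form starts with the byte ED, which sorts below EF, whereas its UTF-8 form starts with F0, which sorts above EF.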

The following lists the important features of this encoding form: 

Because only a very small percentage of characters in a typical data stream are expected to be supplementary characters, there is a strong possibility that CESU-8 data may be misinterpreted as UTF-8. Therefore, all use of CESU-8 outside closed implementations, such as emitting CESU-8 in output files, markup languages, or other open transmission forms, is strongly discouraged.

2 Definitions

The following define the CESU-8 encoding scheme. CESU-8 is not a normative part of the Unicode Standard, and therefore the definitions below do not form part of the standard. Instead, they are encapsulated in this Unicode Technical Report as an implementation-specific encoding scheme for use by implementers of the Unicode Standard. 

2.1 Encoding

CESU-8 is a Compatibility Encoding Scheme for UTF-16 (CESU) that serializes a Unicode code point as a sequence of one, two, three or six bytes. 

  1. Prior to transforming data into CESU-8, supplementary characters must first be converted to their surrogate pair UTF-16 representation. For example, U+F0000 must first be converted to U+DB80 U+DC00.

  2. The resulting data stream is encoded into an eight-bit form using the bit distribution table in Section 2.2. This bit distribution table is identical to that of UTF-8, except that the input value is a sequence of UTF-16 code units rather than a scalar value, and that the four-byte transformation is disallowed.

  3. The bit pattern 1111xxxx is illegal in any CESU-8 byte, which prohibits UTF-8 four-byte sequences for supplementary characters from occurring in CESU-8. Thus, a data stream may not contain both CESU-8 six-byte and UTF-8 four-byte supplementary character sequences.

  4. The shortest form rules applied to UTF-8 in the Unicode Standard Definition D36 also apply to CESU-8.

CESU-8 encoding example: 

In CESU-8, <U+004D, U+0061, U+F0000> is serialized as
<4D 61 ED AE 80 ED B0 80>
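
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the names `to_utf16_units` and `encode_cesu8` are hypothetical, and the second step borrows Python's UTF-8 codec with the `surrogatepass` error handler, which applies the three-byte bit distribution to a lone surrogate code unit.

```python
def to_utf16_units(text: str) -> list[int]:
    """Step 1: replace each supplementary character by its surrogate pair."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # low (trail) surrogate
        else:
            units.append(cp)
    return units


def encode_cesu8(text: str) -> bytes:
    """Step 2: encode every UTF-16 code unit as one, two, or three bytes."""
    out = bytearray()
    for unit in to_utf16_units(text):
        # Assumption: "surrogatepass" lets the UTF-8 codec encode a
        # surrogate code unit with the ordinary three-byte pattern.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


encoded = encode_cesu8("Ma\U000F0000")
print(encoded.hex(" ").upper())  # 4D 61 ED AE 80 ED B0 80
```

Because every output byte comes from a one-, two-, or three-byte form, no byte of the result has the bit pattern 1111xxxx.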

2.2 CESU-8 Bit Distribution

UTF-16 Code Unit    1st Byte    2nd Byte    3rd Byte
000000000xxxxxxx    0xxxxxxx
00000yyyyyxxxxxx    110yyyyy    10xxxxxx
zzzzyyyyyyxxxxxx    1110zzzz    10yyyyyy    10xxxxxx
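
The bit distribution can also be written out directly with shifts and masks. The following Python sketch is illustrative only, and the function name `encode_code_unit` is hypothetical. It takes a single 16-bit UTF-16 code unit and returns the one, two, or three bytes given by the bit distribution, always choosing the shortest form.

```python
def encode_code_unit(unit: int) -> bytes:
    """Apply the CESU-8 bit distribution to one UTF-16 code unit."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit UTF-16 code unit")
    if unit <= 0x007F:
        # 000000000xxxxxxx -> 0xxxxxxx
        return bytes([unit])
    if unit <= 0x07FF:
        # 00000yyyyyxxxxxx -> 110yyyyy 10xxxxxx
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    # zzzzyyyyyyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


print(encode_code_unit(0x004D).hex())  # 4d
print(encode_code_unit(0x20AC).hex())  # e282ac
print(encode_code_unit(0xDB80).hex())  # edae80
```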

 

3 Identification of CESU-8

Data encoded in CESU-8 should only be exchanged when it is labeled as such in a higher-level protocol or is agreed upon in an API definition. It should not be auto-detected. Use of this encoding in the absence of encoding tags or a higher-level protocol describing the encoding is invalid and strongly discouraged. See also Section 5, IANA Registration.

NOTE: Because CESU-8 and UTF-8 are so similar in structure, implementers need to take stronger than usual precautions to ensure that CESU-8 data are not inadvertently misidentified as UTF-8 and vice versa. See also Section 4, Relation to ISO/IEC 10646 and UTF-8.
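
One way to take such precautions is to decode strictly according to the externally supplied label and to reject input that does not match it. The Python sketch below is an illustration under stated assumptions, not a detection method: the function name `decode_cesu8` is hypothetical, and it relies on Python's `surrogatepass` error handler to read three-byte surrogate sequences before pairing them up.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode data that a higher-level protocol has labeled as CESU-8."""
    # Bytes of the form 1111xxxx are illegal in CESU-8, so UTF-8
    # four-byte sequences are rejected up front.
    if any(byte >= 0xF0 for byte in data):
        raise ValueError("byte pattern 1111xxxx is illegal in CESU-8")
    # Read one-, two-, and three-byte sequences; surrogates stay unpaired.
    units = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 to combine surrogate pairs into
    # supplementary characters; an unpaired surrogate raises an error.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


cesu8_bytes = bytes.fromhex("4d61edae80edb080")
assert decode_cesu8(cesu8_bytes) == "Ma\U000F0000"

# A strict UTF-8 decoder refuses the same bytes, so CESU-8
# supplementary sequences mislabeled as UTF-8 are caught.
try:
    cesu8_bytes.decode("utf-8")
except UnicodeDecodeError:
    print("not valid UTF-8")
```

Data that contains only BMP characters is byte-for-byte identical in both encodings, so a check of this kind cannot replace the label.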

4 Relation to ISO/IEC 10646 and UTF-8

ISO/IEC 10646 and the Unicode Standard define the UTF-8 encoding form, which is very similar in definition to CESU-8 other than its treatment of supplementary characters. CESU-8 is a different encoding scheme. It does not form part of either ISO/IEC 10646 or the Unicode Standard. It is intended only for use in compatibility situations where binary collation with UTF-16 is required.

5 IANA Registration

CESU-8 has been registered in the Internet Assigned Numbers Authority (IANA) Character Set registry with the following properties:

Name: CESU-8
MIBenum: 1016
Alias: csCESU-8

Note: CESU-8 was originally proposed and discussed with the name UTF-8S, but was renamed CESU-8 to avoid possible confusion with UTF-8.

Acknowledgements

Toby Phipps was the author of earlier versions of this report. Thanks to Jianping Yang, Nobuyoshi Mori, Asmus Freytag, Markus Scherer and Kent Karlsson for their feedback and input on this document.

References

[FAQ] Unicode Frequently Asked Questions
http://www.unicode.org/faq/
For answers to common questions on technical issues.
[Feedback] Reporting Errors and Requesting Information Online
http://www.unicode.org/reporting.html
[Glossary] Unicode Glossary
http://www.unicode.org/glossary/
For explanations of terminology used in this and other documents.
[Reports] Unicode Technical Reports
http://www.unicode.org/reports/
For information on the status and development process for technical reports, and for a list of technical reports.
[Unicode] The Unicode Standard
For the latest version, see:
http://www.unicode.org/versions/latest/
For the 6.0.0 version, see:
http://www.unicode.org/versions/Unicode6.0.0/
[Versions] Versions of the Unicode Standard
http://www.unicode.org/standard/versions/
For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.

Modifications

The following summarizes modifications from the previous version of this document.

Revision 4 [RM]

Revision 3