L2/01-328R3
Proposed Draft Unicode Technical Report #26
Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)
Summary
This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16 (CESU)
that is intended as an alternate encoding to UTF-8 for internal use within
systems processing Unicode, in order to provide an ASCII-compatible 8-bit
encoding that preserves UTF-16 binary collation. It is neither intended nor
recommended as an encoding for open information exchange. The Unicode
Consortium does not encourage the use of CESU-8, but does recognize the
existence of data in this encoding and supplies this Technical Report to clearly
define the format and to distinguish it from UTF-8. This encoding does not
replace or amend the definition of UTF-8.
Status
This
document has been approved by the Unicode Technical Committee for public review
as a Proposed Draft Unicode Technical Report. Publication does not imply
endorsement by the Unicode Consortium. This is a draft document which may be
updated, replaced, or superseded by other documents at any time. This is not a
stable document; it is inappropriate to cite this document as other than a work
in progress.
A list of current Unicode Technical Reports is found on
http://www.unicode.org/unicode/reports/. For more information about versions
of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
Please mail corrigenda and other comments to the author(s).
Contents
1 Introduction
CESU-8
defines an encoding scheme for Unicode identical to UTF-8 except for its
representation of supplementary characters.
In CESU-8, supplementary characters are represented as six-byte
sequences resulting from the transformation of each UTF-16 surrogate code unit
into an eight-bit form similar to the UTF-8 transformation, but without first
converting the input surrogate pairs to a scalar value.
CESU-8 is useful in 8-bit processing environments where binary collation with
UTF-16 is required. It is designed and recommended for use only within products
requiring this UTF-16 binary collation equivalence. It is neither intended nor
recommended for open interchange.
The following
lists the important features of this encoding form:
- The CESU-8 representation of
characters on the Basic Multilingual Plane (BMP) is identical to the
representation of these characters in UTF-8. Only the representation of
supplementary characters differs.
- Only the six-byte form of
supplementary characters is legal in CESU-8; the four-byte UTF-8 style
supplementary character sequence is illegal.
- When supplementary characters
are present, a data stream can be unequivocally determined as being
encoded in UTF-8 or CESU-8 based on the representation of these
supplementary characters. When
encoding information is not present and encoding autodetection is
attempted, if the data stream consists of well-formed UTF-8 and does not
contain supplementary characters, it should always be detected as UTF-8,
not CESU-8 (even though the two encodings are identical when supplementary
characters are not present).
- A binary collation of data
encoded in CESU-8 is identical to the binary collation of the same data
encoded in UTF-16.
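The collation property in the last item can be illustrated with a minimal Python sketch. The helper `cesu8_encode` is a hypothetical name, not part of this report or of any library; it assumes the input string contains no unpaired surrogates and relies on Python's `surrogatepass` error handler to emit the three-byte form of each surrogate code unit.

```python
def cesu8_encode(text: str) -> bytes:
    """Sketch: UTF-8-style encoding of each UTF-16 code unit of `text`."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # "surrogatepass" lets Python emit the three-byte form of a surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# One ASCII letter, one high BMP character, one supplementary character.
words = ["\uff5e", "\U00010000", "a"]

by_utf16 = sorted(words, key=lambda w: w.encode("utf-16-be"))
by_cesu8 = sorted(words, key=cesu8_encode)
by_utf8 = sorted(words, key=lambda w: w.encode("utf-8"))

print(by_utf16 == by_cesu8)  # binary order agrees with UTF-16
print(by_utf16 == by_utf8)   # UTF-8 places U+10000 after U+FF5E instead
```

In this sample, U+10000 sorts before U+FF5E in both UTF-16 (D800 is less than FF5E) and CESU-8 (ED is less than EF), whereas in UTF-8 its lead byte F0 sorts after EF.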
Because only a very small percentage of characters in a typical data stream are
expected to be supplementary characters, there is a strong possibility that
CESU-8 data may be misinterpreted as UTF-8. Therefore, all use of CESU-8 outside
closed implementations, such as the emission of CESU-8 in output files, markup
language or other open transmission forms, is strongly discouraged.
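The way supplementary characters distinguish the two encodings can be sketched as follows. This is an illustration only, not a recommended detection procedure: the function name `supplementary_form` is hypothetical, the input is assumed to be otherwise well-formed, and no full validation is attempted.

```python
def supplementary_form(data: bytes) -> str:
    """Report which supplementary-character form first appears in `data`.

    Returns "utf-8" for a four-byte lead byte (11110xxx), "cesu-8" for an
    encoded high surrogate followed by an encoded low surrogate, and "none"
    when neither form occurs (the data is then identical in both encodings).
    """
    for i, byte in enumerate(data):
        if (byte & 0xF8) == 0xF0:
            return "utf-8"
        window = data[i:i + 6]
        if (
            len(window) == 6
            and window[0] == 0xED and 0xA0 <= window[1] <= 0xAF  # D800..DBFF
            and window[3] == 0xED and 0xB0 <= window[4] <= 0xBF  # DC00..DFFF
        ):
            return "cesu-8"
    return "none"


print(supplementary_form("Ma\U00010000".encode("utf-8")))
print(supplementary_form(b"Ma\xed\xa0\x80\xed\xb0\x80"))
print(supplementary_form("Ma\uff5e".encode("utf-8")))
```

The third call shows the risk described above: without supplementary characters the byte stream carries no evidence of which encoding produced it.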
2 Definitions
The following definitions specify the CESU-8 encoding scheme.
CESU-8 is not a normative part of The Unicode Standard, and therefore
the definitions below do not form part of the standard. Instead, they are encapsulated in this
Unicode Technical Report as an implementation-specific transformation form for
use by implementors of The Unicode Standard.
2.1
|
(a) CESU-8 is a Compatibility Encoding Scheme for UTF-16 (CESU) that
serializes a Unicode code point as a sequence of one, two, three or six
bytes.
(b) Prior to transforming data into CESU-8, supplementary characters must
first be converted to their surrogate pair UTF-16 representation. For example, U+F0000 must first be
converted to U+DB80 U+DC00.
(c) The resulting data stream is encoded into an eight-bit form using the bit
distribution table in definition 2.2. It should be noted that this bit
distribution table is identical to that of UTF-8 except that the input value
is a sequence of UTF-16 code units, not a scalar value, and that a four-byte
transformation is disallowed.
(d) The bit pattern 11110xxx is illegal in any CESU-8 byte, effectively
prohibiting the occurrence of UTF-8 four-byte supplementary character
sequences in CESU-8. Thus, a data stream may not contain both CESU-8 six-byte
and UTF-8 four-byte supplementary character sequences.
(e) The shortest form rules applied to UTF-8 in The Unicode Standard
Definition D36 also apply to CESU-8.
(f) Data encoded in CESU-8 should only be exchanged when it is labeled as
such in a higher-level protocol or is agreed upon in an API definition. It
should not be auto-detected. Use of this encoding in the absence of encoding
tags or a higher-level protocol describing the encoding is invalid and
strongly discouraged.
- CESU-8 encoding example:
In CESU-8, <U+004D, U+0061, U+10000> is serialized as <4D 61 ED
A0 80 ED B0 80>
|
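Step (b) of definition 2.1 can be sketched in Python as below. The function name `to_surrogate_pair` is hypothetical; the arithmetic is the standard UTF-16 surrogate pair computation.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Convert a supplementary code point to its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


high, low = to_surrogate_pair(0xF0000)
print(f"{high:04X} {low:04X}")  # DB80 DC00
```

Each of the two resulting code units is then encoded separately as three bytes, which gives the six-byte form.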
2.2
|
CESU-8 Bit Distribution
UTF-16 Code Unit | 1st Byte | 2nd Byte | 3rd Byte
000000000xxxxxxx | 0xxxxxxx |          |
00000yyyyyxxxxxx | 110yyyyy | 10xxxxxx |
zzzzyyyyyyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx
|
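The bit distribution in definition 2.2 can be applied to one UTF-16 code unit at a time, as in the following minimal sketch. The names `encode_code_unit` and `cesu8_from_code_units` are hypothetical, and the input is assumed to be a sequence of UTF-16 code units in which any supplementary characters have already been converted to surrogate pairs.

```python
def encode_code_unit(unit: int) -> bytes:
    """Encode one 16-bit UTF-16 code unit using the CESU-8 bit distribution."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if unit < 0x80:                        # 000000000xxxxxxx -> 0xxxxxxx
        return bytes([unit])
    if unit < 0x800:                       # 00000yyyyyxxxxxx -> 110yyyyy 10xxxxxx
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    return bytes([                         # zzzzyyyyyyxxxxxx -> three bytes
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_from_code_units(units) -> bytes:
    """Concatenate the encoded form of each code unit in order."""
    return b"".join(encode_code_unit(u) for u in units)


# <U+004D, U+0061, U+10000> as UTF-16 code units: 004D 0061 D800 DC00
encoded = cesu8_from_code_units([0x004D, 0x0061, 0xD800, 0xDC00])
print(encoded.hex(" ").upper())  # 4D 61 ED A0 80 ED B0 80
```

Because surrogate code units always fall in the third row of the table, no four-byte form is ever produced.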
3 Relation to ISO/IEC 10646 and UTF-8
ISO/IEC 10646
and The Unicode Standard define the UTF-8 encoding form, which is very similar
in definition to CESU-8 other than its treatment of supplementary characters. CESU-8
is an additional encoding scheme that supplements these definitions, but does
not form part of either ISO/IEC 10646 or The Unicode Standard. It is intended
only for use in compatibility situations where binary collation with UTF-16 is
required.
CESU-8 will
be registered with the Internet Assigned Numbers Authority. This section will be updated with the IANA
registered name.
Note: CESU-8 was originally proposed and discussed with the name
UTF-8S, but was renamed CESU-8 by recommendation from the Unicode Technical
Committee to avoid possible confusion with UTF-8.
The following
summarizes modifications from the previous version of this document.
1
|
- Created with amendments from
proposed draft as approved by UTC#88.
|
Copyright © 1999-2001 Unicode, Inc. All Rights Reserved.
The Unicode Consortium makes no expressed or implied warranty of
any kind, and assumes no liability for errors or omissions. No liability is
assumed for incidental and consequential damages in connection with or arising
out of the use of the information or programs contained or accompanying this
technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and
are registered in some jurisdictions.