DRAFT Unicode Technical Report #19
Interoperable 32-bit Serialization

Revision	4
Authors	Mark Davis
Date	1999-06-21
This Version	http://www.unicode.org/unicode/reports/tr19/tr19-4.html
Previous Version	n/a
Latest Version	http://www.unicode.org/unicode/reports/tr19
Unicode Technical Reports	http://www.unicode.org/unicode/reports/

Summary

This document specifies a four-byte Unicode serialization format. The document is in an initial phase and has not yet gone through the editing process. We welcome review feedback and suggestions on the content.

Status of this document

This document has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative, or as normative specification. Please mail corrigenda and other comments to Unicore@Unicode.org.

Introduction

The preferred serialization forms for Unicode text are the 16-bit forms (UTF-16BE, UTF-16LE, or UTF-16). The 8-bit form (UTF-8) is recommended for systems that do not permit the use of 16-bit forms. However, some applications may wish to use a 32-bit form, where each Unicode scalar value corresponds to a single 32-bit unit. Even those applications that do not use this form may want to convert to and from it for interoperability.

This document provides a specification of such an encoding form. The working name for this encoding form is UTF-32. It is very close to the UCS-4 encoding form defined in ISO 10646, but has some important differences.

- UTF-32 is restricted in values to the range 0..10FFFF₁₆, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML).
  - Although neither the Unicode Consortium nor JTC1/SC2/WG2 ever expects to assign characters above 10FFFF₁₆, UCS-4 does not formally restrict the assignment of future characters above that limit.
  - In UCS-4, the code ranges E00000₁₆..FFFFFF₁₆ and 60000000₁₆..7FFFFFFF₁₆ are defined for private use and are legal in interchange.
- Over and above ISO 10646, the Unicode Standard adds a number of conformance constraints on character semantics (see The Unicode Standard, Version 2.0, Chapter 3). Declaring UTF-32 instead of UCS-4 allows implementations to explicitly commit to Unicode semantics.
- UTF-32 has explicitly named variants to account for differences in endianness on different platforms. These correspond to the forms of UTF-16.
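
The range difference between UTF-32 and UCS-4 described above can be illustrated with a small sketch. The names `UTF32_MAX`, `UCS4_PRIVATE_USE_RANGES`, `in_utf32_range`, and `is_ucs4_private_use` are hypothetical and chosen only for this example; they are not defined by this report or by ISO 10646.

```python
# Hypothetical names for illustration; not part of this report.
UTF32_MAX = 0x10FFFF

# The UCS-4 private-use ranges cited above, which are legal in UCS-4 interchange.
UCS4_PRIVATE_USE_RANGES = (
    (0xE00000, 0xFFFFFF),
    (0x60000000, 0x7FFFFFFF),
)


def in_utf32_range(value: int) -> bool:
    """Return True if value lies in the UTF-32 range 0..10FFFF (hex)."""
    return 0 <= value <= UTF32_MAX


def is_ucs4_private_use(value: int) -> bool:
    """Return True if value falls in one of the UCS-4 private-use ranges above."""
    return any(low <= value <= high for low, high in UCS4_PRIVATE_USE_RANGES)
```

Under these assumptions, a value such as E00000₁₆ is a private-use value in UCS-4 but lies outside the UTF-32 range, so a process that declares UTF-32 would not accept it.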

(Notationally, the term "UTF-32" is parallel to "UTF-16" and "UTF-8", avoiding some confusion among software developers--especially since the pronunciations of "UTF" and "UCS" are so very similar.)

Definitions

Definition D29 is copied from The Unicode Standard, Version 3.0 (in progress). It is repeated here so that the UTF-32 definitions can follow the same style.

D29

A Unicode transformation format (UTF) transforms each Unicode scalar value into a unique sequence of code values. A UTF may specify a byte order for the serialization of the code values into bytes. A UTF may also specify the use of a byte order mark.

Code values are particular units of computer storage specified by the transformation format, for example, 16-bit integers or bytes. In the latter case, a code value sequence can be referred to as a byte sequence.
Any sequence of code values that would correspond to a scalar value above 10FFFF₁₆ is illegal.

Since every Unicode coded character sequence maps to a unique sequence of code values in a given UTF, a reverse mapping can be derived. Thus every UTF supports lossless round-trip transcoding: mapping from any Unicode coded character sequence S to a sequence of code values and back will produce S again. To ensure round-trip transcoding, a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include FFFE₁₆, FFFF₁₆, and unpaired surrogates.
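
The round-trip property in D29 can be sketched as follows, assuming a big-endian four-byte serialization. The function names `encode_utf32be` and `decode_utf32be` are hypothetical. The sketch works on integer values rather than Python strings so that the invalid values named above (FFFE₁₆, FFFF₁₆, and unpaired surrogates) can be passed through unchanged, while anything above 10FFFF₁₆ is rejected.

```python
def encode_utf32be(scalars):
    """Map each value to four big-endian bytes (hypothetical helper)."""
    out = bytearray()
    for value in scalars:
        # Values above 10FFFF (hex) have no legal code value sequence.
        if not 0 <= value <= 0x10FFFF:
            raise ValueError(f"value out of range: {value:#x}")
        out += value.to_bytes(4, "big")
    return bytes(out)


def decode_utf32be(data):
    """Reverse mapping: read four big-endian bytes per value."""
    if len(data) % 4 != 0:
        raise ValueError("length is not a multiple of four bytes")
    scalars = []
    for i in range(0, len(data), 4):
        value = int.from_bytes(data[i:i + 4], "big")
        if value > 0x10FFFF:
            raise ValueError(f"illegal code value sequence: {value:#x}")
        scalars.append(value)
    return scalars


# FFFE, FFFF, and the unpaired surrogate D800 survive the round trip.
sample = [0x0061, 0xFFFE, 0xFFFF, 0xD800, 0x12345]
assert decode_utf32be(encode_utf32be(sample)) == sample
```

Because each value maps to exactly one four-byte sequence, the reverse mapping is unambiguous, which is what makes the round trip lossless in this sketch.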

The definitions D37..D39 provide for the three forms of UTF-32:

D37	UTF-32BE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in Big Endian Format. An initial sequence corresponding to 0000FEFF₁₆ is interpreted as a zero width no-break space. In UTF-32BE, <0061 D808 DF45> is serialized as <00 00 00 61 00 01 23 45>.
D38	UTF-32LE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in Little Endian Format. An initial sequence corresponding to 0000FEFF₁₆ is interpreted as a zero width no-break space. In UTF-32LE, <0061 D808 DF45> is serialized as <61 00 00 00 45 23 01 00>.
D39	UTF-32 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in either Big Endian or Little Endian Format. An initial byte sequence corresponding to 0000FEFF₁₆ is interpreted as a byte order mark, and is used to distinguish between the two byte orders for the rest of the text. The byte order mark is not considered part of the content of the text. A serialization of Unicode values into UTF-32 may or may not begin with a byte order mark. In UTF-32, <0061 D808 DF45> is serialized as any of the following: <00 00 FE FF 00 00 00 61 00 01 23 45>, <FF FE 00 00 61 00 00 00 45 23 01 00>, or <00 00 00 61 00 01 23 45>.
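
The examples in the definitions above can be reproduced with a short sketch. The input <0061 D808 DF45> is read here as a sequence of 16-bit code values in which D808 DF45 is a surrogate pair for the scalar value 12345₁₆. All function names are hypothetical. Treating unmarked UTF-32 input as big-endian is an assumption of this sketch, chosen because it matches the third serialization listed in D39.

```python
def utf16_to_scalars(units):
    """Combine surrogate pairs into scalar values; pass other values through."""
    scalars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate.
            scalars.append(0x10000 + ((unit - 0xD800) << 10)
                           + (units[i + 1] - 0xDC00))
            i += 2
        else:
            scalars.append(unit)
            i += 1
    return scalars


def serialize_utf32(scalars, byteorder, with_bom=False):
    """Serialize as four bytes per value; byteorder is "big" or "little"."""
    values = ([0xFEFF] if with_bom else []) + list(scalars)
    return b"".join(v.to_bytes(4, byteorder) for v in values)


def read_utf32(data):
    """Read the UTF-32 form of D39, using an initial byte order mark if present."""
    byteorder, start = "big", 0  # assumption: big-endian when unmarked
    if data[:4] == b"\x00\x00\xFE\xFF":
        start = 4
    elif data[:4] == b"\xFF\xFE\x00\x00":
        byteorder, start = "little", 4
    # The byte order mark is skipped; it is not part of the content.
    return [int.from_bytes(data[i:i + 4], byteorder)
            for i in range(start, len(data), 4)]


scalars = utf16_to_scalars([0x0061, 0xD808, 0xDF45])
assert scalars == [0x0061, 0x12345]
assert serialize_utf32(scalars, "big") == bytes.fromhex("0000006100012345")
assert serialize_utf32(scalars, "little") == bytes.fromhex("6100000045230100")
assert read_utf32(serialize_utf32(scalars, "little", with_bom=True)) == scalars
```

Note that `read_utf32` applies only to the UTF-32 form. In UTF-32BE and UTF-32LE an initial 0000FEFF₁₆ is a zero width no-break space and would be kept as content.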

Conformance

When a process interprets a byte sequence in a Unicode Transformation Format, it shall interpret that byte sequence in accordance with the character semantics established by the Unicode Standard for the corresponding Unicode character sequence.

Copyright

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.