L2/99-172

PROPOSED DRAFT Unicode Technical Report #19

UTF-32

Revision	2
Authors	Mark Davis ([email protected])
Date	1999-05-31
This Version	TBD
Previous Version	TBD
Latest Version	TBD
Unicode Technical Reports	http://www.unicode.org/unicode/reports/

Summary

This document specifies a four-byte Unicode Transformation Format. The document is in its initial phase and has not gone through the editing process. We welcome review feedback and suggestions on the content.

Status of this document

This document is an unpublished, preliminary working draft. It is posted for general review. At its next meeting, the Unicode Technical Committee (UTC) may reject this document, review it for suitability to progress to draft status, and/or amend it further. Please mail any comments to the authors.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations.

Introduction

Unicode is most commonly serialized using either the 8-bit form (UTF-8) or the 16-bit forms (UTF-16BE, UTF-16LE, or UTF-16). However, some applications may wish to use a 32-bit form, where each Unicode scalar value corresponds to a single 32-bit unit. Even those applications that do not use this form may want to convert to and from it for interoperability.

This document provides a specification of such an encoding form, called UTF-32. UTF-32 is very close to the UCS-4 encoding form defined in ISO 10646, but has some important differences:

- UTF-32 is restricted in values to the range 00000000..0010FFFF, which precisely matches the range of characters defined in Unicode (and other standards such as XML).
  - Although neither the Unicode Consortium nor ISO SC2/WG2 ever expects to assign characters above 10FFFF, UCS-4 formally allows values in the range 00000000..7FFFFFFF.
  - Moreover, in UCS-4 the code ranges 00E00000..00FFFFFF and 60000000..7FFFFFFF are available for private use.
- Over and above ISO 10646, the Unicode Standard adds a number of conformance constraints on character semantics (see The Unicode Standard, Version 2.0, Chapter 3). Declaring UTF-32 instead of UCS-4 allows implementations to explicitly commit to Unicode semantics.
- UTF-32 has explicitly named variants to account for differences in endianness on different platforms. These correspond to the forms of UTF-16.
- Notationally, the term "UTF-32" is parallel to "UTF-16" and "UTF-8", avoiding some confusion among software developers (especially since the pronunciations of "UTF" and "UCS" are so similar).
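
The range difference between UTF-32 and UCS-4 can be illustrated with a minimal sketch. The constants come from the ranges quoted above; the function and constant names are hypothetical and are not defined by this document or by ISO 10646.

```python
# Illustrative only: names are assumptions, ranges are those quoted in the text.
UTF32_MAX = 0x0010FFFF   # highest value permitted in UTF-32
UCS4_MAX = 0x7FFFFFFF    # highest value formally permitted in UCS-4


def in_utf32_range(value: int) -> bool:
    """True if value lies in 00000000..0010FFFF."""
    return 0 <= value <= UTF32_MAX


def in_ucs4_range(value: int) -> bool:
    """True if value lies in 00000000..7FFFFFFF."""
    return 0 <= value <= UCS4_MAX


def in_ucs4_private_use(value: int) -> bool:
    """True if value lies in one of the two UCS-4 private-use ranges named above."""
    return (0x00E00000 <= value <= 0x00FFFFFF
            or 0x60000000 <= value <= 0x7FFFFFFF)
```

Under these assumptions, a value such as 00E00000 is acceptable as UCS-4 private use but falls outside the UTF-32 range.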

Definitions

D1	A Unicode Transformation Format (UTF) is a mapping from each Unicode character sequence to a unique sequence of code values. These code values are particular units of computer storage specified by the transformation format, typically bytes. Any sequence of code values that would correspond to a scalar value above 10FFFF₁₆ is illegal.
D2	UTF-32BE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in Big Endian Format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space. In UTF-32BE, <0061 D808 DF45> is serialized as <00 00 00 61 00 01 23 45>
D3	UTF-32LE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in Little Endian Format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space. In UTF-32LE, <0061 D808 DF45> is serialized as <61 00 00 00 45 23 01 00>
D4	UTF-32 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in either Big Endian or Little Endian Format. An initial sequence corresponding to U+FEFF is interpreted as a byte order mark, and is used to distinguish between the two byte orders for the rest of the text. It is not considered part of the content of the text. A serialization of Unicode values into UTF-32 may or may not begin with a byte order mark. In UTF-32, <0061 D808 DF45> is serialized as: <00 00 FE FF 00 00 00 61 00 01 23 45>, <FF FE 00 00 61 00 00 00 45 23 01 00>, or <00 00 00 61 00 01 23 45>
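
The serializations in D2 through D4 can be reproduced with a short sketch. It first combines the UTF-16 code values <0061 D808 DF45> into scalar values (the surrogate pair D808 DF45 yields 12345₁₆), then writes each scalar value as four bytes. The function names and the `big_endian` and `bom` parameters are hypothetical, chosen for illustration; rejecting unpaired surrogates is an assumption of this sketch rather than a requirement stated here.

```python
import struct


def utf16_to_scalars(units):
    """Combine 16-bit code values into scalar values (surrogate pair -> one value)."""
    scalars = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: a low surrogate must follow (assumption: else error).
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            scalars.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            scalars.append(u)
            i += 1
    return scalars


def encode_utf32(scalars, big_endian=True, bom=False):
    """Serialize scalar values as four bytes each, optionally preceded by U+FEFF."""
    fmt = ">I" if big_endian else "<I"
    values = ([0xFEFF] if bom else []) + list(scalars)
    out = bytearray()
    for value in values:
        if not 0 <= value <= 0x10FFFF:
            raise ValueError("value outside 00000000..0010FFFF")
        out += struct.pack(fmt, value)
    return bytes(out)


scalars = utf16_to_scalars([0x0061, 0xD808, 0xDF45])
print(encode_utf32(scalars, big_endian=True).hex(" "))
print(encode_utf32(scalars, big_endian=False).hex(" "))
print(encode_utf32(scalars, big_endian=True, bom=True).hex(" "))
```

Calling `encode_utf32` with `bom=False` corresponds to UTF-32BE or UTF-32LE as defined in D2 and D3; with `bom=True` it corresponds to the first two UTF-32 forms listed in D4.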

Conformance

When a process interprets a byte sequence in a Unicode Transformation Format, it shall interpret that byte sequence in accordance with the character semantics established by the Unicode Standard for the corresponding Unicode character sequence.

When a process generates data in a Unicode Transformation Format, it shall not emit illegal byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition.
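
As an illustration of the interpreting side, the following sketch decodes a byte sequence and raises an error for any four-byte unit above 10FFFF₁₆. The function name `decode_utf32` and its `scheme` parameter are hypothetical. Two behaviors are assumptions of this sketch, not statements of this document: a length that is not a multiple of four is treated as an error, and unmarked "UTF-32" input is read as Big Endian (which matches the third serialization shown in D4). Values in the surrogate range are not checked here.

```python
import struct

BOM_BE = b"\x00\x00\xfe\xff"
BOM_LE = b"\xff\xfe\x00\x00"


def decode_utf32(data: bytes, scheme: str = "UTF-32"):
    """Return the list of scalar values in data, or raise ValueError if illegal."""
    if len(data) % 4 != 0:
        # Assumption: a truncated final unit counts as an illegal sequence.
        raise ValueError("length is not a multiple of four bytes")
    if scheme == "UTF-32BE":
        fmt, start = ">I", 0          # initial U+FEFF is kept as content
    elif scheme == "UTF-32LE":
        fmt, start = "<I", 0          # initial U+FEFF is kept as content
    elif scheme == "UTF-32":
        if data[:4] == BOM_BE:
            fmt, start = ">I", 4      # byte order mark is not part of the content
        elif data[:4] == BOM_LE:
            fmt, start = "<I", 4
        else:
            fmt, start = ">I", 0      # assumption: no mark means Big Endian
    else:
        raise ValueError("unknown scheme: " + scheme)
    scalars = []
    for offset in range(start, len(data), 4):
        (value,) = struct.unpack_from(fmt, data, offset)
        if value > 0x10FFFF:
            raise ValueError("illegal value above 10FFFF at byte offset %d" % offset)
        scalars.append(value)
    return scalars
```

Raising an exception is one way to treat illegal input as an error condition; a process could equally report the error through a return code or a callback.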

Copyright

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.