L2/99-172

PROPOSED DRAFT Unicode Technical Report #19

UTF-32

Revision 2
Authors Mark Davis (mark@unicode.org)
Date 1999-05-31
This Version TBD
Previous Version TBD
Latest Version TBD
Unicode Technical Reports http://www.unicode.org/unicode/reports/

Summary

This document specifies a four-byte Unicode Transformation Format. It is at an initial stage and has not yet gone through the editing process. Review feedback and suggestions on the content are welcome.

Status of this document

This document is an unpublished, preliminary working draft, posted for general review. At its next meeting, the Unicode Technical Committee (UTC) may reject this document, review it for suitability to progress to draft status, and/or further amend it. Please mail any comments to the authors.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations.

Introduction

Unicode is most commonly serialized using either the 8-bit form (UTF-8) or the 16-bit forms (UTF-16BE, UTF-16LE, or UTF-16). However, some applications may wish to use a 32-bit form, where each Unicode scalar value corresponds to a single 32-bit unit. Even those applications that do not use this form may want to convert to and from it for interoperability.

This document specifies such an encoding form, called UTF-32. UTF-32 is very close to the UCS-4 encoding form defined in ISO 10646, but has some important differences.

Definitions

D1 A Unicode Transformation Format (UTF) is a mapping from each Unicode character sequence to a unique sequence of code values. These code values are particular units of computer storage specified by the transformation format, typically bytes. Any sequence of code values that would correspond to a scalar value above 10FFFF₁₆ is illegal.
D2 UTF-32BE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in Big Endian Format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space.
  • In UTF-32BE, the UTF-16 sequence <0061 D808 DF45> (scalar values U+0061 and U+12345) is serialized as <00 00 00 61 00 01 23 45>
D3 UTF-32LE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in Little Endian Format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space.
  • In UTF-32LE, <0061 D808 DF45> is serialized as <61 00 00 00 45 23 01 00>
D4 UTF-32 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in either Big Endian or Little Endian Format. An initial sequence corresponding to U+FEFF is interpreted as a byte order mark, and is used to determine which of the two byte orders applies to the rest of the text. It is not considered part of the content of the text. A serialization of Unicode values into UTF-32 may or may not begin with a byte order mark.
  • In UTF-32, <0061 D808 DF45> is serialized as:
    • <00 00 FE FF 00 00 00 61 00 01 23 45>,
    • <FF FE 00 00 61 00 00 00 45 23 01 00>, or
    • <00 00 00 61 00 01 23 45>
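
The serializations in D2 through D4 can be reproduced with a short sketch. The following Python is illustrative only and is not part of the specification; the function names utf16_to_scalars and encode_utf32, and the big_endian and bom parameters, are hypothetical. It assumes the input is given as UTF-16 code units, as in the examples above, and that unpaired surrogates are rejected (a policy this document does not state).

```python
import struct

def utf16_to_scalars(units):
    """Combine UTF-16 code units into Unicode scalar values.

    Assumption: unpaired surrogates are rejected; the text above does not
    say how they should be handled.
    """
    scalars = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            scalars.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            scalars.append(u)
            i += 1
    return scalars

def encode_utf32(scalars, big_endian=True, bom=False):
    """Serialize scalar values as four bytes each, in the chosen byte order.

    With bom=True, U+FEFF is written first in the same byte order, which
    corresponds to the first two UTF-32 examples in D4.
    """
    fmt = ">I" if big_endian else "<I"
    out = bytearray()
    if bom:
        out += struct.pack(fmt, 0xFEFF)
    for s in scalars:
        if s > 0x10FFFF:
            raise ValueError("scalar value above 10FFFF")
        out += struct.pack(fmt, s)
    return bytes(out)

scalars = utf16_to_scalars([0x0061, 0xD808, 0xDF45])
print([hex(s) for s in scalars])
print(encode_utf32(scalars).hex(" "))
print(encode_utf32(scalars, big_endian=False).hex(" "))
print(encode_utf32(scalars, bom=True).hex(" "))
```

Calling encode_utf32 with big_endian=True and bom=False corresponds to UTF-32BE, and big_endian=False with bom=False corresponds to UTF-32LE; the bom=True variants are meaningful only for the UTF-32 form of D4.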

Conformance

When a process interprets a byte sequence in a Unicode Transformation Format, it shall interpret that byte sequence in accordance with the character semantics established by the Unicode Standard for the corresponding Unicode character sequence.

When a process generates data in a Unicode Transformation Format, it shall not emit illegal byte sequences. When a process interprets data in a Unicode Transformation Format, it shall treat illegal byte sequences as an error condition.


Copyright

Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.