Network Working Group                                       D. Goldsmith
Internet Draft                                      Apple Computer, Inc.
Expires: 7 March 1999                                   7 September 1998
Will obsolete: RFC 1641

                        Using Unicode with MIME

Status of this Memo

   This document is an Internet-Draft. Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its areas,
   and its working groups. Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months. Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time. It is not appropriate to use
   Internet-Drafts as reference material or to cite them other than as
   a "working draft" or "work in progress".

   To learn the current status of any Internet-Draft, please check the
   1id-abstracts.txt listing contained in the Internet-Drafts Shadow
   Directories on ds.internic.net (US East Coast), nic.nordu.net
   (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific
   Rim).

   Distribution of this document is unlimited. Please send comments to
   the author at .

   This document is intended to become an informational RFC.

Abstract

   The Unicode Standard, version 2.0, and ISO/IEC 10646-1:1993(E) (as
   amended) jointly define a character set (hereafter referred to as
   Unicode) which encompasses most of the world's writing systems.
   However, Internet mail (STD 11, RFC 822) supports only 7-bit US
   ASCII as a character set. MIME (RFC 2045 through 2049) extends
   Internet mail to support different media types and character sets,
   and thus could support Unicode in mail messages. MIME neither
   defines Unicode as a permitted character set nor specifies how it
   would be encoded, although it does provide for the registration of
   additional character sets over time.

   This document describes an approach to transmitting the UTF-16
   transformation format of Unicode. UTF-16 is a format which includes
   16 bit Unicode characters, as well as an encoding of characters
   outside the Basic Multilingual Plane.
Motivation

   In contexts where compatibility with existing software is important,
   other transformation formats of Unicode are available and are
   already registered with IANA, namely UTF-8 and UTF-7. However, both
   of these formats impose a space penalty for text that is not in
   ASCII. Using the original 16 bit form of Unicode provides a way to
   transmit character data that is uniform in size. Some products and
   network standards already specify 16 bit Unicode, but there is no
   suitable IANA registration.

Background

   UTF-16 is defined in "The Unicode Standard, Version 2.0". Briefly,
   Unicode characters in the Basic Multilingual Plane (plane 0) are
   encoded directly as 16 bit entities. Characters outside of plane 0
   can be encoded using a pair of 16 bit entities, from the "surrogate"
   range of plane 0. The details are beyond the scope of this document
   and may be found in the above referenced standard. However, these
   are 16 bit entities, not a sequence of octets, and the standard
   allows either big-endian or little-endian order. This document
   addresses the need to disambiguate these two cases.

Definitions

   First, the definition of Unicode:

      The 16 bit character set Unicode is defined by "The Unicode
      Standard, Version 2.0". This character set is identical with the
      character repertoire and coding of the international standard
      ISO/IEC 10646-1:1993(E); Coded Representation Form=UCS-2;
      Subset=300; Implementation Level=3, including the first 7
      amendments to 10646 plus editorial corrections.

   Note. Unicode 2.0 further specifies the use and interaction of these
   character codes beyond the ISO standard. However, any valid 10646
   sequence is a valid Unicode sequence, and vice versa; Unicode
   supplies interpretations of sequences on which the ISO standard is
   silent as to interpretation.

   Note. Unicode has since been amended to version 2.1, by Unicode
   Technical Report #8 [UNICODE 2.1]. The standard will continue to
   evolve over time.
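
   As an illustration of the Background discussion, the following is a
   minimal sketch, not part of this memo's specification, of how one
   character becomes one or two 16 bit values and how those values can
   then be written as octets in either byte order. The function names
   to_utf16_values and serialize are hypothetical, and the surrogate
   arithmetic follows the UTF-16 definition in The Unicode Standard,
   which this memo defers to for details.

```python
import struct


def to_utf16_values(code_point):
    """Return the list of 16 bit UTF-16 values for one code point.

    Hypothetical helper for illustration only.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values do not encode characters")
    if code_point < 0x10000:
        # Plane 0 characters are encoded directly as one 16 bit value.
        return [code_point]
    # Characters outside plane 0 become a pair from the surrogate range.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


def serialize(values, big_endian):
    """Write each 16 bit value as two octets in the chosen byte order."""
    prefix = ">" if big_endian else "<"
    return struct.pack(prefix + "H" * len(values), *values)


if __name__ == "__main__":
    values = to_utf16_values(0x10400)  # a character outside plane 0
    print(serialize(values, big_endian=True).hex())   # d801dc00
    print(serialize(values, big_endian=False).hex())  # 01d800dc
```

   The same two 16 bit values yield different octet sequences depending
   on the byte order chosen, which is the ambiguity a receiver faces
   when the order is not labelled.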
UTF-16BE and UTF-16LE Definitions

   UTF-16BE and UTF-16LE both correspond to the UTF-16 transformation
   format of Unicode, as specified in "The Unicode Standard, Version
   2.0". UTF-16BE serializes the octets which make up a single 16-bit
   UTF-16 value in big-endian order. UTF-16LE serializes the octets
   which make up a single 16-bit UTF-16 value in little-endian order.

   The Unicode character ZERO WIDTH NON-BREAKING SPACE (U+FEFF) does
   not override the byte order for either of these specifications.
   UTF-16BE values must be big-endian, and UTF-16LE values must be
   little-endian.

   To encode UTF-16 as either UTF-16BE or UTF-16LE, the UTF-16 values
   are serialized successively as described above.

Use of Character Sets UTF-16BE and UTF-16LE Within MIME

   Since both UTF-16BE and UTF-16LE are 16 bit character encodings
   which are not supersets of US-ASCII, they may not be used for MIME
   encoding of e-mail [RFC822], as mandated in the MIME specification
   [MIME]. These character sets may only be used for protocols which
   allow a 16-bit encoding.

   The MIME charset identifier for UTF-16BE is "UTF-16BE", and for
   UTF-16LE is "UTF-16LE". Both of these identifiers signify version
   2.0 or later of The Unicode Standard.

Summary

   The UTF-16BE and UTF-16LE encodings allow Unicode characters to be
   transmitted as 16 bit values. This allows greater efficiency in the
   transmission of multilingual text.

   UTF-16 should only be used with protocols which allow 16 bit
   characters. In particular, it may not be used to encode e-mail
   [RFC822].

Acknowledgements

   Many thanks to the following people for their contributions,
   comments, and suggestions. If we have omitted anyone it was through
   oversight and not intentionally.

      Mark Davis
      Ken Whistler
      Martin Duerst

Security Considerations

   Security issues are not discussed in this memo.

References

   [UNICODE 2.0]  "The Unicode Standard, Version 2.0", The Unicode
                  Consortium, Addison-Wesley, 1996. ISBN 0-201-48345-9.

   [UNICODE 2.1]  "The Unicode Standard, Version 2.1", The Unicode
                  Consortium, Technical Report #8. Available at .

   [ISO 10646]    ISO/IEC 10646-1:1993(E) Information Technology--
                  Universal Multiple-octet Coded Character Set (UCS).
                  See also amendments 1 through 7, plus editorial
                  corrections.

   [US-ASCII]     Coded Character Set--7-bit American Standard Code for
                  Information Interchange, ANSI X3.4-1986.

   [RFC822]       Crocker, D., "Standard for the Format of ARPA
                  Internet Text Messages", STD 11, RFC 822, UDEL,
                  August 1982.

   [MIME]         Borenstein N., N. Freed, K. Moore, J. Klensin, and J.
                  Postel, "MIME (Multipurpose Internet Mail Extensions)
                  Parts One through Five", RFC 2045, 2046, 2047, 2048,
                  and 2049, November 1996.

Author's Address

   David Goldsmith
   Apple Computer, Inc.
   2 Infinite Loop, MS: 302-2IS
   Cupertino, CA 95014

   Phone: 408-974-1957
   Fax: 408-862-4566
   EMail: goldsmith@apple.com