L2/05-356

Markus Scherer, 2005-10-28

Proposed UTN "Unicode, BOM, Signatures"

On 2005-10-26, I submitted a document with the title "Unicode, BOM, Signatures" to the Unicode editorial committee for consideration as a Unicode Technical Note. The document, appended below (without the UTN boilerplate sections), gives recommendations for the use of BOMs and signatures, and for which Unicode charsets should generally be supported and used. Background information explains the terminology and necessary details. The goal is to give developers concrete recommendations that result in a high level of interoperability.

The committee felt that the topic was sufficiently important to be brought to the attention of the UTC. Possible outcomes include adding such recommendations to the Unicode Standard (for example, in chapter 5), publishing a new UTR, or publishing a UTN.

Document proposed for consideration as a UTN, here without the UTN boilerplate sections, and including changes based on feedback from the editorial committee:

Summary

This document attempts to explain Unicode Byte Order Marks (BOMs) and signatures and gives recommendations for the use of Unicode charsets for text files and protocols.

Status [from the UTN template]

This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes/.

1 Introduction
2 Recommendations
3 Background Information
4 Samples
Acknowledgments
References
Modifications

1 Introduction

Unicode encodes about 100,000 characters (Unicode 4.1, 2005) and has space for more than a million. It is not possible to use just one or two bytes for each character unless some kind of compression is used. Unlike older character sets, Unicode is often used with 16-bit or 32-bit units rather than bytes.

Some Unicode encoding schemes include an endianness (byte order) indication, and there are conventions for using special byte sequences to indicate Unicode text and its encoding.

This document attempts to explain the use of these indicators and sequences and to give recommendations, covering only as much other Unicode background as is needed to make the text self-contained. The recommendations come first, followed by the technical details. If you are unfamiliar with the terminology in the recommendations, please consult the background sections.

2 Recommendations

The following recommendations are based on best practices that result in a high level of interoperability. They apply to Unicode text files and file-like protocols (not to APIs that handle Unicode text).

Reading Text Files

Software that reads text data should handle UTF-8 and UTF-16, wherever allowed by protocols.
Software should ignore initial U+FEFF characters in Unicode text files. It should replace all instances of U+FEFF that are not used as a BOM/signature by U+2060 (Word Joiner).
If a text file starts with EF BB BF, assume it's UTF-8. Convert to Unicode, and then strip or ignore the initial U+FEFF character.
If a text file starts with FE FF or FF FE (not followed by 00 00), assume it's UTF-16. Convert to Unicode, and then strip or ignore the initial U+FEFF character. (A conformant UTF-16 converter will consume the BOM, that is, it will not emit a U+FEFF character for it.)
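The reading rules above can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `decode_text` and the `fallback` parameter are assumptions made for this example, and input without a UTF-8 or UTF-16 signature is simply handed to the fallback charset.

```python
import codecs

BOM_CHAR = "\ufeff"      # U+FEFF, used as BOM/signature
WORD_JOINER = "\u2060"   # U+2060, replacement for non-initial U+FEFF


def decode_text(data: bytes, fallback: str = "utf-8") -> str:
    """Decode text bytes, honoring a UTF-8 or UTF-16 signature if present."""
    if data.startswith(codecs.BOM_UTF8):            # EF BB BF
        text = data.decode("utf-8")
    elif data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        text = data.decode("utf-16-be")
    elif (data.startswith(codecs.BOM_UTF16_LE)      # FF FE ...
          and not data.startswith(codecs.BOM_UTF32_LE)):  # ... not followed by 00 00
        text = data.decode("utf-16-le")
    else:
        # No recognized signature: the charset must come from elsewhere
        # (here, the hypothetical `fallback` argument).
        text = data.decode(fallback)

    # Strip the initial U+FEFF, then replace any remaining ones by U+2060.
    if text.startswith(BOM_CHAR):
        text = text[1:]
    return text.replace(BOM_CHAR, WORD_JOINER)
```

The explicit-endian codecs are used on purpose so that the initial U+FEFF survives decoding and is removed in one place, after the conversion.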

Writing Text Files

When writing UTF-8 text files, check whether the receiver can handle a signature, and add one only if it can. (For example, some XML parsers reject the UTF-8 signature.)
When writing UTF-16 text files, always emit the BOM. Converters should do so automatically.
When protocols (like DCOM and CORBA) allow the exchange of integer units larger than bytes, the UTF-16 encoding form (without the BOM, not the encoding scheme) should be used directly as a stream of 16-bit code units. (The package or message envelope in this case handles the byte serialization and endianness for all larger-than-byte units together.)
Don't encode Unicode text in anything other than UTF-8 or UTF-16 unless a protocol requires it.
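As an illustration of the first two writing rules, the following sketch uses Python's standard codecs. The function names and the `with_signature` flag are assumptions for this example; whether a receiver accepts a signature has to be decided by the caller.

```python
def encode_utf8(text: str, with_signature: bool) -> bytes:
    """Encode as UTF-8, adding the EF BB BF signature only when requested."""
    # "utf-8-sig" prepends the signature; plain "utf-8" does not.
    return text.encode("utf-8-sig" if with_signature else "utf-8")


def encode_utf16(text: str) -> bytes:
    """Encode as UTF-16 with a BOM.

    Python's "utf-16" codec emits a BOM automatically and uses the
    platform's native byte order, so the result starts with either
    FE FF or FF FE depending on the machine.
    """
    return text.encode("utf-16")
```

Because the byte order of `encode_utf16` depends on the platform, a reader should rely on the BOM rather than on a fixed byte order.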

Text that is being processed in one of the Unicode encoding forms does not need a BOM, and one should never be used there.

3 Background Information

Code Points

A Unicode code point is an integer in the range 0..0x10FFFF (decimal 0..1,114,111). It is often written as U+0000..U+10FFFF with 4 to 6 hex digits. Encoded characters have code points assigned to them. Code points without characters are mostly unassigned; a few have special status.

Example code points:

U+004F: Assigned Latin letter 'O'.
U+4F00: Assigned Chinese character '伀'.
U+FEFF: Assigned formatting character, but used mostly for its special byte encodings, for a BOM or signature. The use of U+FEFF as a formatting character has been deprecated, and U+2060 should be used for that function instead. (See the Unicode Standard for details.)
U+FFFE: Noncharacter, will never be used to encode a real character and should never be used in public text data exchange. In UTF-16, its byte serialization is the reverse of the one for U+FEFF.

Encoding Forms

Encoding Forms define how Unicode is processed in software where 16/32-bit integers are available. They are what programmers see in C, Java, etc. The endianness of integers is defined by the CPU (and usually invisible because integers are used as atomic units) and therefore need not be specified for each string. There are three Unicode Encoding Forms:

UTF-8: A code point is encoded with 1, 2, 3, or 4 bytes (8-bit code units).
UTF-16: A code point is encoded with 1 or 2 16-bit code units.
UTF-32: A code point is encoded with 1 32-bit code unit.
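To make the code unit counts concrete, here is a small sketch that counts the code units needed for one code point in each encoding form. The helper name `code_unit_counts` is an assumption for this example, and the explicit big-endian codecs are used only to avoid counting a BOM.

```python
def code_unit_counts(code_point: int) -> dict:
    """Return the number of code units per encoding form for one code point.

    Surrogate code points (U+D800..U+DFFF) cannot be encoded and make
    the underlying codecs raise UnicodeEncodeError.
    """
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }
```

For example, `code_unit_counts(0x4F00)` reports three UTF-8 code units but only one UTF-16 and one UTF-32 code unit, while a supplementary code point such as U+10FFFF needs four UTF-8 code units and two UTF-16 code units.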

Encoding Schemes

Encoding Schemes represent Unicode text as streams of bytes. They are used in files and byte-oriented protocols. In other words, this is what is visible as web page and email charsets, for example. The endianness is important, because any unit that is larger than a byte can be serialized with either the most or the least significant byte first. There are 7 Unicode Encoding Schemes defined by the standard itself:

UTF-8: Works the same as the UTF-8 encoding form.
UTF-16, UTF-16BE, UTF-16LE: Byte serializations of the UTF-16 encoding form. A pair of bytes corresponds to each 16-bit code unit. The name "UTF-16" means different things depending on whether it refers to an encoding form or an encoding scheme!
UTF-32, UTF-32BE, UTF-32LE: Byte serializations of the UTF-32 encoding form. A sequence of 4 bytes corresponds to each 32-bit code unit. The name "UTF-32" means different things depending on whether it refers to an encoding form or an encoding scheme!
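The difference between the explicit-endian encoding schemes can be seen by serializing the same two code points, U+004F and U+4F00, with each of them. The helper name `serialize` is an assumption for this sketch; only the explicit-endian Python codecs are used, so the output does not depend on the platform and contains no BOM.

```python
def serialize(text: str, codec: str) -> str:
    """Return the bytes of `text` in the given codec as spaced hex digits."""
    return " ".join(f"{byte:02X}" for byte in text.encode(codec))


TEXT = "O\u4f00"  # U+004F followed by U+4F00

for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{codec:10} {serialize(TEXT, codec)}")
```

The big-endian and little-endian results contain the same code units with the bytes inside each unit reversed.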

Byte Order

The UTF-16/32 "BE" (big-endian) encoding schemes always put the high byte(s) first. The UTF-16/32 "LE" (little-endian) encoding schemes always put the low byte(s) first. The UTF-16/32 encoding schemes without suffix use an optional Byte Order Mark (BOM), which is really the appropriate encoding of U+FEFF. If the big- or little-endian byte sequence for U+FEFF is at the start of a UTF-16/32 byte stream, then it indicates the endianness of the entire stream. The initial U+FEFF is not part of the text in this case! If the stream does not start with a BOM, then it defaults to big-endian. This is part of the UTF-16/32 encoding scheme specifications themselves.
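A decoder for the unsuffixed UTF-16 encoding scheme, following the rule above, might look like this sketch. The function name `decode_utf16_scheme` is an assumption. The big-endian default is coded explicitly because, as far as I know, Python's built-in "utf-16" codec falls back to the platform's native byte order when no BOM is present.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes in the UTF-16 encoding scheme (BOM optional)."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF: big-endian stream
        return data[2:].decode("utf-16-be")    # the BOM is not part of the text
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE: little-endian stream
        return data[2:].decode("utf-16-le")
    # No BOM: the scheme defaults to big-endian.
    return data.decode("utf-16-be")
```

Only the initial BOM is consumed. A later FE FF pair in a big-endian stream decodes to an ordinary U+FEFF character that stays in the text.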

U+FEFF was chosen because its byte serialization uses the two highest byte values, which are uncommon at the start of text byte streams in any charset.

Non-initial U+FEFF characters are formatting characters and must be included in the text.

Charsets

Text is sometimes encoded with algorithms that do not fit the definition of an encoding scheme. All methods of mapping between text and streams of bytes are included here under the more general term "charsets".

For example, there are Unicode charsets that are designed for special purposes (and are used much less frequently than the Unicode encoding schemes), such as UTF-7, SCSU, and BOCU-1.

Signatures

The BOM (the big-endian or little-endian encoding of U+FEFF) has such a distinctive byte pattern that it is commonly used not only to determine the endianness of text that is known to be in a particular encoding scheme, but also to determine whether the text is in a Unicode charset at all. Thus, the byte sequences corresponding to an initial U+FEFF are sometimes used to detect Unicode text (as opposed to text in a legacy charset) and to distinguish between different Unicode charsets. These byte sequences are called signatures. Signatures are used regardless of whether a particular charset needs an endianness indicator, because they also distinguish between the different charsets.

When software detects a signature, it should convert the byte stream to Unicode according to the corresponding charset from the table below and then remove the initial U+FEFF. (It is sometimes possible to instead remove the signature byte sequence before the conversion, but this does not work for some of the stateful charsets.)

Signature addition, detection, and stripping constitute a higher-level protocol and may be handled on top of the conversion itself.

Table 1: Signature Byte Sequences
Byte Sequence	Unicode Encoding Scheme
`FE FF`	UTF-16BE
`FF FE` (not followed by `00 00`)	UTF-16LE
`00 00 FE FF`	UTF-32BE
`FF FE 00 00`	UTF-32LE
`EF BB BF`	UTF-8
Byte Sequence	Other Unicode Charset
`0E FE FF`	SCSU
`FB EE 28`	BOCU-1 (U+FEFF must be removed after conversion)
`2B 2F 76 38 2D` or `2B 2F 76 38` or `2B 2F 76 39` or `2B 2F 76 2B` or `2B 2F 76 2F`	UTF-7 (only the first sequence can be removed before conversion; otherwise U+FEFF must be removed after conversion)
`DD 73 66 73`	UTF-EBCDIC

The detection of signature byte sequences can be a very simple part of a heuristic charset detection that also detects legacy charsets.
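A lookup based on Table 1 could be sketched as follows. All names here (`SIGNATURES`, `PYTHON_CODECS`, `detect_signature`, `decode_with_signature`) are assumptions for this example. The sketch detects every signature in the table but decodes only the Unicode encoding schemes. The other charsets are reported as unsupported rather than guessed at, since a converter for them is not assumed to be available.

```python
# Table 1, longest sequences first so that FF FE 00 00 is tested before
# FF FE, and 2B 2F 76 38 2D before 2B 2F 76 38.
SIGNATURES = [
    ("2B 2F 76 38 2D", "UTF-7"),
    ("00 00 FE FF", "UTF-32BE"),
    ("FF FE 00 00", "UTF-32LE"),
    ("DD 73 66 73", "UTF-EBCDIC"),
    ("2B 2F 76 38", "UTF-7"),
    ("2B 2F 76 39", "UTF-7"),
    ("2B 2F 76 2B", "UTF-7"),
    ("2B 2F 76 2F", "UTF-7"),
    ("EF BB BF", "UTF-8"),
    ("0E FE FF", "SCSU"),
    ("FB EE 28", "BOCU-1"),
    ("FE FF", "UTF-16BE"),
    ("FF FE", "UTF-16LE"),
]

# Python codec names for the charsets this sketch is willing to decode.
PYTHON_CODECS = {
    "UTF-8": "utf-8",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
    "UTF-32BE": "utf-32-be",
    "UTF-32LE": "utf-32-le",
}


def detect_signature(data: bytes):
    """Return the charset name for a leading signature, or None."""
    for hex_signature, charset in SIGNATURES:
        if data.startswith(bytes.fromhex(hex_signature)):
            return charset
    return None


def decode_with_signature(data: bytes):
    """Return (charset, text) with the initial U+FEFF removed."""
    charset = detect_signature(data)
    if charset not in PYTHON_CODECS:
        raise ValueError(f"no decodable signature found (detected: {charset})")
    # Convert first, then remove the initial U+FEFF from the result.
    text = data.decode(PYTHON_CODECS[charset])
    if text.startswith("\ufeff"):
        text = text[1:]
    return charset, text
```

Ordering the table by signature length is one simple way to honor the "not followed by `00 00`" condition for UTF-16LE.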

4 Samples

In the following tables, a * stands for a U+FEFF character which has no visible form. A ~ stands for a U+FFFE noncharacter code point.

Without Signature Detection

The input consists of a sequence of bytes and a charset declaration. The output is Unicode text.

Table 2: Interpreting Byte Sequences with Known Charsets
Input		Output
Bytes	Charset	Code Points	Text
`4F E4 BC 80`	UTF-8	`004F 4F00`	O伀
`EF BB BF 4F E4 BC 80`	UTF-8	`FEFF 004F 4F00`	*O伀
`EF BB BF 4F EF BB BF E4 BC 80`	UTF-8	`FEFF 004F FEFF 4F00`	*O*伀
`00 4F 4F 00`	UTF-16	`004F 4F00`	O伀
`00 4F 4F 00`	UTF-16BE	`004F 4F00`	O伀
`00 4F 4F 00`	UTF-16LE	`4F00 004F`	伀O
`FE FF 00 4F 4F 00`	UTF-16	`004F 4F00`	O伀
`FE FF 00 4F FE FF 4F 00`	UTF-16	`004F FEFF 4F00`	O*伀
`FF FE 4F 00 00 4F`	UTF-16	`004F 4F00`	O伀
`FF FE 00 4F 4F 00`	UTF-16	`4F00 004F`	伀O
`FE FF 00 4F 4F 00`	UTF-16BE	`FEFF 004F 4F00`	*O伀
`FE FF 00 4F 4F 00`	UTF-16LE	`FFFE 4F00 004F`	~伀O
`00 00 00 4F 00 00 4F 00`	UTF-32	`004F 4F00`	O伀
`00 00 00 4F 00 00 4F 00`	UTF-32BE	`004F 4F00`	O伀
`00 00 00 4F 00 00 4F 00`	UTF-32LE	error, illegal byte sequence
`00 00 FE FF 00 00 00 4F 00 00 4F 00`	UTF-32	`004F 4F00`	O伀
`00 00 FE FF 00 00 00 4F 00 00 FE FF 00 00 4F 00`	UTF-32	`004F FEFF 4F00`	O*伀
`FF FE 00 00 4F 00 00 00 00 4F 00 00`	UTF-32	`004F 4F00`	O伀
`FF FE 00 00 00 00 00 4F 00 00 4F 00`	UTF-32	error, illegal byte sequence
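Several rows of Table 2 can be reproduced with Python's standard codecs, as in the sketch below. The helper name `decode_known` is an assumption. Rows that use the unsuffixed UTF-16 or UTF-32 schemes without a BOM are better left out of such a check, because Python's "utf-16" and "utf-32" codecs then use the platform's native byte order rather than the big-endian default described above.

```python
def decode_known(hex_bytes: str, codec: str) -> str:
    """Decode hex-notated bytes with a declared charset.

    Returns the resulting code points as 4-digit (or longer) hex values,
    or the table's error text if the byte sequence is illegal.
    """
    try:
        text = bytes.fromhex(hex_bytes).decode(codec)
    except UnicodeDecodeError:
        return "error, illegal byte sequence"
    return " ".join(f"{ord(ch):04X}" for ch in text)


print(decode_known("EF BB BF 4F E4 BC 80", "utf-8"))
print(decode_known("FE FF 00 4F 4F 00", "utf-16-be"))
print(decode_known("FE FF 00 4F 4F 00", "utf-16-le"))
print(decode_known("00 00 00 4F 00 00 4F 00", "utf-32-le"))
```

With an explicit charset declaration no signature handling takes place, which is why the initial U+FEFF (or the noncharacter U+FFFE) shows up in the output.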

With Signature Detection

The input consists of a sequence of bytes starting with a signature byte sequence. The output is Unicode text (with the initial U+FEFF removed), and the charset name.

Table 3: Interpreting Byte Sequences with Signature Detection
Input	Output
Bytes	Charset	Code Points	Text
`EF BB BF 4F E4 BC 80`	UTF-8	`004F 4F00`	O伀
`EF BB BF 4F EF BB BF E4 BC 80`	UTF-8	`004F FEFF 4F00`	O*伀
`FE FF 00 4F 4F 00`	UTF-16BE	`004F 4F00`	O伀
`FE FF 00 4F FE FF 4F 00`	UTF-16BE	`004F FEFF 4F00`	O*伀
`FF FE 4F 00 00 4F`	UTF-16LE	`004F 4F00`	O伀
`FF FE 00 4F 4F 00`	UTF-16LE	`4F00 004F`	伀O
`00 00 FE FF 00 00 00 4F 00 00 4F 00`	UTF-32BE	`004F 4F00`	O伀
`FF FE 00 00 4F 00 00 00 00 4F 00 00`	UTF-32LE	`004F 4F00`	O伀
`FF FE 00 00 4F 00 00 00 FF FE 00 00 00 4F 00 00`	UTF-32LE	`004F FEFF 4F00`	O*伀
`FF FE 00 00 00 00 00 4F 00 00 4F 00`	UTF-32LE	error, illegal byte sequence

Acknowledgments

Thanks to Dale Schultz, Mark Davis and Ken Whistler for valuable feedback on this document.