L2/00-139

Unicode Technical Committee, NCITS/L2 Joint Meeting

Proposal to add another Unicode Encoding Form: UTF-8-16

 

 

Date:       24th April 2000

By:          Jianping Yang (Oracle Corporation)

                Toby Phipps (PeopleSoft, Inc.)

 

 

With the imminent encoding of the first characters outside the BMP, it is critical that existing and future systems implementing Unicode provide at least basic support for the use of these characters within information processing software.  Lack of support for surrogate characters in many real-world data processing systems will provide yet another reason to request that new character proposals be encoded in the BMP, even though they may be better suited to other planes.  Although the representation of surrogate values in Unicode and its transformation forms has been documented and widely understood for some time, there are several key challenges relating to the representation of surrogate characters in the Unicode Transformation Forms that will significantly hinder and slow support in real-world implementations.

 

Specifically, when surrogate values are present:

 

·      UTF-8 byte streams can no longer be relied upon to produce the same binary collation order as the equivalent UTF-16 data

·      Presentation of character semantics to higher-level systems and languages that require co-existence with non-Unicode encodings is hampered by the fact that the codepoint-to-character representation architectures of UTF-16 and UTF-8 differ markedly

 

Binary Collation

Given the proliferation of locale- and/or language-specific collation orders and the need for multiple collations of a single data set based on user preferences, the technique of using binary collation internally and a locale-sensitive collation for data presentation is commonly used.  This technique works well when a single Unicode Transformation Form is implemented in a system, or when multiple forms are used and no surrogate values are present, but it breaks down when data encoded in two or more of the Transformation Forms contains surrogate characters.  Information processing systems needing to integrate internally or externally with both UTF-16 and UTF-8 data sources must then normalize both sources to a single form before being able to rely on a common binary collation.
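
The divergence is easy to demonstrate.  The following minimal sketch (written here in Python purely for illustration; the sample characters are arbitrary) sorts the same two strings by their UTF-16 code units and by their UTF-8 bytes.  The supplementary-plane character U+10000 sorts first under UTF-16 (its high surrogate 0xD800 is less than 0xE000) but last under UTF-8 (its lead byte 0xF0 is greater than 0xEE).

    strings = ["\U00010000", "\uE000"]   # a supplementary-plane character and a BMP character

    by_utf16 = sorted(strings, key=lambda s: s.encode("utf-16-be"))  # binary order of UTF-16 code units
    by_utf8  = sorted(strings, key=lambda s: s.encode("utf-8"))      # binary order of UTF-8 bytes

    print([hex(ord(c)) for s in by_utf16 for c in s])  # ['0x10000', '0xe000']
    print([hex(ord(c)) for s in by_utf8  for c in s])  # ['0xe000', '0x10000']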

 

The combination of UTF-8 and UTF-16 data in a single system is very much a real scenario.  Specific examples of such implementations include:

 

·      Unix-based systems using UTF-8 at the presentation level to accommodate the now-common UNIX X/Open UTF-8 locales, but processing internally in UTF-16

·      Cross-vendor database integration between databases encoded in UTF-8 (such as Oracle 8) and databases encoded in UTF-16 (such as Microsoft SQL Server 7)

·      XML data processing where the source or target system implements a UTF-16 encoding

·      Legacy code, or code that must support multiple non-Unicode encodings and is not tolerant of null bytes, integrated with a UTF-16 data store such as Microsoft SQL Server or a UTF-16 API such as Win32

 


While it is true that in all these scenarios, normalizing both data sets to a single transformation form and performing binary collation against the normalized form is technically possible without data loss, it is a significant additional processing step, and has potentially serious performance implications when dealing with very large volumes of data.
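
A rough sketch of that extra step (again in Python, with hypothetical function and argument names) shows the cost: every UTF-8 value must be transcoded before a binary comparison against UTF-16 data is meaningful.

    def compare_as_utf16(utf8_a: bytes, utf8_b: bytes) -> int:
        # Transcode both UTF-8 operands to UTF-16 before comparing:
        # the normalization step described above.
        a = utf8_a.decode("utf-8").encode("utf-16-be")
        b = utf8_b.decode("utf-8").encode("utf-16-be")
        return (a > b) - (a < b)   # -1, 0, or 1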

 

Therefore, the ability to provide the benefits of a UTF-8-like transformation while maintaining binary collation equivalent to UTF-16 is important.  Just as important is maintaining backwards compatibility with the existing UTF-8 transformation when surrogate values are not present.

 

Higher-Level Language Character Semantics

The Unicode Standard offers two primary ways of thinking about the encoding of textual data: character-oriented and codepoint-oriented.  The relationship of character to codepoint in each of the Transformation Forms is clearly defined, but not well understood.  Table 1 illustrates the relationship of characters to codepoints in each of the Unicode representations, ignoring the relationship of bytes to codepoints.

 

 

                     UCS-2            UTF-16                  UTF-32   UTF-8

No Surrogates        1:1              Not Applicable          1:1      1:1
                                      (equivalent to UCS-2)

Surrogates Present   Not Applicable   1:1 or 1:2              1:1      1:1

Table 1: Character-to-codepoint ratios (character : codepoint) in the various Unicode forms

 

As can be seen, UTF-8's character-to-codepoint semantics are identical to, and can interoperate cleanly with, UTF-32's 31-bit character representation.  However, when dealing with a mixed UTF-16 and UTF-8 environment, the codepoint-to-character semantics of the two forms are markedly different: a surrogate character requires only a single codepoint in UTF-8, while it requires two in UTF-16.
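
The difference is visible even for a single character.  The short illustration below (in Python; the choice of U+10000 is arbitrary) counts the codepoints each form needs for one supplementary-plane character.

    s = "\U00010000"                                     # one character beyond the BMP
    utf16_codepoints = len(s.encode("utf-16-be")) // 2   # 2: a surrogate pair
    utf8_codepoints  = len(s)                            # 1: a single four-byte sequence
    print(utf16_codepoints, utf8_codepoints)             # 2 1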

 

While this may be viewed as a minor distinction, it has a great impact on the ability of a higher-level language to deal with co-existing UTF-16 and UTF-8 data.  With the advent and widespread acceptance of Unicode, implementers of higher-level languages such as SQL are fast moving away from existing byte-oriented declarations, and adopting codepoint- or character-oriented semantics.  The ability for a user to declare a container sized to hold n characters of data is important, as the general community cannot be expected to understand, or to want to understand, lower-level data encoding forms.

 

One approach is to allow higher-level languages to declare text in terms of characters, regardless of the plane on which a character resides.  Such an implementation maps well to storing and processing data in UTF-32 or UTF-8, given their simple character-to-codepoint ratio of 1:1.  However, it is unlikely that we will see widespread implementations of UTF-32, given the high volume of Latin text in legacy single-byte encodings that exists today.  This approach would reserve four bytes of storage for each character in the higher-level language's declaration, sufficient for storage in either UTF-32 or UTF-8.

 

Another approach that is proving popular is to allow declaration of text in terms of UTF-16 codepoints, given that the great majority of today's characters (and the entirety of today's officially encoded data) lie within the BMP.  This approach would reserve two bytes of storage for each codepoint declared.
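
As a small illustration of the two declaration styles (the helper names below are ours, not part of any existing SQL dialect), a container declared to hold ten characters reserves 40 bytes under the first approach, while a container declared to hold ten UTF-16 codepoints reserves 20 bytes under the second.

    def bytes_for_character_declaration(n_chars: int) -> int:
        return 4 * n_chars          # widest UTF-8 sequence, or one UTF-32 unit, per character

    def bytes_for_utf16_codepoint_declaration(n_codepoints: int) -> int:
        return 2 * n_codepoints     # one 16-bit code unit per declared codepoint

    print(bytes_for_character_declaration(10))          # 40
    print(bytes_for_utf16_codepoint_declaration(10))    # 20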

 

However, the second, codepoint-oriented approach described above does not fit well with intermixed UTF-16 and UTF-8 data, which we have previously shown to be a common requirement.  While one or two UTF-16 codepoints may be required to encode a single character, only one UTF-8 codepoint is ever required, so there is no constant character-to-codepoint ratio between the two encodings.  It would therefore be difficult for such information processing systems to combine the codepoint-oriented semantics of UTF-16 with the well-known programming benefits of UTF-8.


Proposed Unicode Transformation Form

 

Given the numerous technical difficulties, described in this document, of dealing with mixed UTF-16 and UTF-8 environments when surrogate characters exist, it is likely that many implementers of The Unicode Standard will resort to implementation-specific internal encodings to deal with the differences in binary collation and character-to-codepoint semantics between the two Transformation Forms.  These concerns are not vendor-specific, but ones that will be faced by a large community of software designers, and will be particularly important when processing high volumes of Unicode data efficiently.

 

Instead of encouraging vendor-specific internal encodings to work around these problems, we propose the addition of a new Unicode Transformation Form that provides complete backward compatibility with UTF-8 when surrogate values are not present.  Given this compatibility with existing non-surrogate UTF-8 systems, vendors already using UTF-8 could, in the process of enhancing their systems, choose freely between continuing with the current UTF-8 approach to encoding surrogates or implementing the proposed new transformation form.  It is likely that many vendors relying on binary collation and codepoint semantics would benefit from using this proposed transformation.

 

UTF-8-16 (a tentative name) is proposed as being identical to UTF-8 with the exception of the encoding of surrogate pairs.  Instead of encoding a surrogate pair as a single four-byte sequence, as UTF-8 does, UTF-8-16 encodes the high and low surrogate values as separate codepoints, using the existing UCS-2 to UTF-8 algorithm.  Thus, a surrogate pair is represented by two three-byte sequences, six bytes in total.
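
The following sketch makes the transformation concrete (Python is used here only for illustration; the function name utf_8_16_encode is ours and is not part of the proposal).  The input string is first broken into UTF-16 code units, and each code unit, including each half of a surrogate pair, is then encoded with the ordinary UCS-2 to UTF-8 algorithm.

    def utf_8_16_encode(text: str) -> bytes:
        # Step 1: break the string into UTF-16 code units,
        # splitting supplementary characters into surrogate pairs.
        units = []
        for ch in text:
            cp = ord(ch)
            if cp > 0xFFFF:
                cp -= 0x10000
                units.append(0xD800 | (cp >> 10))    # high surrogate
                units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
            else:
                units.append(cp)
        # Step 2: apply the UCS-2 to UTF-8 algorithm to each code unit,
        # treating surrogate values like any other 16-bit value.
        out = bytearray()
        for u in units:
            if u < 0x80:
                out.append(u)
            elif u < 0x800:
                out += bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
            else:
                out += bytes([0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)])
        return bytes(out)

    # U+10000 becomes ED A0 80 ED B0 80 (six bytes),
    # where standard UTF-8 would produce F0 90 80 80 (four bytes).
    print(utf_8_16_encode("\U00010000").hex(" "))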

 

Applying the identical UCS-2 to UTF-8 algorithm to UTF-16 data on a codepoint-by-codepoint basis ensures equivalent binary collation between the source UTF-16 data and the target UTF-8-16 transformation.  It also maintains a ratio of bytes to codepoints consistent with the source UTF-16, as is required by higher-level languages using codepoint-oriented textual declaration semantics.
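
Continuing the earlier sketches, the same sample strings that sorted differently under UTF-8 and UTF-16 now sort identically when the utf_8_16_encode function above is used in place of standard UTF-8.

    strings = ["\U00010000", "\uE000"]
    assert (sorted(strings, key=lambda s: s.encode("utf-16-be"))
            == sorted(strings, key=utf_8_16_encode))   # same binary order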

 

While it is possible, and indeed likely, that UTF-8-16 would be implemented only internally by many data processing systems, and that data would be externalized in an existing transformation form, the inclusion of a specification for UTF-8-16 in the Unicode Standard would greatly enhance the ability of multiple-vendor Unicode systems to co-operate.  It is much better to have an underutilized but well-understood Transformation Form that provides its implementers with practical benefits and solutions to common problems than to have each vendor implement its own private transformation in solitude, resulting in a proliferation of internal, vendor-specific Unicode transformations and in difficult, inefficient communication between systems using differing UTFs.