Proposal for Equivalent Binary Collation
across Unicode Transformation Forms
Jianping Yang (Oracle Corporation)
Nobuyoshi Mori (SAP, AG)
Toby Phipps (PeopleSoft, Inc.)
May 23, 2001
Considering the three Unicode encodings, UTF-8, UTF-16 and UTF-32, each UTF-n variation is merely a mathematical transformation of Unicode.
When not considering non-BMP characters, UTF-n's are originally very well designed (whether by design or accident) in a way that the binary order of character values are preserved after the transformation across all 3 forms. The problem arises with the introduction of surrogates: the order in UTF-16 is broken, based on the design that surrogates are located at the position, U+D800 - U+DFFF. If we would had assigned surrogate in U+F800 - U+FFFF, we could preserve the order also in UTF-16. The current order status is:
UTF-8: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
As described in previous discussion papers, this difference in the binary ordering of data in each UTF-n transformation is of great concern to the database community, both database vendors and consumers of database tools. Specifically, any system that must deal with the indexing and comparison of large collections of data across multiple encodings will run into the issue that data that is ordered based on the binary values in one encoding will no longer be ordered such once transformed into another encoding. While this lack of binary ordering compatibility across different encodings is very true and well-understood in the world of legacy encodings (such as a transcode of Shift-JIS to EUC-JP), given that all the Unicode Transformation Forms are maintained by a single committee, it should be possible to come up with a common binary order between each of the three main Unicode Transformation Forms.
Without a set of transformation forms that share a binary ordering, the implementation of cross-platform systems, (and in particular, database-based systems) that appropriatey support non-BMP characters will likely be slowed. Existing systems that can currently rely on the equivalent binary sort between all UTFs will break when surrogate values are introduced.
This paper does not propose to modify any of the existing forms, and therefore break any applications that may exist that correctly implement these forms today Instead, it proposes to add at least one optional additional UTF, in the form of a Unicode Technical Report (UTR). This form could be implemented by system designers where the benefit of a common binary ordering across different UTFs is important from a performance standpoint, or for other reasons. The new UTF(s) would have equivalent standing as the UTF-EBCDIC transformation currently maintained in UTR#16. It is not proposed that the new transformation form(s) become Standard Annexes (UAX), nor would they be proposed for inclusion in ISO 10646.
This suggestion, to tolerate 2 x 3 byte notation of surrogate pairs in an encoding named “UTF8-S” is an attempt to preserve the UTF-16 order in UTF-8. We would apply the same thoughts to UTF-32, and map (U+E000-U+FFFF) behind the surrogate so that we need only 4 bytes to express one surrogate character and we keep the UTF-16 order, and creating a UTF-32S encoding. Thus, UTF-32S, UTF-16 and UTF8-S would all yield a common binary sort.
This is a UTF-16-order oriented approach. We have another possibility which is to create a UTF-16S encoding in which the surrogate is ordered behind the existing codepoints (U+E000-U+FFFF). But it seems to us that enhancement of UTF-8 and UTF-32 is less problematic at this point of the time, especially given that an additional UTF-16 encoding designed to bring its binary sort in line with UTF-8 and UTF-16 would necessarily break its compatibility with UCS-2 for BMP characters.
Following are some typical use-cases where the binary sorting of data across Unicode Transformation Forms is critical:
A SQL statement executed on the database server returns a result set ordered by the binary sort of the data in UTF-8, given that this is the encoding of both data and indexes in the database.
A C/C++ or Java UTF-16-based client receives this result set and must compare it to a large collection of data stored locally in UTF-16. Unless this local data collection is transformed to UTF-8 and then re-ordered (or re-ordered using the proposed UTF-16S mutated comparison), there is no efficient way to compare the two sets of data without having to sort both in memory on the client machine in its native encoding (UTF-16 in this case). This is exacerbated greatly if the UTF-16 comparison data set is significantly large.
Consider a Client/Server system (or the Client/Application Server tiers of an n-tier system), where the client runs in a UTF-16 environment (such as Win32 or Java), and the server runs on a UTF-8 or UTF-32 based environment such as most Unixes (given the 32-bit direction of wchar_t on Unix). If the client in such a system generated a large datafile, ordered based on the binary sort of its encoding, UTF-16, and then sent this file to the server for processing, the order of the data in the file would be lost when transcoding it to UTF-8 or UTF-32 to be processed by the server. Instead of being able to make use of the pre-sorted data, the ordering process would need to be repeated on the server before it could trust any comparison of the file-based data against an existing ordered-list in memory, in other files or in a back-end database.
An valid alternative to creating any new Unicode Transformation Forms would be to enhance existing database systems to allow the sorting of data in a transformation form other than the one in which the data is stored. Cf. a UTF-8 database providing collation in UTF-16. While this is technically possible given today’s database technology, and the use of function-based indexes (where an index can be created off a computed value, such as a sort key), it is not palatable in large-scale implementations. One of the main reasons for the reluctance to implement “alternate” non-binary or “modified-binary” sorts in database systems is the requirement to maintain duplicate indexes in order to avoid a runtime memory-sort. Virtually all popular SQL databases must create indexes in the binary sort order of the underlying encoding, given the usage of binary tree (B-Tree) index technology. Querying these indexes directly will provide a result set pre-ordered by the binary values of the data in its host encoding, and require no re-sort in memory. Many of today’s large database-based systems including SAP, PeopleSoft and the Oracle Application Suite (which together make up over 80% of the ERP market), rely on this index-based sort for performance of internal SQL operations.
Although platform-specific SQL syntax and index functionality exists to pre-build sorted indexes using values other than the binary representation of the data in the underlying encoding, this requires the maintenance of duplicate sets of indexes, and non-standard SQL syntax to retrieve data sorted in this “non-binary” order. It could be argued that any binary-based order is not acceptable to the ultimate user of such a system, however in most cases, the order of data returned by the database is critical to internal processing of the application server or client system, and not necessarily the order of the data presented to the user.