Re: UTF-8 Syntax

From: toby_phipps@peoplesoft.com
Date: Fri Jun 08 2001 - 19:11:14 EDT


As one of the proponents of the UTF-8S proposal, I feel compelled to
respond to some of the recent comments regarding the proposal on the
unicode and unicore lists. Although there have been some good comments
about how the goals of the proposal could be accomplished without a new
encoding form, there have also been numerous arguments against UTF-8S,
ranging from the simply unprofessional (the WTF thread) through to the
blatantly false (encoding doesn't imply a collation). Let me address each
of the comments individually. There has been a lot of talk about the
UTF-8S proposal on both the unicode and unicore lists, so please forgive
me (and notify me if you feel the need) if I have missed any of the
salient points that require a response.

--
Toby Phipps
PeopleTools Product Manager - Global Technology
PeopleSoft, Inc.
tphipps@peoplesoft.com

1. UTF-8S doesn't need to be "accepted" or "approved" by the UTC, as its use is within a proprietary, closed system.

Nothing could be further from the truth. Just look at which companies are pushing the proposal (Oracle, SAP, PeopleSoft). These organizations all share the same technological issue, yet they are direct competitors. We share a common technology - that of large SQL databases and, in the case of PeopleSoft and SAP, heterogeneity across many different SQL databases. We need a commonly understood UTF-8 encoding that can be used as a database encoding, an in-memory encoding, and in other "internal" forms, but that can at the same time be passed between systems from different vendors. PeopleSoft and SAP support a range of database platforms, including Oracle, Microsoft SQL Server, and IBM DB2. Communication between *applications* from one vendor and a *database* from another vendor is not a closed system.

2. An encoding form does not imply a collation

False. The most basic collation in any system is the binary order of the codepoints in their current encoding. That's what C gives you with the strcmp() function, what COBOL gives you with ">", and what Java gives you with its basic string classes. Even though the binary collation of each Unicode transformation makes no linguistic sense, developers all over the world use binary string comparisons to optimize code, especially when dealing with huge volumes of data. Looking at PeopleSoft's tens of millions of lines of code alone, the great majority of our collation-dependent comparisons (i.e. comparisons returning more information than simple equivalence) are used for performance and optimization.

There are most definitely cases where we need a linguistic comparison, and we have the appropriate syntax in each of our languages (except COBOL) to deal with this. However, these cases are rare, and typically the developer is aware that they are performing a collation whose result will be visible to the user and therefore needs to be in linguistic order.

Given the proliferation of UTF-16-based programming languages (Java, Microsoft Win32 C/C++, increasing numbers of non-Win32 C compilers), the combination of a UTF-16-based database client communicating with a UTF-8-based database server is common. Without UTF-8S (and, to a lesser extent, UTF-32S) as a database encoding, creating a single, portable database client in a UTF-16-based language environment that can operate against a database backend encoded in any of the Unicode transforms would be very difficult. Introducing an alternative database encoding along the lines of UTF-8S would allow the same UTF-16-based client application to operate against either a UTF-8S or UTF-16 database without change.
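To make the binary-order mismatch concrete, here is a small illustrative sketch (Python, chosen purely for demonstration; UTF-8S is simulated here by encoding each UTF-16 code unit individually as UTF-8). A BMP character such as U+FF5E sorts before a supplementary character in UTF-8 byte order, but after it in UTF-16 code-unit order, while the simulated UTF-8S form restores the UTF-16 order:

```python
# Illustrative sketch: binary sort order under UTF-8 vs. UTF-16 vs. a
# simulated UTF-8S (supplementary characters as two 3-byte surrogate
# sequences). Not a definitive implementation of the proposal.
def utf8s(s: str) -> bytes:
    """Simulate UTF-8S: encode each UTF-16 code unit individually as UTF-8."""
    units = s.encode("utf-16-be")
    code_units = [int.from_bytes(units[i:i + 2], "big")
                  for i in range(0, len(units), 2)]
    # "surrogatepass" lets lone surrogates through as 3-byte sequences
    return b"".join(chr(u).encode("utf-8", "surrogatepass")
                    for u in code_units)

bmp = "\uFF5E"         # U+FF5E, a BMP character above the surrogate range
supp = "\U00010000"    # U+10000, a supplementary (non-BMP) character

by_utf8 = sorted([bmp, supp], key=lambda s: s.encode("utf-8"))
by_utf16 = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))
by_utf8s = sorted([bmp, supp], key=utf8s)

print(by_utf8 == [bmp, supp])    # True: BMP char first in UTF-8 byte order
print(by_utf16 == [supp, bmp])   # True: surrogates sort low in UTF-16
print(by_utf8s == by_utf16)      # True: UTF-8S matches the UTF-16 order
```

A UTF-16 client and a UTF-8S server would thus agree on binary order without either side transcoding for comparison purposes.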

3. Vendors can't expect other encodings to collate the same in binary, so why expect this of the Unicode transforms?

This is true. We can't expect most other encodings to compare the same in binary. This often leads us to the situation where we only support servers and clients that share the same encoding. Before we supported Unicode, with a couple of exceptions (EBCDIC being one), this was the case at PeopleSoft - we required our servers and clients to share the same encoding. In reality, this wasn't a big deal for our customer base - there was very little utility in running a server in ISO 8859-2 and a client in ISO 8859-1. Only the lower 7 bits represent common characters (and were therefore usable), so the system might as well have been running in 7-bit ASCII. Where this did hurt was with the CJK encodings. We don't support running a Shift-JIS client against an EUC-JP database server. Binary collation is just one reason; expansion and contraction of character lengths is another. The implementation of Unicode across our systems fixed most of this problem. We changed our database column size quantities to be character-based, not byte-based, so the character length issue went away, and until real surrogates appeared on the scene with Unicode 3.1, we could rely on a common binary collation between the client and server tiers.
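The CJK divergence is easy to demonstrate (an illustrative Python sketch; the two characters are chosen only because they happen to expose the difference): a hiragana character and a halfwidth katakana character sort in opposite orders under Shift-JIS and EUC-JP byte comparison.

```python
# Illustrative sketch: the same two characters collate differently in
# binary under Shift-JIS and EUC-JP.
hira = "\u3042"   # HIRAGANA LETTER A (Shift-JIS 0x82A0, EUC-JP 0xA4A2)
kata = "\uFF71"   # HALFWIDTH KATAKANA LETTER A (Shift-JIS 0xB1, EUC-JP 0x8EB1)

sjis_order = sorted([hira, kata], key=lambda s: s.encode("shift_jis"))
eucjp_order = sorted([hira, kata], key=lambda s: s.encode("euc_jp"))

print(sjis_order == [hira, kata])   # True: 0x82.. sorts before 0xB1
print(eucjp_order == [kata, hira])  # True: 0x8E.. sorts before 0xA4..
```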

4. A database should be able to provide sorted output in any collation, not just the binary collation of its encoding

True. However, for most SQL databases (at least those that use sorted b-tree indexes, such as Oracle, Microsoft SQL Server, Sybase, DB2/UDB etc.), it is much faster and more efficient to provide data collated in the binary order of the database's encoding than in any other collation. Why? Because column indexes are stored on disk in binary-sorted order. To return a pre-ordered result set for a SQL query, the database simply has to do what's known as an "index-only scan". In this case, the values returned in the result set are read directly from the index, and the actual data blocks don't need to be fetched.
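The index-only scan can be observed in any b-tree-based engine; a small sketch using SQLite (chosen only because it is easy to demonstrate from a script - the engines named above behave analogously) shows the planner satisfying an ORDER BY straight from a covering index, with no separate sort step:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")
con.execute("CREATE INDEX emp_name ON emp (name)")

# The ORDER BY matches the index's (binary, by default) sort order, so
# the planner reads rows directly out of the index: an "index-only scan"
# (SQLite calls it a covering index scan).
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM emp ORDER BY name"
).fetchall()
print(plan)  # the plan detail mentions a COVERING INDEX, and no temp sort
```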

Of course, just about every database allows the result set to be returned in a collation other than the binary sort of the database's encoding. There are several ways of doing this. One is to sort the data in temporarily allocated memory. This is incredibly inefficient, not only because significant amounts of temporary space need to be allocated and freed, but also because the entire result set of the query has to be processed and sorted before the first row is returned. With result sets involving several million rows, this is a very significant overhead, especially if the typical user only looks at the first couple of hundred. So, some vendors allow the creation of additional indexes, sorted by a weighted collation key of the original value. This works well in practice; however, it still doesn't allow for "index-only scans" as in the binary collation case, because the index stores only the numerical collation key, and not the actual value. After fetching the row from the sorted index, the database must then fetch the actual data from the data block.
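A collation-key index of this kind can be sketched as follows (illustrative Python; the "weighted key" here is a toy umlaut-folding rule and the row values are hypothetical - real engines derive keys from a full collation algorithm). Note the second lookup into the data blocks for every returned row, which is exactly what rules out an index-only scan:

```python
# Illustrative sketch: an index that stores (collation_key, row_id)
# pairs instead of the values themselves.
data_blocks = {1: "Äpfel", 2: "Zebra", 3: "Apfel"}  # toy data, keyed by row id

def collation_key(value: str) -> str:
    # Toy "weighted" key: fold umlauts so "Ä" sorts alongside "A".
    # (Assumption for illustration only, not a real collation algorithm.)
    return value.replace("Ä", "A").replace("ä", "a")

# The index holds only keys and row ids, pre-sorted by key.
index = sorted((collation_key(v), rid) for rid, v in data_blocks.items())

# Returning the linguistically ordered values requires an extra fetch
# per row, because the actual value is not stored in the index.
result = [data_blocks[rid] for _key, rid in index]
print(result)  # ['Äpfel', 'Apfel', 'Zebra']
```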

Given this architecture (which is common across many SQL database platforms), the most efficient way to encode the database is to use an encoding whose binary representation of the data on disk matches the collation expected most often by the database's clients. In the case of a database with many UTF-16 clients, a database encoding of UTF-16 or UTF-8S would make sense.

5. Oracle is pushing this proposal as it makes it easier for them to support surrogates without changing their architecture

False. Oracle already supports UTF-8S (called UTF8 in their engine for historical reasons), true UTF-8, and UTF-16, all as core database encodings. Oracle gains little from having the UTF-8S encoding accepted as a UTR, other than a simple nomenclature to describe one of their supported encodings. It is the large-scale users of Oracle Unicode databases, such as SAP and PeopleSoft, who are strongly encouraging them to seek common industry acceptance of the UTF-8S transformation, for several reasons.

- We believe we won't be the only vendors with the requirement of equivalent binary sorts across different Unicode encodings. Ignoring non-BMP characters, we have this equivalence now, and I can confidently guess that the majority of database-backed Unicode systems today aren't using non-BMP characters, so their reliance on equivalent binary sorting has not yet become acutely obvious.

- We need some well-known way of describing the encoding of data in the database. This is important for discussions with our customers, documentation, and technical architecture disclosures. Without an accepted name such as UTF-8S, we'll forever be explaining that our internal data representation is "like UTF-8, but with individually encoded surrogate pairs". Why do people need to know what our internal database representation is? Because we'll be speaking it over database APIs (e.g. PeopleSoft applications to a host Oracle database). Application developers will see it in memory when they use our debugging tools. It may "leak" into debug or trace files when things go wrong.

6. The UTF-8S proposal is asking for a "quasi-standard" acceptance which we haven't seen before

False. The Unicode Consortium publishes the Unicode Standard (TUS) and several Unicode Standard Annexes (UAX) which comprise TUS. These are standardized components, and they share components (such as the UTF-8 transformation and the code allocations) with ISO 10646. In addition to TUS, the Unicode Consortium publishes Unicode Technical Reports (UTR). UTRs are intended to make life easier for implementors of TUS by providing common techniques for character representation, encoding, collation and more. There is absolutely no requirement for anyone to implement any component of a UTR in order to claim compliance with TUS. They are for guidance only.

We are proposing UTF-8S as the topic of a UTR. As such, there is no obligation for any implementor of Unicode to support such an encoding. There is nothing compelling the encoding to be registered in the IANA registry, or to be recognized by a web browser or XML parser. All we are asking is that the form of such an encoding be published and recognized, so that it can be referred to and used by implementors of the Unicode Standard who share the need for equivalent binary collation that we have identified - a need that is not specific to one organization.

This is very similar to the acceptance of UTF-EBCDIC as UTR #16. PeopleSoft is a big user of UTF-EBCDIC. We use it in our COBOL when it's running on an EBCDIC platform. We use it in trace files and dump files on our EBCDIC platforms. Do we expect it to be recognized in HTML? No. XML? No. The same is true for UTF-8S.



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT