RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Carl W. Brown
Date: Tue May 29 2001 - 18:02:28 EDT


UTF-8s is essentially a way to ignore surrogate processing. It allows a
company to encode UTF-16 with UCS-2 logic.
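
To make "UTF-16 with UCS-2 logic" concrete, here is a minimal Python sketch,
assuming UTF-8s simply applies the ordinary UTF-8 byte pattern to each 16-bit
UTF-16 code unit, surrogates included. The function name encode_utf8s is
hypothetical and is not taken from any proposal text.

```python
import struct


def encode_utf8s(text: str) -> bytes:
    """Encode text by treating every UTF-16 code unit as a UCS-2 character.

    Assumption: this mirrors the UTF-8s idea discussed in the thread; a
    supplementary character becomes two 3-byte sequences, one per surrogate.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for (unit,) in struct.iter_unpack(">H", utf16):
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


# U+10400 is outside plane 0; its UTF-16 form is the pair D801 DC00.
print(encode_utf8s("\U00010400").hex(" "))    # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" "))  # f0 90 90 80
```

Standard UTF-8 yields one 4-byte sequence for this character, while the
sketch yields two 3-byte sequences that only make sense as a pair.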

The problem is that by not implementing surrogate support you can introduce
subtle errors. For example, it is common to break buffers apart into
segments. These segments may be reconcatenated but they may be processed

For example, suppose you break a buffer apart and translate each segment to
a code page. What happens when a non-plane 0 character has been encoded as
two UTF-8s surrogates and the segment boundary falls between the two
surrogates? Your buffer will contain an incomplete character.
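
The failure can be reproduced with a short Python sketch. It assumes the
UTF-8s byte layout is one ordinary 3-byte UTF-8 sequence per surrogate; the
helpers decode_utf8s and try_decode are hypothetical names used only for
illustration.

```python
from typing import Optional


def decode_utf8s(data: bytes) -> str:
    """Decode assumed UTF-8s bytes: read code units, then pair surrogates."""
    units = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so that a lone surrogate is rejected.
    return units.encode("utf-16-be", "surrogatepass").decode("utf-16-be")


def try_decode(segment: bytes) -> Optional[str]:
    """Return the decoded text, or None if the segment is not decodable."""
    try:
        return decode_utf8s(segment)
    except UnicodeDecodeError:
        return None


# "A", U+10400 as two encoded surrogates, then "B".
buffer = b"A\xed\xa0\x81\xed\xb0\x80B"
whole = try_decode(buffer)       # decodes to "A\U00010400B"

# Cut the buffer exactly between the two encoded surrogates.
first, second = buffer[:4], buffer[4:]
print(try_decode(first))         # None
print(try_decode(second))        # None
```

Each half is a run of well-formed 3-byte sequences, so UCS-2 style logic
sees nothing wrong until the lone surrogate has to be translated.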

The applications will find intermittent translation failures that may be
very hard to isolate.

The code to ensure that the two UTF-8s surrogate characters stay together
could be more than it takes to process UTF-16. It is certainly harder to
Q.A. such software.
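
As a rough illustration of that extra code, here is a Python sketch of a
split-point adjuster. It assumes the UTF-8s layout of one 3-byte sequence per
surrogate (ED A0..AF xx for a high surrogate, ED B0..BF xx for a low one);
safe_split_point is a hypothetical helper, not code from any product.

```python
def safe_split_point(data: bytes, pos: int) -> int:
    """Move a proposed split position back so no character is cut in two."""
    # Step 1 (needed for plain UTF-8 as well): do not split inside a
    # multi-byte sequence, so back up past any continuation bytes.
    while 0 < pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    # Step 2 (the UTF-8s-only part): if an encoded high surrogate ends at pos
    # and an encoded low surrogate starts at pos, back up over the high one.
    if (3 <= pos <= len(data) - 3
            and data[pos - 3] == 0xED and 0xA0 <= data[pos - 2] <= 0xAF
            and data[pos] == 0xED and 0xB0 <= data[pos + 1] <= 0xBF):
        pos -= 3
    return pos


buffer = b"A\xed\xa0\x81\xed\xb0\x80B"
print(safe_split_point(buffer, 4))  # 1: moved back before the whole pair
print(safe_split_point(buffer, 7))  # 7: already on a character boundary
```

Every place that segments a buffer would need the second step, and each one
is a separate path to test.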


-----Original Message-----
From: []On
Behalf Of Kenneth Whistler
Sent: Tuesday, May 29, 2001 11:18 AM
Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Doug wrote:

> UTF-8 and UTF-32 should absolutely not be similarly hacked to maintain
> some sort of bizarre "compatibility" with the binary sorting order of UTF-16.

> UTC should not, and almost certainly will not, endorse such a proposal on
> the part of the database vendors.

I would be loath to take such a sanguine attitude, however.

The UTC first took up this issue formally on April 29, 2000
(document L2/00-139R) when the UTF in question was referred to
as "UTF-8-16". At that time there was some discussion, a great
deal of it opposed to the introduction of another UTF. At that
time, the Peoplesoft representative was tasked to go off and
"summarize the database issues" that underlay the proposal.

After much delay, the issue resurfaced in this last UTC meeting,
as UTF-8S, with some of the concerns addressed and more background
presented about the database performance issues that have been
driving the proposal.
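
The binary sorting order compatibility that Doug's quoted text refers to can
be illustrated with a small Python sketch. It assumes UTF-8s encodes each
UTF-16 code unit as its own UTF-8 sequence; encode_utf8s is a hypothetical
helper, and the two sample characters are arbitrary choices.

```python
def encode_utf8s(text: str) -> bytes:
    """Assumed UTF-8s: the UTF-8 byte pattern applied per UTF-16 code unit."""
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = "\uFF5E"       # a plane 0 character above the surrogate range
supp = "\U00010400"  # a supplementary character (UTF-16: D801 DC00)

order_utf16 = sorted([bmp, supp], key=lambda s: s.encode("utf-16-be"))
order_utf8 = sorted([bmp, supp], key=lambda s: s.encode("utf-8"))
order_utf8s = sorted([bmp, supp], key=encode_utf8s)

print(order_utf16 == [supp, bmp])  # True: D8.. sorts before FF..
print(order_utf8 == [bmp, supp])   # True: EF.. sorts before F0..
print(order_utf8s == order_utf16)  # True: ED.. sorts before EF..
```

Under these assumptions a binary comparison of UTF-8s bytes gives the same
order as a binary comparison of UTF-16, while standard UTF-8 follows code
point order instead.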

*This* time the reception was not as hostile as a year ago, with
something like a 50/50 split in the committee, and with claims
forwarded in committee that "there is near-universal agreement
among the database vendors", with the noted exception of NCR.
There was a consensus to take no action now, and the Oracle and
Peoplesoft representatives were tasked to make further revisions
and perhaps bring in database specialists to discuss the implementation

The point is that while the UTC did not endorse this proposal as
of May 23, 2001, the pressure to create a UTF-8S is rising, and there
is no guarantee that the UTC will not sway to such support in
the future, despite the logic of the arguments presented against

--Ken Whistler

> -Doug Ewell
> Fullerton, California

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT