Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 30 2001 - 13:49:34 EDT


Doug Ewell wrote:

> > The proponents of UTF-8S are
> > vigorously and actively campaigning for their proposal. In
> > standardization committees, proposals that have committed, active
> > proponents who can aim for the long haul, often have a way of getting
> > adopted in one form or another, unless there are equally committed
> > and active opponents of the proposal. It is just the nature of
> > consensus politicking in these committees, whether corporate based
> > or national body based.
>
> I hate to say it, but this is really damaging my faith in the standardization
> process. I would like to think the UTC would take one look at the UTF-8S
> proposal and weigh its basic principles heavily against the big-corporation
> factor. Do the opponents of UTF-8S need to hire Johnnie Cochran or a
> top-flight advertising agency to balance their clout against Oracle's?

You also need to understand that there is a kind of senatorial
collegiality that sets in in standardization committees. People
who volunteer on these committees tend to specialize and hang around
for years, so you end up seeing the same people again and again.
The person whose proposal you despise today may be the same person
who supports your own proposal tomorrow. It is neither politic nor
socially adept to burn all your bridges and line up a bunch of
long-term enemies over any one particular item you [the committee
member] may feel strongly about. That can backfire severely in the
long run.

The senatorial atmosphere has the added advantage that it tends to
slow down controversial proposals. Since everyone is seeking consensus,
if you run into the committee with a problematic proposal, it
can stir controversy and fail to reach consensus (unless nobody is
paying attention), and that can drag on literally for years if nothing
is done to actively develop a consensus.

So no, the UTC doesn't need to hire Johnnie Cochran to counter
Oracle's clout, as long as Oracle can't mount the votes to break
the threat of a filibuster. ;-)

>
> Why should it be different for new encoding forms? UTF-16 was invented
> because there was a need to address more than 65,536 code points within a
> 16-bit framework, and the segment-offset model seemed most sensible. Was
> there any one company that "pushed the proposal" for UTF-16?

Actually, I think the approval bar that the UTC sets for major
architectural changes is higher than for new script proposals. Such
things engender far more spirited and lengthy debate in the UTC, and
I cannot recall many significant architectural changes to the standard
that were not the result of near unanimous consent among the voting
members in UTC. Certainly UTF-16, for example, was such an instance.
In that case the U.S. committee, X3L2, took the lead on the development.

The key paper was X3L2/93-174, "Proposal for UCS-2E (Modified
Proposal for Extended UCS-2)", authored by John Jenkins, then
of Taligent, dated 23 Sept. 1993. This proposal contained the
surrogate pair mechanism, even defined in the same ranges,
D800..DBFF and DC00..DFFF, although the planes defined were
planes D2..E1, instead of 01..10. Later the planes were moved
down to 01..10 and the proposal renamed UTF-16, before it was
formally balloted for 10646. The key vote of support took place
on November 25, 1993. Supporting the proposal were: Apple,
HP, IBM, Microsoft, RLG, Taligent, and Unisys. Digital abstained.
SHARE was not present, but minuted its opposition to the proposal.
*That's* where UTF-16 came from, and as you can see, it was the
result of near unanimous consent among the voting members of L2.
Subsequent decision points in the UTC and L2 regarding formal
balloting of the Amendment for UTF-16 were also nearly unanimous.
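
For readers who have not worked with the surrogate ranges mentioned
above, here is a minimal sketch of the mapping as it ended up in
UTF-16 (planes 01..10), not the earlier UCS-2E plane assignment. The
function names are made up for illustration.

```python
def to_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) surrogate pair, as UTF-16 does."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # 20-bit offset into planes 01..10
    high = 0xD800 + (offset >> 10)     # top 10 bits  -> D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)    # low 10 bits  -> DC00..DFFF
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

The two 1024-value ranges together address 1024 x 1024 code points,
which is exactly the sixteen planes 01..10.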
 
> What's wrong with asking the database vendors to refine their notion of
> "sort" so the sorting comes out right for Unicode?

Nothing. I and a number of other respondents have said as much.

And don't lump all the database vendors together here. This is
clearly an Oracle proposal, seconded by some application vendors.
Some database vendors oppose the proposal. Others are weighing
their internal interests carefully before coming down clearly on
one side or the other.
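
To make "refining the notion of sort" concrete, the sketch below
assumes the complaint is the usual one: comparing UTF-16 code units
in plain binary order puts supplementary characters (whose units are
D800..DFFF) before U+E000..U+FFFF, while UTF-8 byte order follows
code point order. The helper names are hypothetical, and this is one
possible fix-up, not a description of what any database actually does.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def code_point_key(s):
    """Sort key over UTF-16 code units that yields code point order.

    Surrogates (D800..DFFF) are rotated above E000..FFFF, so that a
    unit-by-unit comparison agrees with UTF-8 binary order.
    """
    key = []
    for u in utf16_units(s):
        if u >= 0xE000:
            u -= 0x800       # E000..FFFF -> D800..F7FF
        elif u >= 0xD800:
            u += 0x2000      # D800..DFFF -> F800..FFFF
        key.append(u)
    return key


words = ["\U00010000", "\uFF5E"]
by_utf16 = sorted(words, key=utf16_units)       # raw code unit order
by_fixed = sorted(words, key=code_point_key)    # code point order
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
```

Here `by_fixed` and `by_utf8` agree with each other, while `by_utf16`
puts U+10000 ahead of U+FF5E. The adjustment is a couple of compares
per code unit, which is the kind of cost the performance argument
would have to weigh.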
 
> The database programmer has told me repeatedly that Informix
> can't sort in anything other than straight binary order. Why not? Is it
> less powerful than (heh heh) my C program running at the command prompt?

This is a matter of rather complex internal design issues that the
database vendors face to get queries to run blindingly fast against
data pages. I'm not apologizing for tradeoffs that then may cause
problems in presenting data in user-preferred orders, but in the
enterprise database world, performance issues nearly always trump
concerns about internationalization niceties, like it or not.

Chances are your C program running at the command prompt wouldn't
do so well if it were being pounded on by queries originating from
3000 simultaneous connections from Wall Street traders. *hehe*

> But with UTF-8S, it
> looks like we are taking the worst feature of UTF-16 (the code points used by
> the surrogates), adding a pinch of laziness and a dash of complacency, and
> trying to sell the result to a UTC that really should have higher standards
> for its own creation.

You'll find no argument from me on that point. ;-)

> I would really hate to see what bizarre things might be proposed
> next if UTC sets a precedent by approving UTF-8S.

Plane 14 PUA usage description tags? Naaah, nobody would suggest such
a bizarre thing, would they?

--Ken
>
> -Doug Ewell
> Fullerton, California


