Re: UTF-8S (was: Re: ISO vs Unicode UTF-8)

Date: Wed May 30 2001 - 02:31:17 EDT

In a message dated 2001-05-29 11:20:48 Pacific Daylight Time,

> The point is that while the UTC did not endorse this proposal as
> of May 23, 2001, the pressure to create a UTF-8S is rising, and there
> is no guarantee that the UTC will not sway to such support in
> the future, despite the logic of the arguments presented against
> UTF-8S.

Gee, and here I thought Antoine was joking with his UTF-32S proposal. Maybe
that will be accepted, too.

The binary sort order of UTF-16:

    U+0000 -> U+D7FF
    U+10000 -> U+10FFFF
    U+E000 -> U+FFFF

is an ACCIDENT, an irrelevant side effect of the way UTF-16 happened to be
implemented. It should not serve as the basis for a new encoding form!
Collating the characters from U+E000 through U+FFFF *after* the supplementary
planes makes about as much sense as, say, starting with the rightmost hex
digit and working your way to the left, so that all characters U+xxx0 sort
before U+xxx1, etc. (Hey, you don't suppose...)
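
To make the ordering above concrete, here is a minimal sketch in Python that sorts one sample code point from each of the three ranges, first by scalar value and then by UTF-16 code units. The helper name utf16_units and the choice of sample values are mine, purely for illustration.

```python
# One sample code point from each of the three ranges listed above.
samples = [0xE000, 0x10000, 0xD7FF]

def utf16_units(cp):
    """Return the UTF-16 code units for a code point as a tuple of ints."""
    if cp < 0x10000:
        return (cp,)
    v = cp - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))

# Order by scalar value (code point order).
by_scalar = sorted(samples)

# Order by UTF-16 code units, which is what a binary compare of UTF-16 sees.
by_utf16 = sorted(samples, key=utf16_units)

print([f"U+{cp:04X}" for cp in by_scalar])
print([f"U+{cp:04X}" for cp in by_utf16])
```

The supplementary code point moves ahead of U+E000 in the second list only because its leading surrogate (0xD800) happens to be numerically smaller than 0xE000.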

kenw later wrote:

> The proponents of UTF-8S are
> vigorously and actively campaigning for their proposal. In
> standardization committees, proposals that have committed, active
> proponents who can aim for the long haul, often have a way of getting
> adopted in one form or another, unless there are equally committed
> and active opponents of the proposal. It is just the nature of
> consensus politicking in these committees, whether corporate based
> or national body based.

I hate to say it, but this is really damaging my faith in the standardization
process. I would like to think the UTC would take one look at the UTF-8S
proposal and weigh its basic principles heavily against the big-corporation
factor. Do the opponents of UTF-8S need to hire Johnnie Cochran or a
top-flight advertising agency to balance their clout against Oracle's?

Let's look at the process of proposing new characters or scripts for
Unicode/10646. There is a standard form that must be filled out, of course,
but that's just the beginning. The proponent must justify the inclusion of
the new characters, perhaps with actual examples or with bibliographic
citations. "Contact with the user community" is certainly considered a plus.
Basically, you need to be able to show that your characters are useful for
writing text in some form and that they fill a genuine need. As far as I can
tell, there is no "economic justification" blank to fill out in which you
state that the proposed characters should be added so that your company will
make more money.

Why should it be different for new encoding forms? UTF-16 was invented
because there was a need to address more than 65,536 code points within a
16-bit framework, and the segment-offset model seemed most sensible. Was
there any one company that "pushed the proposal" for UTF-16?

UTF-8 was invented for a specific product (Plan 9), but it addressed a
widespread need for an 8-bit-compatible encoding that extended far beyond the
private needs of Bell Labs. Indeed, a previous attempt had been made (UTF-1)
but it was found inadequate in certain regards, and UTF-8 improved on it.

UTF-32 needs no explanation. If you have the luxury of allocating four bytes
per character, UTF-32 is clearly the most straightforward way to do it. Even
the BE/LE and BOM/non-BOM variations were not driven by one company alone for
its own profit, although they are sometimes represented as such.

What's wrong with asking the database vendors to refine their notion of
"sort" so the sorting comes out right for Unicode? The standard library for
the C language includes a function called "qsort" to which the programmer
must supply a comparison function, so the qsort algorithm knows what it means
for one element to be "less than" or "greater than" another. This is
available even on compilers for the lowly, reviled MS-DOS platform. Every
month at work, I take a set of files generated by an Informix database
running under HP-UX and re-sort the bloody things to at least a minimally
acceptable collation order, so that (e.g.) accented characters sort with
their unaccented counterparts. (There's more to it than that, but you get
the idea.) The database programmer has told me repeatedly that Informix
can't sort in anything other than straight binary order. Why not? Is it
less powerful than (heh heh) my C program running at the command prompt?
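
The caller-supplied comparison idea can be sketched as follows. This is Python rather than C, with functools.cmp_to_key standing in for the comparison function handed to qsort. The function names, the word list, and the "strip the accents" rule are my own assumptions for illustration; this is not a real collation algorithm and not the program I run at work.

```python
import unicodedata
from functools import cmp_to_key

def base_letters(s):
    """Decompose s and drop combining marks, so 'é' compares like 'e'."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def compare(a, b):
    """qsort-style comparison: returns negative, zero, or positive."""
    ka, kb = base_letters(a).lower(), base_letters(b).lower()
    if ka != kb:
        return -1 if ka < kb else 1
    # Tie-break on the original strings so the result is deterministic.
    return (a > b) - (a < b)

words = ["zebra", "école", "eclair", "Émile", "apple"]

# Straight binary (code point) order: accented words land after "zebra".
binary_order = sorted(words)

# Order defined by the comparison function: accented words sort with
# their unaccented counterparts.
friendly_order = sorted(words, key=cmp_to_key(compare))

print(binary_order)
print(friendly_order)
```

The sort routine itself never changes; only the definition of "less than" does, which is the whole point.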

I know there has to be a certain reasonable amount of pragmatism involved in
formulating standards. This was highlighted by the recent discussion about
how developers 10 years ago needed to be soft-sold on the idea of 16-bit
Unicode characters, and how UTF-16 grew out of that. But with UTF-8S, it
looks like we are taking the worst feature of UTF-16 (the code points used by
the surrogates), adding a pinch of laziness and a dash of complacency, and
trying to sell the result to a UTC that really should have higher standards
for its own creation.

I don't (and shouldn't) have the ability to pressure the UTC to approve a new
encoding form to make up for my inability to conform to the existing ones,
and neither should anyone else. Sorting UTF-16 so that the values are in
scalar order -- or sorting UTF-8 so that the values are in UTF-16 order -- is
not difficult at all. There's a lot about Unicode that's harder to implement
than this. I would really hate to see what bizarre things might be proposed
next if the UTC sets a precedent by approving UTF-8S.
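
As a rough illustration of how little is involved, here is a sketch of both directions in Python. The first key remaps UTF-16 code units so that an ordinary comparison yields scalar order; the second makes UTF-8 byte strings sort in UTF-16 binary order by transcoding, which is the simplest approach rather than the fastest. All function names are hypothetical, and well-formed input is assumed.

```python
def to_utf16_units(s):
    """Return the UTF-16 code units of a string as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def scalar_order_key(units):
    """Remap UTF-16 code units so plain comparison gives scalar order."""
    fixed = []
    for u in units:
        if 0xD800 <= u <= 0xDFFF:
            u += 0x2000   # surrogates move to the top: 0xF800..0xFFFF
        elif u >= 0xE000:
            u -= 0x0800   # U+E000..U+FFFF slide down: 0xD800..0xF7FF
        fixed.append(u)
    return tuple(fixed)

def utf16_order_key(utf8_bytes):
    """Key that sorts UTF-8 byte strings in UTF-16 binary order."""
    # Big-endian bytes compare the same way as the code units themselves.
    return utf8_bytes.decode("utf-8").encode("utf-16-be")

texts = ["\U00010000", "\uE000", "\uD7FF"]

# UTF-16 data, sorted into scalar order.
utf16_data = [to_utf16_units(t) for t in texts]
utf16_in_scalar_order = sorted(utf16_data, key=scalar_order_key)

# UTF-8 data, sorted into UTF-16 binary order.
utf8_data = [t.encode("utf-8") for t in texts]
utf8_in_utf16_order = sorted(utf8_data, key=utf16_order_key)

print(utf16_in_scalar_order)
print([b.decode("utf-8").encode("unicode_escape") for b in utf8_in_utf16_order])
```

Either key function is a few lines long and leaves the stored data in its standard encoding form.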

-Doug Ewell
 Fullerton, California
