Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Peter_Constable@sil.org
Date: Sat May 26 2001 - 18:43:20 EDT


>If you think something abominable is happening, please raise a loud voice
>and flood UTC members with e-mail and tell everyone what you think and why
>you think it. Nobody can hear you when you mumble.
>
>And it helps if you have solid technical and philosophical arguments to
>convey.

Well, I wasn't going to elaborate (just been through this elsewhere)...

The Unicode flavour of UTF-8 allows sequences of at most four code units
to represent a Unicode character, in contrast to ISO's six. The difference
comes from Unicode having limited its codespace to U+10FFFF, whereas ISO
10646 formally includes a codespace up to U+7FFFFFFF (though it will be
effectively restricting use to U+10FFFF).
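
(For concreteness, here's a small Python sketch of my own -- not taken from
either standard -- showing how the sequence length grows with the code
point; the five- and six-byte ranges are the ISO-only part:)

def utf8_len(cp):
    """Number of UTF-8 bytes needed for the code point cp."""
    if cp < 0x80:      return 1
    if cp < 0x800:     return 2
    if cp < 0x10000:   return 3
    if cp < 0x200000:  return 4   # Unicode stops at U+10FFFF, inside this range
    if cp < 0x4000000: return 5   # ISO 10646 only
    return 6                      # ISO 10646 only, up to U+7FFFFFFF

print(utf8_len(0x10FFFF))    # 4 -- Unicode's ceiling
print(utf8_len(0x7FFFFFFF))  # 6 -- ISO 10646's formal ceiling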

BUT...

>There was another abomination proposed. Oracle rather than adding UTF-16
>support proposed that non plane 0 characters be encoded to and from UTF-8
>by encoding each of the surrogate pairs into a separate UTF-8 character.

Yes, Oracle, PeopleSoft and SAP submitted a proposal to UTC to sanction
another encoding form, UTF-8S, that would encode supplementary-plane
characters as six bytes: three for the UTF-16 high surrogate and three for
the low surrogate. The rationale had to do with having an 8-bit encoding
form that would "binary" sort in the same way as UTF-16.

(Warning: This gets a bit long. I'm doing this because I was advised not to
mumble but to speak up. :-)

The issue is this: Unicode's three encoding forms don't sort in the same
way when sorting is done using that most basic and
valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
binary values of code units. The three give these results:

UTF-8: (U+0000..U+D7FF), (U+E000..U+FFFF), (supplementary: U+10000..U+10FFFF)
UTF-16: (U+0000..U+D7FF), (supplementary: U+10000..U+10FFFF), (U+E000..U+FFFF)
UTF-32: (U+0000..U+D7FF), (U+E000..U+FFFF), (supplementary: U+10000..U+10FFFF)
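
(If you want to see the difference for yourself, here's a quick Python
check I put together -- my own illustration, nothing from the proposal.
Comparing the big-endian byte sequences amounts to the same thing as
comparing code units here:)

supp = chr(0x10000)   # first supplementary-plane character
bmp  = chr(0xE000)    # first BMP character above the surrogate block

for form in ('utf-8', 'utf-16-be', 'utf-32-be'):
    rel = '<' if supp.encode(form) < bmp.encode(form) else '>'
    print(form, ': U+10000', rel, 'U+E000')

# utf-8 : U+10000 > U+E000      (code point order)
# utf-16-be : U+10000 < U+E000  (surrogate values D800..DFFF fall below E000)
# utf-32-be : U+10000 > U+E000  (code point order)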

The proposers see this as a problem: if you have data from one source in
UTF-8 and another in UTF-16 and you sort the two, you'd like to be able to
compare results from each source and know that you're sorting things that
are comparable. With a UTF-8 variation ("UTF-8S", in which
supplementary-plane characters are mapped first to UTF-16 surrogates and
from there to 8-bit code unit sequences), the resulting ordering is:

UTF-8S: (U+0000..U+D7FF), (supplementary: U+10000..U+10FFFF), (U+E000..U+FFFF)
UTF-16: (U+0000..U+D7FF), (supplementary: U+10000..U+10FFFF), (U+E000..U+FFFF)
UTF-32: (U+0000..U+D7FF), (U+E000..U+FFFF), (supplementary: U+10000..U+10FFFF)
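
(Here's a sketch -- mine, not text from the proposal -- of the mechanism
as I understand it: map the supplementary-plane character to its UTF-16
surrogate pair, then put each surrogate through the ordinary three-byte
UTF-8 pattern. Because the resulting lead byte is 0xED, which falls below
the 0xEE that introduces U+E000, the six-byte sequences land where UTF-16
would put them:)

def utf8s(cp):
    """Encode a code point the way the UTF-8S proposal describes (as I read it)."""
    if cp < 0x10000:                  # BMP: identical to ordinary UTF-8
        return chr(cp).encode('utf-8')
    cp -= 0x10000
    hi = 0xD800 + (cp >> 10)          # high surrogate
    lo = 0xDC00 + (cp & 0x3FF)        # low surrogate
    return (chr(hi) + chr(lo)).encode('utf-8', 'surrogatepass')

print(utf8s(0x10000).hex(' '))                  # ed a0 80 ed b0 80  (six bytes)
print(chr(0x10000).encode('utf-8').hex(' '))    # f0 90 80 80  (real UTF-8: four bytes)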

The implication would be that we'd have *two* 8-bit encoding forms.

A suggestion was made on the unicore list by one of the proponents that all
encoding forms should "binary sort" in the same way; I found it surprising
that they were proposing a variant of UTF-8 rather than UTF-16, since
UTF-16 was the odd one out, and tweaking UTF-8 still leaves one encoding
form that "binary sorts" differently: UTF-32. Well, I was aghast when I
read the actual proposal to see that, not only are they suggesting that we
have a second 8-bit encoding form, "UTF-8S", but they also want to have
another 32-bit encoding form "UTF-32S". So, we'd end up with a total of 5
encoding forms: UTF-8, UTF-8S, UTF-16, UTF-32, UTF-32S.

According to the proposal, UTF-8S and UTF-32S would not have the same
status: they wouldn't be for interchange; they'd just be for representation
internal to a given system, like UTF-EBCDIC (which, I think I heard, has
not actually been implemented by IBM in any live systems).

What I don't get is this: if you want to implement something just inside
your own system and you say you'll make sure nobody else ever sees it,
why do you need UTC to sanction it in any way?

The crux of the justification offered by Oracle et al is this argument,
which appears to me to be fallacious:

<quote>
Specifically, any system that must deal with the indexing and comparison of
large collections of data across multiple encodings will run into the issue
that data that is ordered based on the binary values in one encoding will
no longer be ordered such once transformed into another encoding.  While
this lack of binary ordering compatibility across different encodings is
very true and well-understood in the world of legacy encodings (such as a
transcode of Shift-JIS to EUC-JP), given that all the Unicode
Transformation Forms are maintained by a single committee, it should be
possible to come up with a common binary order between each of the three
main Unicode Transformation Forms.
</quote>

Summarising:

Premise A: the three main Unicode encoding forms are maintained by a single
committee
Claimed implication C: it should be possible to come up with a common
binary order between each of the three Unicode encoding forms

There is a missing and implied premise that is needed to make the
implication work:

Premise B: encoding forms maintained by a single committee should all yield
a common binary order.

This argument seems to me to be faulty in at least two ways:

First, it is clearly counterexemplified by existing situations
- e.g. existing Microsoft codepages (the euro binary sorts before the
ellipsis in cp1252 but after it in cp1251; see the quick check after this
list)
- I'm sure it wouldn't be hard to produce counterexamples from the work of
JTC1/SC2
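
(The codepage example is easy to verify; here's a quick check against
Python's codec tables, for anyone who wants the byte values:)

for cp in ('cp1252', 'cp1251'):
    euro = '\u20ac'.encode(cp)    # EURO SIGN
    ell  = '\u2026'.encode(cp)    # HORIZONTAL ELLIPSIS
    print(cp, euro.hex(), ell.hex(),
          'euro sorts first' if euro < ell else 'ellipsis sorts first')

# cp1252 80 85 euro sorts first
# cp1251 88 85 ellipsis sorts first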

Now, I think Oracle et al would offer as a rebuttal that Premise B was not
true in the past, but that it should be for Unicode. They offer no real
argumentation as to why this should be the case, though. They simply assume
that this will be easier for everyone (not at all obvious). I'll revise
Premise B to reflect this:

Premise B (rev'd): Encoding forms maintained by UTC should all yield a
common binary order.

This is important, and will come back up later.

Secondly, the argument presupposes that the desired result as described in
C is (i) possible and (ii) achieved by their proposal. Their proposal does
not cause the three main encoding forms to yield a common binary order;
what their proposal does is introduce two new encoding forms (and give them
a somewhat ambiguous status) that will share a common binary order with
UTF-16. The existing encoding forms remain, and continue not to share a
common binary order. I think it is self-evident that the desired result
can, in fact, only be achieved in one of the following ways:

Drastic Measure 1: make UTF-16 obsolete; replace it with UTF-16a which
binary sorts supplementary-plane characters after U+E000..U+FFFF

Drastic Measure 2: make UTF-8 and UTF-32 obsolete; replace them with
UTF-8S and UTF-32S.

These are, of course, impossible (and fortunately Oracle et al are not
proposing either of these).

Now, I've argued against a strict interpretation of what they said. Let's
consider the spirit of what they're saying: to store data internally using
UTF-8S or UTF-32S so that they can compare binary-sort results with data
sources encoded using UTF-16. The proposal involves encoding forms with
very ambiguous, quasi-official status:

<quote>
This paper... proposes to add at least one optional additional UTF, in the
form of a Unicode Technical Report (UTR).  This form could be implemented
by system designers where the benefit of a common binary ordering across
different UTFs is important from a performance standpoint, or for other
reasons.  The new UTF(s) would have equivalent standing as the UTF-EBCDIC
transformation currently maintained in UTR#16.  It is not proposed that the
new transformation form(s) become Standard Annexes (UAX), nor would they be
proposed for inclusion in ISO 10646.
</quote>

For the sake of argument, I'll ignore this for the moment. They offer some
usage scenarios; I'll quote an excerpt of only the first (the other adds
nothing new to the argument for or against):

<quote>
UTF-8 database server <-> UTF-16 database client

A SQL statement executed on the database server returns a result set
ordered by the binary sort of the data in UTF-8, given that this is the
encoding of both data and indexes in the database.

A C/C++ or Java UTF-16-based client receives this result set and must
compare it to a large collection of data stored locally in UTF-16...
</quote>

This has to assume a closed system in which the server and client are
proprietary solutions using proprietary protocols for their interaction. I
say this because both are assuming Unicode is always binary sorted in an
order that results from UTF-16, and that's a proprietary assumption. To
make it otherwise either would require obsoleting UTF-8 and replacing it
with UTF-8S, or else would require making UTF-8S an *official Unicode
standard* protocol. So, they can't waffle on the status. If they don't want
real official standard status for this, then so much for open solutions in
which my client can talk to your server, or vice versa.

If they want to do this in a closed system, they can already just go ahead
and do it; they don't need UTC to give permission for what they do inside
their own systems. By proposing that this be documented and given a name,
evidently they want to be able to share the assumptions involved with
others, i.e. do this in an open context. Thus, even if they don't call it a
"standard Unicode encoding form", they're trying to treat it as such. So,
it seems to me that this proposal really is asking us to create new,
standardised encoding forms that need to be documented as UAXs. Either that
or to adopt Drastic Measure 1 or 2. (I don't think DM1/2 would be
considered for a moment by anybody, and Oracle et al explicitly rule that
out in their proposal.)

So, we're left with them asking us all to adopt a couple of additional
standard encoding forms. Do we really want five encoding forms (and eleven
encoding schemes: one each for UTF-8 and UTF-8S, plus BE, LE and unmarked
schemes for each of UTF-16, UTF-32 and UTF-32S)?

Even if we go along with this, there's still a problem: that UTF-8 DB
server in the scenario above (assuming a non-closed system) might be using
true UTF-8, or UTF-8S. (I'm sure there must be existing implementations of
clients or servers using UTF-16 and of clients or servers using real
UTF-8.) So, the proposal requires not only two new encoding forms; in
addition, the following are also necessary:

- A way to communicate between client and server what binary sorting
assumptions are being made.
- Both the client and server *still* need to be able to handle the
situation in which one is using UTF-16 and the other is using true UTF-8.

So, the proposed solution (following the spirit of the proposal) doesn't
eliminate the problem.

To summarise:

- Oracle et al want UTC to sanction two new encoding forms.

- These encoding forms would supposedly have some kind of ambiguous,
quasi-official status.

- Making the proposal accomplish anything in open systems really requires
that the encoding forms have official standard status.

- Even so, the proposal does not eliminate the problem that it is supposed
to be addressing.

- The problem as stated (assuming Premise B) cannot be eliminated in open
systems without taking very drastic and impossible measures.

- The problem can be solved in closed systems without needing new encoding
forms sanctioned by UTC.

The whole basis of the problem hinges on Premise B. If we maintain Premise
B, then we end up with a situation that can in principle only be solved in
closed systems and, as such, doesn't require any new UTC-sanctioned
encoding forms (with whatever status). The attempt to solve the problem
does not, in fact, eliminate the problem, and it gives us new encoding
forms to worry about.

The alternative is to reject Premise B. That seems to me to be *a whole
lot* cleaner and easier.

The main point seems to be that Oracle et al want to maintain Premise B,
presumably because they think it would make things easier. Yet I think I've
shown that it wouldn't, both because it creates new encoding forms to deal
with, and because we still have to deal with the reality that the original
encoding forms still exist. Now, I'm not a database developer, so I need to
be careful since I can't presume to know the particular implementation
needs of such environments. But it seems to me that we've lived without
Premise B in the past, and that it won't benefit us to adopt it now. Why
bother with it? Why not continue doing what we already know how to do?

The only possible answer I can think of is concern that, in the case of
Unicode, some implementers may *assume* Premise B to be true. Our
options, therefore, are twofold: to make Premise B in fact true -- but
we've seen that that's the harder road and doesn't benefit us after all --
or to make people understand that Premise B is false. People already need
to learn about Unicode in order to implement it; why can't they also learn
that Premise B is false? (This seems too easy; I must be missing
something.)

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


