RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Mon May 28 2001 - 17:36:40 EDT


Doug,

The problem with databases is that you have to have a locale-independent
sorting sequence. If you store a record with a key built with one locale,
you might not be able to retrieve it using another locale's sort sequence.
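
To make the point concrete, here is a minimal sketch in C of a binary key
comparison (the function name is only illustrative, not any particular
database's code). memcmp() over UTF-8 encoded keys gives the same ordering
no matter which locale built the index or reads it back, which is what a
database index needs:

#include <string.h>

/* Locale-independent key comparison: plain byte order over UTF-8 keys. */
static int key_compare(const unsigned char *a, size_t alen,
                       const unsigned char *b, size_t blen)
{
    size_t n = (alen < blen) ? alen : blen;
    int r = memcmp(a, b, n);
    if (r != 0)
        return r;
    return (alen > blen) - (alen < blen);   /* shorter key sorts first */
}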

The problem with Oracle is that they use both UCS-2 and UTF-8.

UTF-8 is simple to implement because you can reuse all the existing
multi-byte code page code with a different algorithm for character length
calculations. Thus it does not take much to implement UTF-8 support.
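
For example, the character length calculation can be read straight from the
lead byte. A minimal sketch in C (illustrative only, not Oracle's actual
code):

/* Length in bytes of a UTF-8 character, determined from its lead byte. */
static int utf8_char_len(unsigned char lead)
{
    if (lead < 0x80) return 1;     /* ASCII                    */
    if (lead < 0xC0) return -1;    /* continuation byte: error */
    if (lead < 0xE0) return 2;     /* U+0080 .. U+07FF         */
    if (lead < 0xF0) return 3;     /* U+0800 .. U+FFFF         */
    if (lead < 0xF8) return 4;     /* U+10000 .. U+10FFFF      */
    return -1;                     /* invalid lead byte        */
}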

The problem is that it is wasteful of space. For CLOBs, where the fields
are very large, allocating 4 bytes per character wastes space, so they used
UCS-2. Converting from UCS-2 to UTF-16 creates a sorting problem: UTF-16
keys and UTF-8 keys have different sorting sequences.
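
A small C program makes the difference visible. U+FF21 and U+10000 sort one
way as UTF-8 byte strings and the opposite way as UTF-16 code unit strings
(the encodings below are fixed by the standards; only the variable names are
illustrative):

#include <stdio.h>
#include <string.h>

int main(void)
{
    unsigned char  u8_ff21[]   = { 0xEF, 0xBC, 0xA1 };        /* U+FF21  */
    unsigned char  u8_10000[]  = { 0xF0, 0x90, 0x80, 0x80 };  /* U+10000 */
    unsigned short u16_ff21[]  = { 0xFF21 };                  /* U+FF21  */
    unsigned short u16_10000[] = { 0xD800, 0xDC00 };          /* U+10000 */

    /* UTF-8: lead byte 0xEF sorts before lead byte 0xF0. */
    printf("UTF-8 : U+FF21 %s U+10000\n",
           memcmp(u8_ff21, u8_10000, 3) < 0 ? "<" : ">");

    /* UTF-16: the high surrogate 0xD800 sorts before 0xFF21, so it flips. */
    printf("UTF-16: U+FF21 %s U+10000\n",
           u16_ff21[0] < u16_10000[0] ? "<" : ">");
    return 0;
}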

UTF-8s would have put surrogate support entirely into the hands of the
application.

Converting from UCS-2 to UTF-16 support is a lot of work because most
operations actually have to work with UTF-32. UTF-32, however, will match
the UTF-8 sorting sequence.
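
The core of that conversion work is folding a surrogate pair back into a
single UTF-32 code point. A minimal sketch in C (illustrative only):

/* Combine a UTF-16 surrogate pair into one UTF-32 code point.
 * hi must be 0xD800..0xDBFF and lo must be 0xDC00..0xDFFF; checks omitted.
 * Binary order of the resulting 32-bit values matches UTF-8 byte order. */
static unsigned long utf16_pair_to_utf32(unsigned short hi, unsigned short lo)
{
    return 0x10000UL + (((unsigned long)(hi - 0xD800) << 10) | (lo - 0xDC00));
}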

The UTF-8s "short cut" is, in the long run, a bad idea. It makes most
proper locale-based operations a real problem. It can also create storage
problems, because UTF-8s characters can be 50% larger than the corresponding
UTF-8 characters. Oracle does not appreciate the problem that clients have
in sizing fields with its current UTF-8 implementation. It would be worse
with UTF-8s. But no matter, since this is only an implementation problem,
not a database problem.
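
The 50% figure comes from the encoding itself: a supplementary character
takes 4 bytes in UTF-8 but 6 bytes under UTF-8s, because each half of the
surrogate pair is stored as its own 3-byte sequence. A minimal sketch in C
(illustrative function names):

/* Bytes needed for one code point in UTF-8. */
static int utf8_bytes(unsigned long cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Bytes needed under the UTF-8s proposal: supplementary characters become
 * two 3-byte sequences, one per surrogate, i.e. 6 bytes instead of 4. */
static int utf8s_bytes(unsigned long cp)
{
    return (cp < 0x10000) ? utf8_bytes(cp) : 6;
}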

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of DougEwell2@cs.com
Sent: Monday, May 28, 2001 3:30 AM
To: unicode@unicode.org
Cc: Peter_Constable@sil.org
Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and
email)

In a message dated 2001-05-26 16:00:47 Pacific Daylight Time,
Peter_Constable@sil.org writes:

> The issue is this: Unicode's three encoding forms don't sort in the same
> way when sorting is done using that most basic and
> valid-in-almost-no-locales-but-easy-and-quick approach of simply comparing
> binary values of code units. The three give these results:
>
> UTF-8: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)
> UTF-16: (U+0000 - U+D7FF), (surrogate), (U+E000-U+FFFF)
> UTF-32: (U+0000 - U+D7FF), (U+E000-U+FFFF), (surrogate)

First, everyone take a breath and say it out loud: "UTF-16 is a hack."
There, doesn't that feel better? Whether it is necessary, beneficial, or
unavoidable is beside the point. Using pairs of 16-bit "surrogates" together
with an additive offset to refer to a 32-bit value may be a clever solution
to the problem, but it is still a hack, especially when those surrogate
values fall in the middle of the range of normal 16-bit values as they do.

UTF-8 and UTF-32 should absolutely not be similarly hacked to maintain some
sort of bizarre "compatibility" with the binary sorting order of UTF-16.
Anyone who is using the binary sorting order of UTF-16, and thus concludes
that (pardon the use of 10646 terms here) Planes 1 through 16 should be
sorted after U+D7FF but before U+E000 is really missing the point of proper
collation. I would state the case even more strongly than Peter, to say that
such a collation order is valid in NO locale at all.

If developers expect to sort Unicode text in any meaningful way, they should
be using the Unicode Collation Algorithm (UAX #10). Using strict code point
order as a basis for sorting is generally not appropriate, and applying the
UTF-16 transformation as a further basis for sorting only compounds the
error.
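
As a minimal sketch of what that looks like in practice, assuming ICU's C
API as the UCA implementation (UAX #10 itself does not mandate a library):
binary code unit order puts "Banana" before "apple", while a locale's
collator puts "apple" first.

#include <unicode/ucol.h>
#include <unicode/ustring.h>
#include <stdio.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UCollator *coll = ucol_open("en_US", &status);  /* UCA-based collator */
    if (U_FAILURE(status)) return 1;

    UChar a[8], b[8];
    u_uastrcpy(a, "apple");
    u_uastrcpy(b, "Banana");

    UCollationResult r = ucol_strcoll(coll, a, -1, b, -1);
    printf("UCA   : apple %s Banana\n", r == UCOL_LESS ? "<" : ">=");
    printf("binary: apple %s Banana\n", u_strcmp(a, b) < 0 ? "<" : ">=");

    ucol_close(coll);
    return 0;
}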

UTC should not, and almost certainly will not, endorse such a proposal on the
part of the database vendors.

-Doug Ewell
 Fullerton, California


