Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Wed May 30 2001 - 19:46:29 EDT


Simon,

Would you care to answer (officially) why exactly Oracle needs anything
to be done here? Per the spec, it is not illegal for a process to interpret
5/6-byte supplementary characters; it is only illegal to emit them. It seems
that Oracle and everyone else is well covered with the existing standard and
spec:

1) Any time Oracle insists on emitting the incorrect form, a process can
accept it and read it. Thus everything will work.

2) Any time Oracle picks up legal and proper UTF-8, it will (per your words
below) read it correctly and once again things will work.

3) Any time someone emits the b

michka
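
To make the read-versus-emit distinction above concrete, here is a minimal Python sketch. The function names read_lenient and emit_strict are hypothetical, and the sketch assumes that the "incorrect form" means a supplementary character written as two 3-byte encoded surrogates (6 bytes in total). It illustrates the idea only and is not taken from any vendor's implementation.

```python
def read_lenient(data: bytes) -> str:
    """Accept standard UTF-8 as well as the 6-byte surrogate-pair form."""
    # 'surrogatepass' lets the UTF-8 decoder accept 3-byte encoded surrogates.
    text = data.decode("utf-8", "surrogatepass")
    # Round-trip through UTF-16 so that an adjacent high/low surrogate pair
    # is combined into one supplementary character. An unpaired surrogate
    # makes the final decode raise UnicodeDecodeError.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def emit_strict(text: str) -> bytes:
    """Emit only standard UTF-8 (4 bytes per supplementary character)."""
    return text.encode("utf-8")


six_byte = b"\xed\xa0\x80\xed\xb0\x80"  # U+10000 as D800 DC00, 3 bytes each
four_byte = b"\xf0\x90\x80\x80"         # U+10000 in standard UTF-8

# Both inputs are readable, and only the 4-byte form is ever written out.
print(emit_strict(read_lenient(six_byte)) == four_byte)
print(emit_strict(read_lenient(four_byte)) == four_byte)
```

A reader built this way handles cases 1 and 2 above without any change to the standard.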

----- Original Message -----
From: "Simon Law" <simon.law@oracle.com>
To: <unicode@unicode.org>
Sent: Wednesday, May 30, 2001 11:01 AM
Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

> Hi Folks,
>
> Over the last few days, this email thread has generated many interesting
> discussions on the proposal of UTF-8s. At the same time, some speculation
> has been generated on why Oracle is asking for this encoding form. I hope
> to clarify some of this misinformation in this email.
>
> In Oracle9i, our next Database Release shipping this summer, we have
> introduced support for two new Unicode character sets. One is 'AL16UTF16',
> which supports the UTF-16 encoding, and the other is 'AL32UTF8', which is
> the UTF-8 fully compliant character set. Both of these conform to the
> Unicode standard, and surrogate characters are stored strictly in 4 bytes.
> For more information on Unicode support in Oracle9i, please check out the
> whitepaper "The power of Globalization Technology" on
> http://otn.oracle.com/products/oracle9i/content.html
>
> The requests for UTF-8s came from many of our Packaged Applications
> customers (such as Peoplesoft, SAP, etc.); the ordering of the binary sort
> is an important requirement for these Oracle customers. We are supporting
> them and we hope to turn this into a TR such that UTF-8s can be referenced
> by other vendors when they need to have compatible binary order for UTF-16
> and UTF-8 across different platforms.
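
The binary-order mismatch behind this request can be illustrated with a short Python sketch. The helper utf8s below is hypothetical: it assumes UTF-8s means encoding each UTF-16 code unit, surrogates included, as its own UTF-8 sequence, which is how the proposal is described in this thread. The two sample characters are chosen only for illustration.

```python
def utf16_key(s: str) -> bytes:
    # Big-endian bytes compare in the same order as UTF-16 code units.
    return s.encode("utf-16-be")


def utf8_key(s: str) -> bytes:
    return s.encode("utf-8")


def utf8s(s: str) -> bytes:
    """Hypothetical UTF-8s encoder: each UTF-16 code unit becomes its own
    UTF-8 sequence, so a supplementary character takes 6 bytes."""
    units = s.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp_high = "\uff5e"           # BMP character above the surrogate range
supplementary = "\U00010000"  # first supplementary character
sample = [bmp_high, supplementary]

by_utf16 = sorted(sample, key=utf16_key)
by_utf8 = sorted(sample, key=utf8_key)
by_utf8s = sorted(sample, key=utf8s)

print(by_utf16 == by_utf8)   # False: the two binary orders disagree
print(by_utf16 == by_utf8s)  # True: UTF-8s reproduces the UTF-16 order
```

The disagreement only shows up when the data mixes supplementary characters with BMP characters in the range U+E000 to U+FFFF; everywhere else the two binary orders already match.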
>
> The speculation that we are pushing for UTF-8s because we are trying to
> minimize our code change for supporting surrogates, or because of our
> unique database design, is totally false. Oracle has a fully
> internationalized, extensible architecture and has introduced surrogate
> support in Oracle9i. In fact, we are probably the first database vendor to
> support both the UTF-16 and UTF-8 encoding forms. We will continue to
> support them and conform to future enhancements to the Unicode Standard.
>
> Regards
>
>
> Simon
>
> "Carl W. Brown" wrote:
>
> > Ken,
> >
> > I suspect that Oracle is specifically pushing for this standard because
> > of its unique data base design. In a sense Oracle almost picks itself up
> > by its own bootstraps. It has always tried to minimize actual code.
> > Therefore it was a natural choice to implement Unicode with UTF-8 because
> > it is easy to reuse the multibyte support with minor changes to handle a
> > different character length algorithm. This has been one of the reasons
> > that Oracle has been successful. Its tinker-toy-like design has enabled
> > them to quickly adapt and add new features. Now, however, they should
> > take the time to "do it right". Its UTF-8 storage creates problems for
> > database designers because they cannot predict field sizes. This is a
> > problem with MBCS code pages but UTF-8s will make it worse. There will be
> > lots of wasted storage when characters can vary in size from 1 to 6
> > bytes.
> >
> > Most other database systems require specific code to support Unicode. As
> > a consequence most have implemented using UCS-2. Their migration is
> > obviously to use UTF-16. UTF-8s buys them nothing but headaches.
> >
> > Carl
> >
> > -----Original Message-----
> > From: Kenneth Whistler [mailto:kenw@sybase.com]
> > Sent: Tuesday, May 29, 2001 3:47 PM
> > To: cbrown@xnetinc.com
> > Cc: unicode@unicode.org; kenw@sybase.com
> > Subject: RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and
> > email)
> >
> > Carl,
> >
> > > Ken,
> > >
> > > UTF-8s is essentially a way to ignore surrogate processing. It allows a
> > > company to encode UTF-16 with UCS-2 logic.
> > >
> > > The problem is that by not implementing surrogate support you can
> > > introduce subtle errors. For example it is common to break buffers apart
> > > into segments. These segments may be reconcatenated but they may be
> > > processed individually.
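
The buffer-splitting hazard described here can be sketched in a few lines of Python. The function split_units and its behaviour are assumptions made for illustration: it treats a buffer as a list of UTF-16 code units and moves a segment boundary by one unit whenever a fixed-size cut would fall between a high and a low surrogate.

```python
def is_high(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def split_units(units: list, size: int) -> list:
    """Split UTF-16 code units into segments of about `size` units without
    separating a surrogate pair (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunks = []
    start = 0
    while start < len(units):
        end = min(start + size, len(units))
        # A cut between a high and a low surrogate would leave half a
        # character in each segment, so move the boundary by one unit.
        if end < len(units) and is_high(units[end - 1]) and is_low(units[end]):
            end = end - 1 if end - 1 > start else end + 1
        chunks.append(units[start:end])
        start = end
    return chunks


# "ab", then U+10000 as the pair D800 DC00, then "c"
units = [0x61, 0x62, 0xD800, 0xDC00, 0x63]

naive = [units[i:i + 3] for i in range(0, len(units), 3)]
safe = split_units(units, 3)

print(naive)  # the pair is cut across two segments
print(safe)   # the pair stays together in one segment
```

UCS-2 logic corresponds to the naive split: each segment looks valid unit by unit, yet neither half of the pair is a character when a segment is processed on its own.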
> >
> > You are preaching to the choir here. I didn't state that *I* was in
> > favor of UTF-8S -- only that we have to be careful not to assume that
> > UTC will obviously not support it. The proponents of UTF-8S are
> > vigorously and actively campaigning for their proposal. In
> > standardization committees, proposals that have committed, active
> > proponents who can aim for the long haul, often have a way of getting
> > adopted in one form or another, unless there are equally committed
> > and active opponents of the proposal. It is just the nature of
> > consensus politicking in these committees, whether corporate based
> > or national body based.
> >
> > Also, I consider the stated position of "near-universal agreement
> > among the database vendors" to be largely a rhetorical device by
> > the proponents. Oracle is clearly pushing the proposal. NCR has
> > stated it is not in favor of the proposal. The other big enterprise
> > database vendors are hedging their positions somewhat -- in
> > particular, the standards people in those companies may not be
> > entirely in agreement with some of their database engine developers, for
> > example. And the small database vendors are either not playing
> > in this space or are part of desktop systems that will just follow
> > the behavior of the platforms.
> >
> > --Ken
>


