RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Thu May 31 2001 - 12:25:25 EDT

Next message: Marco Cimarosti: "RE: Some Char. to Glyph Statistics, Pan/Single Font"
Previous message: Jonathan Rosenne: "RE: RECOMMENDATIONs( Term Asian is not used properly on Computers and NET)"
In reply to: Simon Law: "Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)"
Next in thread: B: "RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)"

Simon,

I now see that you support both "UTF8", where surrogates are encoded as 6
bytes, and "AL32UTF8", where surrogates are encoded as 4 bytes. The way your
documentation reads, many users are likely to select "UTF8" over "AL32UTF8".
You should have users who already have UTF8 databases migrate to the proper
UTF8 encoding rather than making them the exception to the rule.

If you have this funny encoding please don't call it UTF8 because it is not
UTF8 and will only confuse users. You could call it OTF8 or something like
that but not UTF8.
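
To make the difference concrete, here is a minimal sketch of the two byte layouts for one character outside the BMP. The helper name encode_surrogate_pairs_as_utf8 is hypothetical and is not an Oracle API; it assumes the 6-byte form is produced by splitting the character into its UTF-16 surrogate pair and encoding each surrogate as a 3-byte sequence.

```python
def encode_surrogate_pairs_as_utf8(text):
    """Encode text the 6-byte way (hypothetical helper, not an Oracle API).

    Assumption: each supplementary character is first split into a UTF-16
    surrogate pair, and each surrogate is then written as a 3-byte sequence.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            v = cp - 0x10000
            high = 0xD800 + (v >> 10)    # high (lead) surrogate
            low = 0xDC00 + (v & 0x3FF)   # low (trail) surrogate
            for unit in (high, low):
                # Apply the 3-byte UTF-8 bit pattern to a single surrogate.
                out += bytes([0xE0 | (unit >> 12),
                              0x80 | ((unit >> 6) & 0x3F),
                              0x80 | (unit & 0x3F)])
        else:
            # BMP characters are identical in both forms.
            out += ch.encode("utf-8")
    return bytes(out)


ch = "\U00010400"  # a character outside the BMP
print(ch.encode("utf-8").hex(" "))                   # f0 90 90 80
print(encode_surrogate_pairs_as_utf8(ch).hex(" "))   # ed a0 81 ed b0 80
```

A strict UTF-8 decoder should reject the second byte sequence, which is why sharing the name is a problem.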

Carl
  -----Original Message-----
  From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On Behalf Of Simon Law
  Sent: Wednesday, May 30, 2001 11:02 AM
  To: unicode@unicode.org
  Subject: Re: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Hi Folks,
Over the last few days, this email thread has generated many interesting
discussions on the UTF-8s proposal. At the same time, there has been some
speculation about why Oracle is asking for this encoding form. I hope
to clear up some of this misinformation in this email.

In Oracle9i, our next database release shipping this summer, we have
introduced support for two new Unicode character sets. One is 'AL16UTF16',
which supports the UTF-16 encoding, and the other is 'AL32UTF8', which is the
fully compliant UTF-8 character set. Both of these conform to the Unicode
standard, and surrogate characters are stored strictly in 4 bytes. For more
information on Unicode support in Oracle9i, please check out the whitepaper
"The power of Globalization Technology" on
http://otn.oracle.com/products/oracle9i/content.html

The requests for UTF-8s came from many of our Packaged Applications
customers (such as PeopleSoft, SAP, etc.). The ordering of the binary sort
is an important requirement for these Oracle customers. We are supporting
them, and we hope to turn this into a TR so that UTF-8s can be referenced
by other vendors when they need compatible binary order for UTF-16
and UTF-8 across different platforms.
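
The binary ordering point can be illustrated with a minimal sketch. It assumes UTF-8s encodes every UTF-16 code unit, surrogates included, as its own UTF-8 sequence; the helper names utf16_units and utf8s_bytes are illustrative only and are not part of any product.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def utf8s_bytes(text):
    # Assumption: UTF-8s writes each UTF-16 code unit (surrogates included)
    # as a separate UTF-8 sequence. "surrogatepass" lets Python emit the
    # 3-byte form for a lone surrogate instead of raising an error.
    return b"".join(chr(u).encode("utf-8", "surrogatepass")
                    for u in utf16_units(text))


bmp_high = "\uFF5E"            # a BMP character above the surrogate range
supplementary = "\U00010400"   # a character outside the BMP

keys = [bmp_high, supplementary]
by_utf16 = sorted(keys, key=lambda s: s.encode("utf-16-be"))
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))
by_utf8s = sorted(keys, key=utf8s_bytes)

print(by_utf16 == by_utf8)    # False: the two standard forms disagree
print(by_utf16 == by_utf8s)   # True: UTF-8s follows UTF-16 binary order
```

The disagreement only arises between supplementary characters and BMP characters in the range U+E000 to U+FFFF, since surrogate code units (D800 to DFFF) sort below that range in UTF-16 while 4-byte UTF-8 sequences sort above it.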

The speculation that we are pushing for UTF-8s because we are trying to
minimize our code changes for supporting surrogates, or because of our
unique database design, is totally false. Oracle has a fully
internationalized, extensible architecture and has introduced surrogate
support in Oracle9i. In fact, we are probably the first database vendor to
support both the UTF-16 and UTF-8 encoding forms. We will continue to
support them and conform to future enhancements to the Unicode Standard.

Regards

Simon

"Carl W. Brown" wrote:

    Ken,
    I suspect that Oracle is specifically pushing for this standard because of
    its unique database design. In a sense Oracle almost picks itself up by
    its own bootstraps. It has always tried to minimize actual code. Therefore
    it was a natural choice to implement Unicode with UTF-8 because it is easy
    to reuse the multibyte support with minor changes to handle a different
    character length algorithm. This has been one of the reasons that Oracle
    has been successful. Its Tinkertoy-like design has enabled them to quickly
    adapt and add new features. Now, however, they should take the time to "do
    it right". Its UTF-8 storage creates problems for database designers
    because they cannot predict field sizes. This is a problem with MBCS code
    pages but UTF-8s will make it worse. There will be lots of wasted storage
    when characters can vary in size from 1 to 6 bytes.
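
As a rough illustration of the field-size problem, the sketch below counts bytes per character in each form. The function names are made up for this example, and the UTF-8s figure assumes a surrogate pair is stored as two 3-byte sequences.

```python
def byte_lengths(ch):
    """Return (utf8, utf8s, utf16) byte counts for one character."""
    utf8 = len(ch.encode("utf-8"))
    utf16 = len(ch.encode("utf-16-be"))
    # Assumption: UTF-8s stores a surrogate pair as 3 + 3 bytes; every
    # other character is byte-for-byte the same as in UTF-8.
    utf8s = 6 if utf16 == 4 else utf8
    return utf8, utf8s, utf16


def worst_case_bytes(n_chars):
    """Bytes to reserve so that any n_chars characters are guaranteed to fit."""
    return {"utf8": 4 * n_chars, "utf8s": 6 * n_chars, "utf16": 4 * n_chars}


for ch in ["A", "\u00e9", "\u4e2d", "\U00010400"]:
    print(f"U+{ord(ch):04X}", byte_lengths(ch))

print(worst_case_bytes(10))
```

A column sized for the worst case in UTF-8s needs 6 bytes per character, even though plain ASCII text uses only 1.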

    Most other database systems require specific code to support Unicode. As a
    consequence, most have implemented it using UCS-2. Their obvious migration
    path is to UTF-16. UTF-8s buys them nothing but headaches.

Carl

    -----Original Message-----
    From: Kenneth Whistler [mailto:kenw@sybase.com]
    Sent: Tuesday, May 29, 2001 3:47 PM
    To: cbrown@xnetinc.com
    Cc: unicode@unicode.org; kenw@sybase.com
    Subject: RE: ISO vs Unicode UTF-8 (was RE: UTF-8 signature in web and email)

Carl,

> Ken,
>
> UTF-8s is essentially a way to ignore surrogate processing. It allows a
> company to encode UTF-16 with UCS-2 logic.
>
> The problem is that by not implementing surrogate support you can introduce
> subtle errors. For example, it is common to break buffers apart into
> segments. These segments may be reconcatenated, but they may also be
> processed individually.
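
The buffer-splitting hazard described above can be sketched as follows. This is only an illustration under simple assumptions (a buffer held as a list of UTF-16 code units, fixed-size segments); safe_split and utf16_units are hypothetical names.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def safe_split(units, size):
    """Split code units into segments of at most `size` units without
    separating a high surrogate from the low surrogate that follows it."""
    if size < 2:
        raise ValueError("size must be at least 2 to hold a surrogate pair")
    chunks = []
    i = 0
    while i < len(units):
        end = min(i + size, len(units))
        # If the cut falls between the two halves of a pair, back off by one
        # unit so the whole pair moves to the next segment.
        if (end < len(units)
                and 0xD800 <= units[end - 1] <= 0xDBFF
                and 0xDC00 <= units[end] <= 0xDFFF):
            end -= 1
        chunks.append(units[i:end])
        i = end
    return chunks


units = utf16_units("ab\U00010400cd")

# UCS-2 style logic: cut every 3 units, unaware of surrogates.
naive = [units[i:i + 3] for i in range(0, len(units), 3)]
print([[hex(u) for u in c] for c in naive])   # first segment ends in 0xd801

# Surrogate-aware split keeps 0xd801 0xdc00 together.
print([[hex(u) for u in c] for c in safe_split(units, 3)])
```

With the naive split, a segment processed on its own ends in an unpaired high surrogate, which is exactly the kind of subtle error that only shows up with supplementary characters.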

    You are preaching to the choir here. I didn't state that *I* was in
    favor of UTF-8S -- only that we have to be careful not to assume that
    UTC will obviously not support it. The proponents of UTF-8S are
    vigorously and actively campaigning for their proposal. In
    standardization committees, proposals that have committed, active
    proponents who can aim for the long haul, often have a way of getting
    adopted in one form or another, unless there are equally committed
    and active opponents of the proposal. It is just the nature of
    consensus politicking in these committees, whether corporate based
    or national body based.

    Also, I consider the stated position of "near-universal agreement
    among the database vendors" to be largely a rhetorical device by
    the proponents. Oracle is clearly pushing the proposal. NCR has
    stated it is not in favor of the proposal. The other big enterprise
    database vendors are hedging their positions somewhat -- in
    particular, the standards people in those companies may not be
    entirely in agreement with some of their database engine developers, for
    example. And the small database vendors are either not playing
    in this space or are part of desktop systems that will just follow
    the behavior of the platforms.

--Ken


This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT