RE: UTF-16 problems

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Tue Jun 12 2001 - 13:05:08 EDT


Toby,

I agree that there is a need to preserve standards. Oracle did not support
surrogates: if you passed it a UTF-16 data stream, surrogate pairs were not
converted into proper UTF-8 encoding. At that juncture Oracle should have
fixed UTF8 instead. The fix would have worked with the old data, because
that data contained no non-plane-0 code points, so backward compatibility
would have been preserved.

This is the documentation:

"
  UTF8

  The UTF8 character set encodes characters in one to three bytes.
  Surrogate pairs require six bytes.

  AL32UTF8

  The AL32UTF8 character set encodes characters in one to three bytes.
  Surrogate pairs require four bytes.
"

If asked to build a database for UTF-8 support, which do you think a DBA
would use? Do they know what surrogates are, or whether they should be
encoded with 4 or 6 bytes?

>I equate this issue identically to the Unicode Consortium's refusal
>to change UCD names even when they are incredibly misleading, as
>is the case with U+20A0 EURO CURRENCY SIGN.

Your point is well taken. I agree that the impact of changing the name to
"OBSOLETE EURO CURRENCY SIGN" or something similar is far less than that of
keeping it and confusing users. The same applies to Oracle. The question is
how to recover from a bad decision.

1) First, explain the implications in the documentation. For example:

  UTF8

  The UTF8 character set encodes the first 65,536 Unicode code points (the
  Basic Multilingual Plane) in one to three bytes, as in standard UTF-8.
  Higher Unicode characters, which use UTF-16 surrogate pairs, require six
  bytes. This is a non-standard UTF-8 encoding that is used to produce
  data that sorts in the same sequence as UTF-16.

  AL32UTF8

  The AL32UTF8 character set encodes all Unicode characters in one to four
  bytes, using standard UTF-8 encoding throughout. This sorts in UTF-32
  (standard Unicode code point) order.

(A short sketch after item 2 below shows this sort-order difference in
practice.)

2) In future releases, change the name UTF8 to AL16UTF8. This should only
affect the DBAs who build and maintain the databases, and it will at least
put the two on an equal footing. The name change should not have a major
compatibility impact; they could even make UTF8 an alias of AL16UTF8 for a
few releases.
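
As promised under item 1, here is a rough Python sketch (again my own, with
arbitrary sample characters, not anything Oracle ships) of why the six-byte
surrogate form sorts like UTF-16 under a plain byte compare, while standard
UTF-8 sorts in code point (UTF-32) order:

  # U+FFFD is a BMP character; U+10400 is a supplementary character whose
  # UTF-16 surrogates (D801 DC00) sort below U+FFFD as code units, even
  # though its code point is higher.
  bmp, supp = "\uFFFD", "\U00010400"

  # Six-byte surrogate form of U+10400 is ED A0 81 ED B0 80; standard
  # UTF-8 is F0 90 90 80.  U+FFFD is EF BF BD either way.
  six_byte  = {bmp: b"\xef\xbf\xbd", supp: b"\xed\xa0\x81\xed\xb0\x80"}
  four_byte = {bmp: b"\xef\xbf\xbd", supp: b"\xf0\x90\x90\x80"}

  print(bmp.encode("utf-16-be") < supp.encode("utf-16-be"))  # False
  print(six_byte[bmp] < six_byte[supp])     # False - matches UTF-16 order
  print(four_byte[bmp] < four_byte[supp])   # True  - code point order
  print(bmp.encode("utf-32-be") < supp.encode("utf-32-be"))  # True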

Carl

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of toby_phipps@peoplesoft.com
Sent: Monday, June 11, 2001 8:41 PM
To: Michael (michka) Kaplan
Cc: Jianping.Yang@oracle.com; rick@unicode.org; unicode@unicode.org
Subject: Re: UTF-16 problems

Jianping Yang <Jianping.Yang@oracle.com> wrote:
>>So far, I can claim that only Oracle provides fully UTF-8 and
>>UTF-16 support for RDBMS, but unfortunately, as we cannot change the
>>existing utf8 definition from Oracle 8i for backward compatibility, we
>>have to use a new character set name for it as AL32UTF8.

Michael (michka) Kaplan <michka@trigeminal.com> wrote:
>As many have pointed out, THIS will cause more confusion than just about
>anything else. Tex is the only one who said anything, but he is not the
>only one to believe you are seriously undermining the standard with this
>decision. It certainly does a lot to hurt interoperability.

Yes, it will cause confusion; however, stability and 100% backward
compatibility are an overriding concern. I'd choose a little confusion any
time if given the choice between confusion and breaking products that
depend on you.

Just like systems build dependence on UCD character names, users of
database systems build dependence on vendor naming conventions. Changing
core API name references is not something that any responsible vendor would
do without overwhelming support from their customer base, and since the
database character set is chosen once per database installation and is not
visible to the average user, I see no overwhelming reason for Oracle to
change this. I admit it is confusing at first; however, they do have it
well documented (and I can only assume it will be documented with even
greater clarity in their 9i release, where many additional Unicode features
have been added), and they also support the true, correct UTF-8 definition
as per ISO 10646 and TUS 3.0.

I equate this issue identically to the Unicode Consortium's refusal to
change UCD names even when they are incredibly misleading, as is the case
with U+20A0 EURO CURRENCY SIGN. This is obviously not the "Euro currency
sign" regardless of its name. The description points to the appropriate
character for the real sign. Oracle has had to do the same thing with their
UTF8 character set to ensure backwards compatibility and stability - leave
it as-is, but document very clearly that it may not be what the user
expects, and point them to an alternative character set setting
(AL32UTF8).

Toby.


