Re: Oracle and Surrogate Pairs

From: Jianping Yang
Date: Tue Jul 25 2000 - 15:03:27 EDT

Not bad news at all for Oracle: our application vendors have explicitly
required that a surrogate pair be stored as 6 bytes in the database, so that
they get the same semantics as UTF-16.

As for conformance, I don't think there is any issue for the database client
if UTF-16 is used on the client side. In the next release we will also support
4-byte UTF-8 on the client side, for conformance, if the user wants it.
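
To make the two byte forms concrete, here is a minimal Python sketch. It is
an illustration only, not Oracle's implementation: the helper name
encode_six_byte is hypothetical, and the 6-byte form is produced by encoding
each UTF-16 surrogate as its own 3-byte sequence, as described in this
thread. It also compares binary sort order against a BMP character above the
surrogate range (U+FF5E, chosen arbitrarily).

```python
def encode_six_byte(text: str) -> bytes:
    """Encode each UTF-16 code unit of `text` as its own UTF-8-style sequence.

    A supplementary character becomes two 3-byte sequences (6 bytes total).
    Hypothetical helper for illustration; not a conformant UTF-8 encoder.
    """
    units = text.encode("utf-16-be")
    out = b""
    for i in range(0, len(units), 2):
        unit = int.from_bytes(units[i:i + 2], "big")
        # "surrogatepass" lets Python emit the 3-byte form of a lone surrogate.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return out


supp = "\U00010000"  # first supplementary (non-BMP) character
bmp = "\uFF5E"       # a BMP character above the surrogate range

four = supp.encode("utf-8")   # standard 4-byte UTF-8
six = encode_six_byte(supp)   # 6-byte surrogate-pair form

print(four.hex(" "))  # f0 90 80 80
print(six.hex(" "))   # ed a0 80 ed b0 80

# Binary ordering of the supplementary character relative to the BMP one.
utf16_first = supp.encode("utf-16-be") < bmp.encode("utf-16-be")
utf8_first = four < bmp.encode("utf-8")
six_first = six < encode_six_byte(bmp)
print(utf16_first, utf8_first, six_first)  # True False True
```

In this sketch the 6-byte form sorts the same way as UTF-16 code units do,
while standard 4-byte UTF-8 sorts in code point order instead.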

Jianping.

Peter Constable wrote:

> >As Oracle UTF8 character set definition supports surrogates by a pair of two
> >3-byte sequences to be in sync with UTF-16 in binary sorting and code point,
> This is not a conformant representation.
> D29 (p. 46) states that a UTF "transforms each Unicode scalar value into a
> unique sequence of code values". Am I not right in saying that xD800 -
> xDFFF are not valid Unicode scalar values? (If so, then three bytes that
> map to one of these values are not valid UTF-8.) The text after the
> definition states that "...invalid scalar values include... unpaired
> surrogates" and here we'd be dealing with paired surrogates. But the usage
> described above is mapping individual surrogate code values to a UTF-8
> sequence, and that seems to be invalid. Furthermore, D29 requires unique
> mappings. If we allow both 4-byte and 6-byte representations for a given
> non-BMP character, that condition is violated. This also violates the
> specification of D36, which refers to table 3-1, and also the normative
> text below that says, "when converting a Unicode scalar value to UTF-8, the
> shortest form that can represent those values shall be used."
> So, if you're representing non-BMP characters in Oracle using quasi-UTF-8
> sequences that are six bytes long, you are not conforming to the spec for
> UTF-8, and your software is not conformant to the Unicode standard (or to
> ISO 10646). Sorry for the bad news...
> - Peter
> ---------------------------------------------------------------------------
> Peter Constable
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <>
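
The conformance argument in the quoted message can be checked against a
strict decoder. The following Python sketch assumes Python 3's built-in
"utf-8" codec as the reference for strictness; the helper name is_strict_utf8
is hypothetical. It shows the 4-byte form being accepted while the 6-byte
surrogate form and a non-shortest (overlong) sequence are both rejected.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if `data` decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


four_byte_form = bytes.fromhex("f0908080")     # U+10000 as 4-byte UTF-8
six_byte_form = bytes.fromhex("eda080edb080")  # U+D800 U+DC00, 3 bytes each
overlong_nul = bytes.fromhex("c080")           # non-shortest form of U+0000

print(is_strict_utf8(four_byte_form))  # True
print(is_strict_utf8(six_byte_form))   # False
print(is_strict_utf8(overlong_nul))    # False
```

A client that receives the 6-byte form would therefore need an explicit
conversion step before handing the bytes to a strict UTF-8 consumer.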
