Re: Oracle and Surrogate Pairs

From: Peter_Constable@sil.org
Date: Tue Jul 25 2000 - 13:52:03 EDT


>As the Oracle UTF8 character set definition supports surrogates by a pair of
>two 3-byte sequences, to stay in sync with UTF-16 in binary sorting and code
>points,

This is not a conformant representation.

D29 (p. 46) states that a UTF "transforms each Unicode scalar value into a
unique sequence of code values". Am I not right in saying that xD800 -
xDFFF are not valid Unicode scalar values? (If so, then three bytes that
map to one of these values are not valid UTF-8.) The text after the
definition states that "...invalid scalar values include... unpaired
surrogates", and here we would be dealing with paired surrogates; but the
usage described above maps each individual surrogate code value to its own
three-byte UTF-8 sequence, and that seems to be invalid. Furthermore, D29 requires unique
mappings. If we allow both 4-byte and 6-byte representations for a given
non-BMP character, that condition is violated. This also violates the
specification of D36, which refers to table 3-1, and also the normative
text below that says, "when converting a Unicode scalar value to UTF-8, the
shortest form that can represent those values shall be used."
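
To make the two representations concrete, here is a minimal Python sketch
(my own illustration, not Oracle's code) that encodes U+10000, the first
non-BMP character, both as conformant 4-byte UTF-8 and as the 6-byte
pair-of-surrogates form described above. A strict UTF-8 decoder, such as
Python 3's, rejects the latter.

# Illustrative only: encode U+10000 two ways and compare.

def utf8_4byte(scalar):
    """Conformant UTF-8 for a supplementary-plane scalar value:
    4 bytes, shortest form."""
    assert 0x10000 <= scalar <= 0x10FFFF
    return bytes([
        0xF0 | (scalar >> 18),
        0x80 | ((scalar >> 12) & 0x3F),
        0x80 | ((scalar >> 6) & 0x3F),
        0x80 | (scalar & 0x3F),
    ])

def surrogate_pair_6byte(scalar):
    """The quasi-UTF-8 form discussed above: split the character into a
    UTF-16 surrogate pair, then encode each surrogate code value as if
    it were a scalar value (3 bytes each, 6 bytes total)."""
    assert 0x10000 <= scalar <= 0x10FFFF
    v = scalar - 0x10000
    high, low = 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)
    def three(cv):
        return bytes([0xE0 | (cv >> 12),
                      0x80 | ((cv >> 6) & 0x3F),
                      0x80 | (cv & 0x3F)])
    return three(high) + three(low)

print(utf8_4byte(0x10000).hex())            # f0908080      (4 bytes)
print(surrogate_pair_6byte(0x10000).hex())  # eda080edb080  (6 bytes)

# A strict UTF-8 decoder must reject the 6-byte form:
try:
    surrogate_pair_6byte(0x10000).decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)

Because the same character has two different byte representations here,
binary comparison of the two forms also gives different results, which is
exactly the uniqueness problem D29 rules out.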

So, if you're representing non-BMP characters in Oracle using quasi-UTF-8
sequences that are six bytes long, you are not conforming to the spec for
UTF-8, and your software is not conformant to the Unicode standard (or to
ISO 10646). Sorry for the bad news...

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


