Re: Oracle and Surrogate Pairs

From: Mark Davis (markdavis@ispchannel.com)
Date: Wed Jul 26 2000 - 00:28:44 EDT


You could define a UTF that maps scalar values up to FFFF to the same bytes as
UTF-8, and values above FFFF to a 6-byte sequence. It would *not* be UTF-8, but
it can be well defined.
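
As a rough illustration of such a scheme (a sketch only; the function names
are made up here and this is not Oracle's actual code), the following Python
splits a value above FFFF into its UTF-16 surrogate pair and writes each
surrogate with the three-byte UTF-8 bit pattern:

```python
def encode_three_byte(unit: int) -> bytes:
    """Write a 16-bit value in 0x0800..0xFFFF using the UTF-8 three-byte pattern."""
    return bytes([
        0xE0 | (unit >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((unit >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (unit & 0x3F),         # 10xxxxxx: low 6 bits
    ])


def six_byte_form(cp: int) -> bytes:
    """Hypothetical UTF: same as UTF-8 up to FFFF, six bytes above FFFF."""
    if cp <= 0xFFFF:
        # Assumes cp is not itself a surrogate; Python's encoder rejects those.
        return chr(cp).encode("utf-8")
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)    # high (leading) surrogate
    low = 0xDC00 | (v & 0x3FF)   # low (trailing) surrogate
    return encode_three_byte(high) + encode_three_byte(low)


six = six_byte_form(0x10000)         # ED A0 80 ED B0 80
four = chr(0x10000).encode("utf-8")  # F0 90 80 80
bmp = six_byte_form(0xFF5E)          # EF BD 9E, identical to UTF-8
```

Here the six-byte form of U+10000 compares less than the bytes for U+FF5E,
which matches the order of the UTF-16 code units (D800 DC00 before FF5E). The
real UTF-8 form compares greater, which is the binary sorting difference the
six-byte scheme is meant to avoid.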

If you look below D29 -- p. 46 at the first full paragraph -- you find that for
round tripping, UTFs are required to map invalid Unicode scalar values,
including D800..DFFF, FFFE, FFFF, and the other values of the form xxFFFE and
xxFFFF.

Uniqueness is in the mapping from Unicode scalar values to bytes, not
necessarily the other way around. However, any bytes that don't round-trip are
an irregular sequence. There is more on them on those two pages.
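
To make the round-trip point concrete, here is a small Python sketch. It
assumes Python 3's built-in codecs, whose strict "utf-8" decoder rejects
encoded surrogates, so the "surrogatepass" error handler is used to let them
through. The helper name is hypothetical.

```python
def decode_lenient(data: bytes) -> str:
    """Decode bytes as UTF-8, also accepting three-byte encoded surrogate pairs."""
    # "surrogatepass" lets the encoded surrogates through as lone code points.
    text = data.decode("utf-8", "surrogatepass")
    # A trip through UTF-16 joins each high/low pair into one scalar value.
    # (An unpaired surrogate would make the final decode raise an error.)
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


six_bytes = b"\xed\xa0\x80\xed\xb0\x80"  # two three-byte encoded surrogates
scalar = decode_lenient(six_bytes)       # the single scalar value U+10000
round_tripped = scalar.encode("utf-8")   # the four-byte form F0 90 80 80
```

The six bytes decode to U+10000, but U+10000 encodes back to four bytes. The
scalar value has one byte sequence, yet more than one byte sequence leads to
it, and the one that does not come back unchanged is the irregular one.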

Mark

Peter_Constable@sil.org wrote:

> >As the Oracle UTF8 character set definition supports surrogates as a pair
> >of 3-byte sequences, to stay in sync with UTF-16 in binary sorting and code
> >point,
>
> This is not a conformant representation.
>
> D29 (p. 46) states that a UTF "transforms each Unicode scalar value into a
> unique sequence of code values". Am I not right in saying that xD800 -
> xDFFF are not valid Unicode scalar values? (If so, then three bytes that
> map to one of these values are not valid UTF-8.) The text after the
> definition states that "...invalid scalar values include... unpaired
> surrogates" and here we'd be dealing with paired surrogates. But the usage
> described above is mapping individual surrogate code values to a UTF-8
> sequence, and that seems to be invalid. Furthermore, D29 requires unique
> mappings. If we allow both 4-byte and 6-byte representations for a given
> non-BMP character, that condition is violated. This also violates the
> specification of D36, which refers to table 3-1, and also the normative
> text below that says, "when converting a Unicode scalar value to UTF-8, the
> shortest form that can represent those values shall be used."
>
> So, if you're representing non-BMP characters in Oracle using quasi-UTF-8
> sequences that are six bytes long, you are not conforming to the spec for
> UTF-8, and your software is not conformant to the Unicode standard (or to
> ISO 10646). Sorry for the bad news...
>
> - Peter
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>


