Re: FSS-UTF, UTF-2, UTF-8, and UTF-16

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Mon Jun 18 2001 - 21:04:29 EDT


There is one statement that appears to want to be framed:

Jianping Yang wrote:
> [...] When Unicode
> came to version 2.1, we found our AL24UTFFSS had trouble for 2.1 as Hangul's
> reallocation, and we could not simply update AL24UTFFSS to 2.1 definition as it
> would mess existing users' data in their database. So we came up with a new
> character set as UTF8 which is still 3-byte encoding to support Unicode 2.1. [...]

This means that Oracle mis-implemented the UTF-8 standard as it was specified at that time, starting at least with Unicode 2.0.

UTF-8, as part of the ISO/Unicode standards, encoded UCS-4 units (that is, Unicode scalar values) in up to 6 bytes. Unicode scalar values reached up to 0x10FFFF, which required 4-byte UTF-8. This was already the case with Unicode 2.0 in 1996.
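
To make the byte counts concrete, here is a minimal sketch in Python of the length rule under the original definition, which allowed up to 6 bytes. The name utf8_length is hypothetical and chosen only for illustration; the range limits are the standard ones for each sequence length.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes needed to encode code_point under the
    original UTF-8 definition (sequences of 1 to 6 bytes)."""
    # Largest value representable by a sequence of 1, 2, ... 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for n_bytes, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n_bytes
    raise ValueError("value does not fit in 31 bits")


# A 3-byte encoding stops at 0xFFFF; anything above needs at least 4 bytes.
print(utf8_length(0xFFFF))    # 3
print(utf8_length(0x10000))   # 4
print(utf8_length(0x10FFFF))  # 4
```

An encoder limited to 3 bytes therefore cannot represent any scalar value above 0xFFFF directly.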

According to the above, Oracle implemented its new "UTF8" with the intention of implementing Unicode 2.1, and did not, in fact, follow the specifications of the time.

With the earliest implementations of Oracle "UTF8" not following their own intentions, it might be possible to supply a bug fix for older Oracle versions, if only to detect surrogates as Tex suggested.
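
As a rough illustration of what detecting surrogates could look like, the sketch below scans a byte string for 3-byte sequences that encode code points in the surrogate range U+D800..U+DFFF. This is an assumption about the general approach, not Oracle's code and not necessarily what Tex proposed; the function name find_encoded_surrogates is hypothetical. It relies on the fact that such sequences always start with 0xED followed by a byte in 0xA0..0xBF.

```python
def find_encoded_surrogates(data: bytes) -> list[int]:
    """Return the byte offsets of 3-byte sequences that encode
    surrogate code points (U+D800..U+DFFF)."""
    offsets = []
    for i in range(len(data) - 2):
        # 0xED is always a lead byte, never a continuation byte,
        # so a plain scan cannot match in the middle of a sequence.
        if (data[i] == 0xED
                and 0xA0 <= data[i + 1] <= 0xBF
                and 0x80 <= data[i + 2] <= 0xBF):
            offsets.append(i)
    return offsets


# U+10000 stored as two 3-byte-encoded surrogates (D800, DC00).
surrogate_form = bytes([0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80])
print(find_encoded_surrogates(surrogate_form))                # [0, 3]

# The same character in 4-byte UTF-8 contains no such sequences.
print(find_encoded_surrogates("\U00010000".encode("utf-8")))  # []
```

Hangul syllables that also use the 0xED lead byte (up to U+D7A3) are not flagged, because their second byte is below 0xA0.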

markus


