Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

Date: Mon Feb 05 2001 - 11:44:59 EST

In a message dated 2001-02-05 5:19:59 Pacific Standard Time, writes:

> > I have heard a rumour (i.e. my source is not involved in the reported
> > activity) that:
> >
> > <quote>
> > SAP, PeopleSoft, Siebel, Oracle and others are actually
> > in the process of proposing a new format of UTF that will cause a UTF-16
> > surrogate pair to become two 3-byte UTF-8 codepoints so that UTF-8 will
> > have the same behaviour as UTF-16, that is, a surrogate will be two UTF-8
> > code points.
> > </quote>
> >
> > Can anyone corroborate this, and, if it's true, offer an opinion on it?

> Using UTF-8 to handle characters in the supplementary planes by way of
> using two separate code points in the surrogate range is NOT considered
> acceptable.
> Currently it is legal to interpret them but *not* to generate them (see
> refs on the Unicode site). Therefore, I hope you are mistaken about the
> rumor since this would be a Bad Thing (tm).
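
To make the difference concrete, here is a minimal Python sketch comparing
the standard four-byte UTF-8 form of a supplementary code point with the
six-byte form described in the rumour. The function names are hypothetical,
chosen only for illustration, and the sample character (U+1D11E) is an
arbitrary supplementary-plane code point, not one taken from the thread.

```python
def utf8_standard(cp: int) -> bytes:
    """Standard UTF-8 for one code point, via Python's built-in codec."""
    return chr(cp).encode("utf-8")


def _three_byte(unit: int) -> bytes:
    """Apply the 3-byte UTF-8 bit layout to a 16-bit value (0x0800..0xFFFF)."""
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def utf8_via_surrogates(cp: int) -> bytes:
    """The rumoured form: split a supplementary code point into its UTF-16
    surrogate pair, then encode each surrogate as its own 3-byte sequence."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return _three_byte(high) + _three_byte(low)


clef = 0x1D11E  # MUSICAL SYMBOL G CLEF, a supplementary-plane character
print(utf8_standard(clef).hex())        # f09d849e
print(utf8_via_surrogates(clef).hex())  # eda0b4edb49e
```

A strict decoder such as Python's rejects the six-byte form: calling
utf8_via_surrogates(clef).decode("utf-8") raises UnicodeDecodeError.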

This is laziness, intended to get around the "problem" of supplementary code
points instead of handling them like any other code points. This reminds me
of the Java bastardization of UTF-8, in which U+0000 is encoded 0xC0 0x80 so
that no character string will ever contain the byte 0x00. (Nobody has ever
explained to me why a character string would contain U+0000 in the first
place.)
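
As a rough illustration of that U+0000 trick, here is a minimal Python
sketch. The function name is hypothetical, and the sketch covers only the
NUL handling mentioned above, not every way Java's encoding differs from
standard UTF-8.

```python
def encode_nul_as_c0_80(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the overlong pair C0 80.

    In standard UTF-8 the byte 0x00 can only ever stand for U+0000, so a
    plain byte replacement after encoding is enough for this sketch.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


sample = "a\x00b"
print(sample.encode("utf-8").hex())        # 610062
print(encode_nul_as_c0_80(sample).hex())   # 61c08062
```

The result never contains the byte 0x00, which is the whole point of the
trick, but C0 80 is an overlong sequence, so a strict UTF-8 decoder such as
Python's raises UnicodeDecodeError on it.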

I have argued in the past that in some cases, semi-conformant Unicode
implementations might be better than non-Unicode solutions. But creating a
new UTF to get around your product's lack of real Unicode support *and then
expecting others to use your hack* is a different matter entirely. Just bite
the bullet and support UTF-8. It's not that hard.

-Doug Ewell
 Fullerton, California
