Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

From: John O'Conner (john.oconner@eng.sun.com)
Date: Mon Feb 05 2001 - 12:32:10 EST


Within a String, the encoding of char values is practically irrelevant. It is a
hidden encoding that is never exposed to the user or to the developer. When you
access String char values, you index into 16-bit Unicode values. To my knowledge,
Sun does not claim that its internal encoding of String is UTF-8 in any of its API
documentation.
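
To illustrate the point about indexing, here is a minimal sketch. The class
and method names (CharIndexDemo, charUnits, utf8Length) are made up for this
example and are not part of any Sun API. It only shows that charAt() hands
back 16-bit values regardless of how many bytes the same text would take in
UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any Java API.
class CharIndexDemo {
    // Collects the 16-bit char values of a String by index, as charAt() exposes them.
    static int[] charUnits(String s) {
        int[] units = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            units[i] = s.charAt(i);
        }
        return units;
    }

    // Number of bytes the same text occupies once explicitly converted to UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "A\u00E9\u20AC"; // LATIN CAPITAL A, E WITH ACUTE, EURO SIGN
        for (int unit : charUnits(s)) {
            System.out.println(String.format("U+%04X", unit));
        }
        System.out.println("chars: " + s.length());          // chars: 3
        System.out.println("UTF-8 bytes: " + utf8Length(s)); // UTF-8 bytes: 6
    }
}
```

The byte count only appears once you ask a converter for an external
encoding; the String itself is addressed purely in char units.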

Any component or converter that claims to produce a UTF-8 encoding should not
behave as you describe. For example, Java's UTF-8 converter does not encode U+0000
as 0xC0 0x80. If it ever does, please file a bug.
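
A quick way to check this yourself is sketched below. The class and helper
names (NulEncodingDemo, standardUtf8, modifiedUtf8, hex) are hypothetical.
The first helper goes through the UTF-8 charset converter. The second uses
DataOutputStream.writeUTF, which is documented to write a "modified UTF-8"
form with a two-byte length prefix; I am assuming that form is what the
quoted message has in mind, and it is a separate mechanism from the
converter.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any Java API.
class NulEncodingDemo {
    // Standard UTF-8 through the charset converter.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // DataOutputStream.writeUTF output: 2-byte length prefix, then modified UTF-8.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        System.out.println(hex(standardUtf8(nul))); // 00
        System.out.println(hex(modifiedUtf8(nul))); // 00 02 C0 80
    }
}
```

If the first line ever prints anything other than 00, that is the bug
worth filing.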

Regards,
John O'Conner

DougEwell2@cs.com wrote:

> This is laziness, intended to get around the "problem" of supplementary code
> points instead of handling them like any other code points. This reminds me
> of the Java bastardization of UTF-8, in which U+0000 is encoded 0xC0 0x80 so
> that no character string will ever contain the byte 0x00. (Nobody has ever
> explained to me why a character string would contain U+0000 in the first
> place.)


