Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

Date: Tue Feb 06 2001 - 10:57:52 EST

Sadly, the readUTF/writeUTF methods are NOT the way to read or write
standard UTF-8 in Java. They are for sending serialized Strings to other
Java processes. This is documented by Sun (but very poorly): I bookmarked
the page on my other machine (the one in California, while I'm out of
town) because it was a surprise to me.

If you want UTF-8, treat it as an encoding: use a converter, as you
would for any other encoding.
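
A minimal sketch of the converter approach, assuming a modern JDK:
java.nio.charset.StandardCharsets only arrived in Java 7, and on older
JDKs one would instead pass an encoding name such as "UTF8" to
String.getBytes or OutputStreamWriter. The class and method names here
are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class Utf8Converter {
    // Encode a String as standard UTF-8 (no length prefix, U+0000 is one byte).
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode standard UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stream variant: wrap any OutputStream in a Writer that converts to UTF-8.
    static byte[] writeUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = toUtf8("h\u00e9llo");
        System.out.println(encoded.length + " bytes");
        System.out.println(fromUtf8(writeUtf8("h\u00e9llo")));
    }
}
```

Bytes produced this way can be handed to any standard UTF-8 library on
the receiving side, which is not true of writeUTF output.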

I'm not sure this is a bug, incidentally, because it means that all "I am
not a String" encodings are handled identically. The names are horrid and
the doc useless, tho'.

Best Regards,


Addison P. Phillips, Globalization Architect
webMethods, Inc., B2B Software Integration

+1 408.210.3569 (mobile) +1 408.904.4762 (fax)

On Mon, 5 Feb 2001, John Cowan wrote:

> Tex Texin wrote:
> > I am not clear from your comments which is the bug, since the doc
> > goes both ways. Are the doc bugs that they say
> > it is UTF-8, or that they say it is modified UTF-8?
> It uses modified UTF-8, modified in three ways:
> 1) U+0000 is encoded in two bytes as 0xc0 0x80;
> 2) values above U+FFFF are encoded in six bytes as the UTF-8 encoding
> of their UTF-16 equivalent form;
> 3) the whole string is prefixed with a byte count represented
> as a 2-byte big-endian binary integer.
> > It would be great to learn that the functions are actually unmodified
> > UTF-8, as I know of some interfaces that are writing non-Java
> > code and are forced to deal with specialized handling of the modified
> > UTF-8.
> > It would be great to inform them they can use standard UTF-8 library
> > routines.
> *chomp* No such luck, Doc!
> --
> There is / one art || John Cowan
> no more / no less ||
> to do / all things ||
> with art- / lessness \\ -- Piet Hein
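
The three modifications John lists can be seen by dumping the bytes that
DataOutputStream.writeUTF produces and comparing them with a standard
UTF-8 conversion. This is only a sketch: the class and helper names are
made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Bytes written by writeUTF: 2-byte big-endian length, then modified UTF-8.
    static byte[] writeUtfBytes(String s) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeUTF(s);
            data.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes produced by the ordinary UTF-8 converter, for comparison.
    static byte[] standardUtf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";                 // U+0000
        String supp = "\uD800\uDC00";      // U+10000 as a surrogate pair

        System.out.println("U+0000  writeUTF: " + hex(writeUtfBytes(nul)));
        System.out.println("U+0000  UTF-8:    " + hex(standardUtf8Bytes(nul)));
        System.out.println("U+10000 writeUTF: " + hex(writeUtfBytes(supp)));
        System.out.println("U+10000 UTF-8:    " + hex(standardUtf8Bytes(supp)));
    }
}
```

For U+0000, writeUTF yields 00 02 c0 80 where standard UTF-8 yields a
single 00 byte; for U+10000 it yields 00 06 ed a0 80 ed b0 80 where
standard UTF-8 yields f0 90 80 80.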
