Re: Bastardizations of UTF-8 (was: Re: [OT] Unicode-compatible SQL?)

From: John Cowan (jcowan@reutershealth.com)
Date: Mon Feb 05 2001 - 16:26:05 EST


Tex Texin wrote:

> I am not clear from your comments which is the bug, since the doc
> goes both ways. Are the doc bugs that they say
> it is UTF-8, or that they say it is modified UTF-8?

It uses modified UTF-8, modified in three ways:

1) U+0000 is encoded in two bytes as 0xc0 0x80;

2) values above U+FFFF are encoded in six bytes: each of the two
surrogates of their UTF-16 form is encoded separately as a 3-byte
UTF-8 sequence;

3) the whole string is prefixed with a byte count represented
as a 2-byte big-endian binary integer.
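
A minimal sketch of the three modifications, assuming the functions
under discussion are Java's DataOutputStream.writeUTF (the thread does
not name them here, so treat that as my assumption). The class name
ModifiedUtf8Demo and the helpers writeUtf and hex are made up for
illustration only:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ModifiedUtf8Demo {
    // Encodes s with DataOutputStream.writeUTF and returns the raw bytes,
    // including the 2-byte length prefix.
    public static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // writeUTF throws if the encoded form exceeds 65535 bytes.
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 1) U+0000 becomes c0 80 instead of a single 00 byte.
        System.out.println(hex(writeUtf(String.valueOf((char) 0))));
        // 2) U+10000 becomes two 3-byte sequences, one per surrogate.
        System.out.println(hex(writeUtf(new String(Character.toChars(0x10000)))));
        // 3) Even plain ASCII carries the 2-byte big-endian byte count.
        System.out.println(hex(writeUtf("A")));
    }
}
```

The three lines printed should be "00 02 c0 80",
"00 06 ed a0 80 ed b0 80" and "00 01 41". Standard UTF-8 would encode
U+10000 in four bytes as f0 90 80 80, with no length prefix.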

> It would be great to learn that the functions are actually unmodified
> UTF-8, as I know of some interfaces that are writing non-Java
> code and are forced to deal with specialized handling of the modified
> UTF-8.
> It would be great to inform them they can use standard UTF-8 library
> routines.

*chomp* No such luck, Doc!

-- 
There is / one art             || John Cowan <jcowan@reutershealth.com>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein


