Tex Texin wrote:
> I am not clear from your comments which is the bug, since the doc
> goes both ways. Are the doc bugs that they say
> it is UTF-8, or that they say it is modified UTF-8?
It uses modified UTF-8, modified in three ways:
1) U+0000 is encoded in two bytes as 0xc0 0x80;
2) values above U+FFFF are encoded in six bytes: each of the two
UTF-16 surrogates is encoded separately as a three-byte sequence;
3) the whole string is prefixed with a byte count represented
as a 2-byte big-endian binary integer.
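
The three differences can be seen by dumping the bytes directly. What
follows is only a minimal sketch, assuming the functions under discussion
are the DataOutput.writeUTF family and a Java version recent enough to have
java.nio.charset.StandardCharsets. The class name ModifiedUtf8Demo and its
helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    /** Bytes written by DataOutputStream.writeUTF, including the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // writeUTF rejects strings whose encoded form exceeds 65535 bytes
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Space-separated lowercase hex dump (hypothetical helper). */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'A', U+0000, then U+10000 written as its surrogate pair
        String s = "A\u0000\uD800\uDC00";
        System.out.println("modified: " + hex(modifiedUtf8(s)));
        System.out.println("standard: " + hex(s.getBytes(StandardCharsets.UTF_8)));
        // modified: 00 09 41 c0 80 ed a0 80 ed b0 80
        // standard: 41 00 f0 90 80 80
    }
}
```

In the first line of output, 00 09 is the byte count, c0 80 is U+0000, and
ed a0 80 ed b0 80 is the surrogate pair for U+10000. Standard UTF-8 has no
prefix, writes U+0000 as a single 00 byte, and writes U+10000 in four bytes.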
> It would be great to learn that the functions are actually unmodified
> UTF-8, as I know of some interfaces that are writing non-Java
> code and are forced to deal with specialized handling of the modified
> It would be great to inform them they can use standard UTF-8 library
*chomp* No such luck, Doc!
-- 
There is / one art             || John Cowan <firstname.lastname@example.org>
no more / no less              || http://www.reutershealth.com
to do / all things             || http://www.ccil.org/~cowan
with art- / lessness           \\ -- Piet Hein
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:18 EDT