Re: UTF-8, U+0000 and Software Development (was: Re: New UTF-8 decoder stress test file)

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Mon Sep 27 1999 - 03:55:55 EDT

Next message: Mark E. Davis: "Re: Products supporting Unicode"
Previous message: Paul Keinanen: "Re: UTF-8, U+0000 and Software Development (was: Re: New UTF-8 decoder stress test file)"

Paul Keinanen wrote:

> [...]
>
> So I still do not understand why someone would insist of using wide
> null terminated C-strings in Java for non-plain text data (and use
> UTF-8 look alike multibyte style instead of UCS-2 for internal
> permanent storage, but that is an other question).

That's not the issue.

Java strings do not use null (U+0000) as a terminator; in fact, U+0000 is a legal
character anywhere in a string. I don't know where you got that impression. When
serializing string data, the Unicode in a Java string is converted to a modified
UTF-8. For ease of interworking with C string routines, the longer (two-byte) form
of the null is used, so that C doesn't break the data up into multiple strings.
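
A minimal sketch of the first point, that U+0000 is an ordinary character inside
a Java string (the class and method names here are hypothetical, chosen only for
illustration):

```java
public class EmbeddedNullDemo {
    // Builds a string with U+0000 in the middle; nothing is truncated.
    static String sample() {
        return "ab\u0000cd";
    }

    public static void main(String[] args) {
        String s = sample();
        // The length is stored with the string, not found by scanning for a null.
        System.out.println(s.length());          // 5
        System.out.println(s.indexOf('\u0000')); // 2
        System.out.println(s.endsWith("cd"));    // true
    }
}
```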

However -- as has been repeated on this list at least once per year -- these
modified UTF-8 routines are not the ones that Java uses for character conversion;
they are used purely for the Java internal serialization of character data, as
described in DataOutput.writeUTF and DataInput.readUTF. For backwards
compatibility, the names need to be maintained, but the Sun documentation clearly
states that they are modified UTF-8 formats. This modified format also has the
length in two bytes preceding the character data, so it clearly does not just
do a plain UTF-8 conversion.
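
The format described above can be inspected directly. The following is only an
illustrative sketch (the class name and the `encode` and `hex` helpers are
hypothetical) that writes a short string through `DataOutputStream.writeUTF` and
dumps the resulting bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    // Serializes s with DataOutput.writeUTF and returns the raw bytes.
    static byte[] encode(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        }
        return buffer.toByteArray();
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Two-byte length (00 04), then 'A', the two-byte null, then 'B'.
        System.out.println(hex(encode("A\u0000B"))); // 00 04 41 C0 80 42
    }
}
```

The leading `00 04` is the byte count of the encoded data, and no `00` byte
appears within the character data itself.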

As discussed on the Unicode FAQ, a conformant process actually can convert [C0
80] to U+0000 when going from UTF-8 to Unicode. On the other hand, a conformant
process cannot convert U+0000 to [C0 80] when going from Unicode to UTF-8.
However, there is nothing to stop a conformant process from generating a modified
UTF-8 conversion for specialized purposes, so long as it doesn't purport to be
generating standard UTF-8 and clearly states that it is a modification and not
the standard form. Admittedly, it would be wiser to at least title it something
different (UTF-8X?) in documentation for clarity.
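
For the Unicode-to-UTF-8 direction, here is a small sketch of what a standard
encoder produces for U+0000. It assumes a Java runtime that provides
`java.nio.charset.StandardCharsets`; the class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

public class StandardUtf8Demo {
    // Encodes s with the platform's standard UTF-8 charset (not writeUTF).
    static byte[] encodeStandard(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeStandard("A\u0000B");
        // U+0000 becomes the single byte 00, never the two-byte longer form.
        System.out.println(bytes.length); // 3
        System.out.println(bytes[1]);     // 0
    }
}
```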


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:53 EDT