I believe that John has it exactly right. The variant form of UTF-8 (call it
UTF-8n for now) is not designed for general transmission, but is a special
format for serialized Java Strings.
Note also, as stated on page A-8 of the Unicode Standard V2.0, that receiving
implementations do not have to check that the shortest form is being
used when converting. For those implementations, UTF-8n--although out of
spec--will be converted correctly.
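
To make the point about receivers that skip the shortest-form check concrete,
here is a minimal Java sketch of such a converter. The class name LenientUtf8
and its decode method are made up for this illustration (they are not part of
any Java library), and the sketch only handles the 1-, 2- and 3-byte sequences
that map onto 16-bit Unicode. It does not validate continuation bytes.

```java
final class LenientUtf8 {
    /**
     * Hypothetical lenient converter: decodes 1-, 2- and 3-byte sequences
     * into 16-bit chars and never checks for the shortest form.
     */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length) {
            int b0 = in[i] & 0xFF;
            if (b0 < 0x80) {
                // 0xxxxxxx: one byte
                out.append((char) b0);
                i += 1;
            } else if ((b0 & 0xE0) == 0xC0 && i + 1 < in.length) {
                // 110xxxxx 10xxxxxx: two bytes, no overlong check
                int b1 = in[i + 1] & 0x3F;
                out.append((char) (((b0 & 0x1F) << 6) | b1));
                i += 2;
            } else if ((b0 & 0xF0) == 0xE0 && i + 2 < in.length) {
                // 1110xxxx 10xxxxxx 10xxxxxx: three bytes, no overlong check
                int b1 = in[i + 1] & 0x3F;
                int b2 = in[i + 2] & 0x3F;
                out.append((char) (((b0 & 0x0F) << 12) | (b1 << 6) | b2));
                i += 3;
            } else {
                throw new IllegalArgumentException(
                        "unsupported or truncated sequence at byte " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] shortest = {0x00};                       // standard UTF-8 for U+0000
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};   // the UTF-8n form
        // Both inputs decode to the same one-character string.
        System.out.println(decode(shortest).equals(decode(overlong)));
    }
}
```

A converter along these lines accepts the two-byte form and yields U+0000
either way, which is why such receivers are not troubled by UTF-8n.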
However, any implementation that did more than just convert UTF-8 into 16-bit
Unicode, and that was handed UTF-8n text purporting to be UTF-8 text, could end
up with non-uniqueness problems, since U+0000 would then have two distinct byte
representations (0x00 and 0xC0 0x80).
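
The non-uniqueness can be seen from Java itself. The following is only a
sketch: the class NullEncodingDemo and its helper methods are names invented
for this example, while DataOutputStream.writeUTF and
String.getBytes(StandardCharsets.UTF_8) are the standard library calls. It
encodes the same three-character string both ways and then does the kind of
byte-level scan for 0x00 that software working directly on the bytes might do.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class NullEncodingDemo {
    /** Bytes written by DataOutputStream.writeUTF, with the 2-byte length prefix removed. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Bytes produced by the standard UTF-8 charset. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Byte-level scan of the sort a C-style consumer might perform. */
    static boolean containsZeroByte(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "a\u0000b";
        byte[] standard = standardUtf8(s);
        byte[] modified = modifiedUtf8(s);
        System.out.println(hex(standard));   // 61 00 62
        System.out.println(hex(modified));   // 61 C0 80 62
        System.out.println(containsZeroByte(standard)); // true
        System.out.println(containsZeroByte(modified)); // false
    }
}
```

The two byte arrays represent the same String, yet they differ under a
byte-wise comparison, and only one of them contains a 0x00 byte.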
John Cowan wrote:
> Elliotte Rusty Harold scripsit:
> > As you may or may not know, Java's UTF-8 encodes the null character, ASCII
> > 0, in two bytes rather than one as it should according to the UTF-8
> > specification.
> Not so fast. Java uses this encoding internally to represent Strings,
> and provides the readUTF() and writeUTF() methods to export it to
> binary files. But those methods are not meant for general purposes:
> they are meant to provide save/restore for String objects, as is
> indicated by the use of a 2-byte length (big-endian) before each
> modified UTF-8 content.
> The proper Java machinery for handling character encodings uses
> standard UTF-8 rules (that is, InputStreamReader for input and
> OutputStreamWriter for output: these classes convert between
> byte streams and character streams).
> > 1. Will using Java's UTF-8 format produce problems for any software
> > anyone's aware of?
> Definitely. Software assuming that U+0000 can only be encoded as 0x00
> may miss "stealth" nulls encoded against the UTF-8 rules.
> > 2. In general, is it always acceptable to encode a one-byte character in
> > two or three bytes? or a two-byte character in three bytes?
> > 3. Does anyone know why Java does not want to encode the 0 character as a
> > single byte? In other words, is there any reason why a stream should not
> > contain embedded nulls?
> The main point is not the use in readUTF()/writeUTF(), but in the
> internal representation. For compatibility with C routines, Java
> Strings are stored in a guaranteed null-free representation so that
> trailing 0x00 bytes can be used as C end-of-string indicators.
> Since the machinery for processing mutated UTF must exist in every
> JVM anyway, it was natural to use it for reading and writing Strings
> as well. Note that the length values allow 0x00's to appear in the
> stream anyway!
> John Cowan firstname.lastname@example.org
> e'osai ko sarji la lojban.
-- business: email@example.com, firstname.lastname@example.org personal: email@example.com, http://www.macchiato.com --