Elliotte Rusty Harold <email@example.com> wrote:
> As you may or may not know, Java's UTF-8 encodes the null character,
> ASCII 0, in two bytes rather than one as it should according to the
> UTF-8 specification. The standard two-byte decoding algorithm
> should handle this case anyway.
In fact, the NUL character U+0000 seems to be the canonical example
to show how characters may and may NOT be encoded in UTF-8.
Technical Report #4, which defined Unicode 1.1 and introduced
FSS-UTF (the first version of UTF-8), said simply:
When there are multiple ways to encode a value, for example
U+0000, only the shortest encoding is legal.
whereas RFC 2279 (January 1998) goes into more detail:
NOTE -- actual implementations of the decoding algorithm above
should protect against decoding invalid sequences. For instance,
a naive implementation may (wrongly) decode the invalid UTF-8
sequence C0 80 into the character U+0000, which may have security
consequences and/or cause other problems.
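The check the RFC calls for is straightforward. The sketch below (method names are illustrative) shows a two-byte decode step that rejects overlong sequences such as C0 80, on the reasoning that a two-byte sequence must yield a code point of at least U+0080:

```java
// Minimal sketch of a two-byte UTF-8 decode step that rejects
// overlong sequences such as C0 80 (names are illustrative).
public class Utf8Decode {
    // Decodes the two-byte sequence b1 b2; returns the code point,
    // or -1 if the sequence is malformed or overlong.
    static int decodeTwoByte(int b1, int b2) {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) {
            return -1; // not a well-formed two-byte sequence
        }
        int cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        // Two-byte sequences must encode U+0080..U+07FF; anything
        // below U+0080 (including C0 80 -> U+0000) is overlong.
        if (cp < 0x80) {
            return -1;
        }
        return cp;
    }

    public static void main(String[] args) {
        System.out.println(decodeTwoByte(0xC0, 0x80)); // -1 (rejected)
        System.out.println(decodeTwoByte(0xC2, 0xA9)); // 169 (U+00A9)
    }
}
```

A naive decoder that skips the `cp < 0x80` test is exactly the implementation the RFC warns about.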
In either case, the point is clear: a standard decoding algorithm
must be careful NOT to accept this sequence.
> 1. Will using Java's UTF-8 format produce problems for any software
> anyone's aware of?
Yes, if the software is written to the UTF-8 spec.
> 2. In general, is it always acceptable to encode a one-byte
> character in two or three bytes? or a two-byte character in three bytes?
Apparently it is NEVER acceptable. Java's implementation is not
conformant.
> 3. Does anyone know why Java does not want to encode the 0
> character as a single byte? In other words, is there any reason
> why a stream should not contain embedded nulls?
The only time a UTF-8 stream would contain an embedded 0x00 would be
when the underlying Unicode text contains 0x0000. Why this perfectly
normal and appropriate use of a NUL would have to be concealed in an
escape sequence is beyond me.
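The two behaviors are easy to observe side by side. String.getBytes emits the single 00 byte the UTF-8 spec requires for U+0000, while DataOutputStream.writeUTF uses Java's "modified UTF-8" and emits the overlong pair C0 80 (the helper names below are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Compares standard UTF-8 with Java's "modified UTF-8" for U+0000.
public class NulEncoding {
    // Standard UTF-8 encoding of a string.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by writeUTF; the first two bytes are a
    // big-endian length prefix, so the payload starts at index 2.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeUTF(s);
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] std = standardUtf8("\u0000");
        byte[] mod = modifiedUtf8("\u0000");
        System.out.printf("standard UTF-8: %02X%n", std[0]);              // 00
        System.out.printf("modified UTF-8: %02X %02X%n", mod[2], mod[3]); // C0 80
    }
}
```

A strict decoder reading the writeUTF output as plain UTF-8 would reject it, which is the compatibility problem raised in question 1.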
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:42 EDT