Re: Strange UTF-8 in Java

From: Doug Ewell (dewell@compuserve.com)
Date: Wed Sep 30 1998 - 04:02:32 EDT

Next message: Doug Ewell: "Question about SCSU example"
Previous message: Markus Kuhn: "UTF-8 support for xterm"
Maybe in reply to: Elliotte Rusty Harold: "Strange UTF-8 in Java"
Next in thread: John Cowan: "Re: Strange UTF-8 in Java"

Elliotte Rusty Harold <elharo@sunsite.unc.edu> wrote:

> As you may or may not know, Java's UTF-8 encodes the null character,
> ASCII 0, in two bytes rather than one as it should according to the
> UTF-8 specification. The standard two-byte decoding algorithm
> should handle this case anyway.

In fact, the NUL character U+0000 seems to be the canonical example
to show how characters may and may NOT be encoded in UTF-8.
Technical Report #4, which defined Unicode 1.1 and introduced
FSS-UTF (the first version of UTF-8), said simply:

   When there are multiple ways to encode a value, for example
   U+0000, only the shortest encoding is legal.

whereas RFC 2279 (January 1998) goes into more detail:

   NOTE -- actual implementations of the decoding algorithm above
   should protect against decoding invalid sequences. For instance,
   a naive implementation may (wrongly) decode the invalid UTF-8
   sequence C0 80 into the character U+0000, which may have security
   consequences and/or cause other problems.

In either case, the point is clear. The standard decoding algorithm
must be careful NOT to handle this case.
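
To make the difference concrete, here is a minimal sketch of a two-byte decoder that applies the shortest-form rule, next to the naive one the RFC warns about. The class and method names are my own invention for illustration, not anything from Java's libraries or from the RFC.

```java
// Hypothetical helper, not part of any standard library.
final class StrictUtf8 {

    // Naive decoding: extracts the payload bits and never checks
    // whether the sequence was the shortest possible encoding.
    static int naiveDecodeTwoByte(int b1, int b2) {
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    }

    // Strict decoding: validates both bytes, then rejects any value
    // that would have fit in a single byte (an overlong form).
    static int decodeTwoByte(int b1, int b2) {
        if ((b1 & 0xE0) != 0xC0) {
            throw new IllegalArgumentException("not a two-byte lead byte");
        }
        if ((b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a continuation byte");
        }
        int codePoint = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        if (codePoint < 0x80) {
            throw new IllegalArgumentException("overlong encoding");
        }
        return codePoint;
    }
}
```

With this sketch, naiveDecodeTwoByte(0xC0, 0x80) quietly returns 0, while decodeTwoByte(0xC0, 0x80) throws. A legitimate sequence such as C3 A9 decodes to U+00E9 either way.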

Elliotte continued:

> 1. Will using Java's UTF-8 format produce problems for any software
> anyone's aware of?

Yes, if the software is written to the UTF-8 spec.

> 2. In general, is it always acceptable to encode a one-byte
> character in two or three bytes? or a two-byte character in three
> bytes?

Apparently it is NEVER acceptable. Java's implementation is not
true UTF-8.
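
For anyone who wants to see the discrepancy for themselves, the following sketch compares the bytes written by DataOutputStream.writeUTF with a standard UTF-8 encoding of the same string. The class and helper names are hypothetical. It also assumes a Java release that provides java.nio.charset.StandardCharsets, which is newer than the JDK under discussion here. Note that writeUTF prefixes its output with a two-byte length, which the sketch strips before comparing.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; only the java.* calls are real library API.
final class NulEncodingDemo {

    // Bytes produced by writeUTF, without its two-byte length prefix.
    static byte[] javaModifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes produced by a standard UTF-8 encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

For a string holding only U+0000, hex(javaModifiedUtf8(...)) gives "C0 80" and hex(standardUtf8(...)) gives "00". For plain ASCII such as "A" the two agree.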

> 3. Does anyone know why Java does not want to encode the 0
> character as a single byte? In other words, is there any reason
> why a stream should not contain embedded nulls?

The only time a UTF-8 stream would contain an embedded 0x00 would be
when the underlying Unicode text contains U+0000. Why this perfectly
normal and appropriate use of a NUL would have to be concealed in an
escape sequence is beyond me.

-Doug


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:42 EDT