Re: UTF-8, U+0000 and JDK

From: Markus Kuhn (Markus.Kuhn@cl.cam.ac.uk)
Date: Sun Sep 26 1999 - 16:02:13 EDT


"Valeriy E. Ushakov" wrote on 1999-09-26 17:10 UTC:
> > U+0000 = c0 80
>
> I belive that's exactly what JDK uses to encode U+0000 in utf-8
> encoded NUL terminated C strings to distinguish U+0000 which is part
> of a string from the terminating NUL.

It would probably help to avoid confusion if the Java documentation
introduced a new name for this encoding. Good and clear terminology is
never a bad thing.

Suggestion:

  UTF-8Z = zero-free UTF-8 encoding, which differs from
           UTF-8 only for one character, namely U+0000 = c0 80

But then, Java uses UTF-8Z only as an internal encoding, and not in its
UTF-8 I/O functions.
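
To make the suggested definition concrete, here is a minimal sketch in
Python. It is not the JDK's code; the names encode_utf8z and decode_utf8z
are made up for illustration, and the sketch assumes UTF-8Z differs from
UTF-8 only in how U+0000 is written, exactly as defined above.

```python
def encode_utf8z(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes c0 80."""
    # In standard UTF-8 the byte 00 only ever encodes U+0000, so a plain
    # byte replacement after encoding is sufficient.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_utf8z(data: bytes) -> str:
    """Inverse of encode_utf8z: map c0 80 back to U+0000, then decode."""
    # The byte c0 never occurs in well-formed UTF-8, so the pair c0 80
    # cannot be mistaken for part of another character.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "a\u0000b"
encoded = encode_utf8z(sample)
print(encoded.hex())                      # 61c08062
print(b"\x00" in encoded)                 # False: safe inside a C string
print(decode_utf8z(encoded) == sample)    # True
```

The encoded form contains no 00 byte, so a terminating NUL can be
appended without ambiguity.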

I think it was a curious design decision:

I probably would have selected U+0000 = fe. This is as malformed as
c0 80, but has the big advantage that UTF-8 and UTF-8Z would then always
have had the same length. Note that fe and ff are unused in UTF-8.
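
The length argument can be checked with a small sketch in Python. The
helper encode_variant is hypothetical and exists only for this
comparison; the fe variant is the alternative proposed here, not
something Java implements.

```python
def encode_variant(text: str, nul_bytes: bytes) -> bytes:
    """UTF-8 encode text, writing U+0000 as nul_bytes instead of 00."""
    # The byte 00 appears in UTF-8 only as the encoding of U+0000.
    return text.encode("utf-8").replace(b"\x00", nul_bytes)


sample = "a\u0000b\u0000"

plain = sample.encode("utf-8")                 # 61 00 62 00
c080 = encode_variant(sample, b"\xc0\x80")     # 61 c0 80 62 c0 80
fe = encode_variant(sample, b"\xfe")           # 61 fe 62 fe

print(len(plain), len(c080), len(fe))          # 4 6 4
```

With c0 80 the encoded string grows by one byte for every U+0000 it
contains, whereas with fe the byte count always matches plain UTF-8.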

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>
