Strange UTF-8 in Java

From: Elliotte Rusty Harold (elharo@sunsite.unc.edu)
Date: Sun Sep 27 1998 - 10:25:56 EDT


As you may or may not know, Java's UTF-8 encodes the null character, ASCII
0, in two bytes rather than one as it should according to the UTF-8
specification. The standard two-byte decoding algorithm should handle this
case anyway. Nonetheless I'm wary, since it does violate the "Be
conservative in what you write, be liberal in what you read" principle. So
my question is threefold:

1. Will using Java's UTF-8 format produce problems for any software
anyone's aware of?

2. In general, is it always acceptable to encode a one-byte character in
two or three bytes, or a two-byte character in three bytes?

3. Does anyone know why Java does not want to encode the 0 character as a
single byte? In other words, is there any reason why a stream should not
contain embedded nulls?
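
For concreteness, here is a minimal sketch of the behavior I'm describing.
It assumes that "Java's UTF-8" means the format written by
DataOutputStream.writeUTF, and compares it with the standard UTF-8 bytes
from String.getBytes. The class and method names (ModifiedUtf8Demo,
modifiedUtf8, standardUtf8, hex) are made up for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any library.
public class ModifiedUtf8Demo {

    // Bytes written by DataOutputStream.writeUTF, with the leading
    // two-byte length prefix stripped off so only the character data remains.
    static byte[] modifiedUtf8(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Bytes produced by the standard UTF-8 charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated uppercase hex, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\0"; // a string holding only the null character
        System.out.println("writeUTF:       " + hex(modifiedUtf8(nul)));
        System.out.println("standard UTF-8: " + hex(standardUtf8(nul)));
    }
}
```

Run as written, the first line should show the two-byte form C0 80 and the
second the single byte 00. Plain ASCII text other than the null character
should come out identical under both encoders.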

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@sunsite.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
| XML: Extensible Markup Language (IDG Books 1998) |
| http://www.amazon.com/exec/obidos/ISBN=0764531999/cafeaulaitA/ |
+----------------------------------+---------------------------------+
| Read Cafe au Lait for Java news: http://sunsite.unc.edu/javafaq/ |
| Read Cafe con Leche for XML news: http://sunsite.unc.edu/xml/ |
+----------------------------------+---------------------------------+


