Re: Over-long Control Characters in UTF-8

From: John Cowan (cowan@locke.ccil.org)
Date: Mon Aug 02 1999 - 10:08:51 EDT


Markus Kuhn wrote:

> [...] LF =
> U+000A = 0x0a = 0xc0 0x8a = 0xe0 0x80 0x8a = 0xf0 0x80 0x80 0x8a = ...
> can be encoded in many ways legally under UTF-8 [...]

Not at all. The Unicode Standard (appendix A) says to encode
everything in the shortest way.
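
To make the shortest-form point concrete, here is a minimal sketch in Python 3. The language choice and the helper name is_valid_utf8 are assumptions for illustration, not part of the thread. It relies only on Python's built-in strict "utf-8" codec, which emits the one-byte form for U+000A and refuses the over-long spellings quoted above.

```python
# Sketch: the shortest-form rule as enforced by one strict UTF-8 codec
# (Python 3's built-in codec; other decoders may behave differently).

LF = "\n"  # U+000A

# A conforming encoder emits the shortest form: a single byte 0x0a.
shortest = LF.encode("utf-8")

# The over-long spellings of U+000A quoted in the message above.
overlong_forms = [
    bytes([0xC0, 0x8A]),
    bytes([0xE0, 0x80, 0x8A]),
    bytes([0xF0, 0x80, 0x80, 0x8A]),
]


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the strict decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One entry per over-long form; the strict decoder rejects each of them.
results = [is_valid_utf8(form) for form in overlong_forms]
```

A decoder that accepted these forms would map several distinct byte strings to the same character, which is what makes over-long sequences a problem for byte-level filtering.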

> The fact that Java abuses the 2-byte encoding of U+0000 (0xc0 0x80)
> to get C string binary transparency for NUL has effectively established
> the practice of using overlong UTF-8 sequences as a hack. :-(

But only in their private protocol for reading and writing String
objects in *binary* files. The names readUTF() and writeUTF() are
misleading, in that UTF-8 is not read or written there. The
UTF-8 codec for InputStreamReader goes by the rules.
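
As a rough illustration of how that private format differs from real UTF-8, the following Python 3 sketch imitates the byte layout that writeUTF() is documented to produce: a two-byte big-endian length followed by the encoded characters, with U+0000 written as 0xc0 0x80. The function names encode_modified_utf8 and write_utf are hypothetical, and this is a sketch of the layout under those assumptions, not Java's own code.

```python
import struct


def encode_modified_utf8(s: str) -> bytes:
    """Hypothetical helper imitating the body of a writeUTF() record."""
    out = bytearray()
    # Java strings are sequences of 16-bit code units, so iterate over
    # UTF-16 code units rather than over code points.
    utf16 = s.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if 0x0001 <= unit <= 0x007F:
            out.append(unit)  # one byte, same as UTF-8
        elif unit <= 0x07FF:
            # Two bytes. This branch also catches U+0000, giving C0 80,
            # which is the over-long form strict UTF-8 forbids.
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            # Three bytes, for every remaining 16-bit code unit.
            out += bytes([
                0xE0 | (unit >> 12),
                0x80 | ((unit >> 6) & 0x3F),
                0x80 | (unit & 0x3F),
            ])
    return bytes(out)


def write_utf(s: str) -> bytes:
    """Hypothetical helper: 2-byte big-endian length prefix, then the body."""
    body = encode_modified_utf8(s)
    if len(body) > 0xFFFF:
        raise ValueError("encoded string too long for a 2-byte length")
    return struct.pack(">H", len(body)) + body


record = write_utf("A\x00")             # length 3, then 'A', C0, 80
real_utf8 = "A\x00".encode("utf-8")     # 'A', then a single 00 byte
```

The encoded body never contains a 0x00 byte, which is the C string transparency mentioned above, while a strict UTF-8 codec both writes NUL as a single 0x00 byte and rejects 0xc0 0x80 on input.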

-- 
	John Cowan	http://www.ccil.org/~cowan	cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
			-- Coleridge / Politzer


