Re: Over-long Control Characters in UTF-8

From: John Cowan (cowan@locke.ccil.org)
Date: Mon Aug 02 1999 - 10:08:51 EDT

Next message: schererm@us.ibm.com: "Re: Latin-1's apostrophe, grave accent, acute accent"
Previous message: Francois Yergeau: "Re: Over-long Control Characters in UTF-8"
Maybe in reply to: Markus Kuhn: "Over-long Control Characters in UTF-8"

Markus Kuhn wrote:

> [...] LF =
> U+000A = 0x0a = 0xc0 0x8a = 0xe0 0x80 0x8a = 0xf0 0x80 0x80 0x8a = ...
> can be encoded in many ways legally under UTF-8 [...]

Not at all. The Unicode Standard (appendix A) says to encode
everything in the shortest way.
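
To illustrate the shortest-form rule, here is a minimal sketch in Python (not part of the original message). It assumes that Python's built-in strict "utf-8" codec is a fair stand-in for a conforming decoder; the helper name decodes_as_utf8 is hypothetical. It tries each of the candidate encodings of LF from the quoted text.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Candidate encodings of LF (U+000A) listed in the quoted message.
candidates = [
    b"\x0a",              # shortest form, 1 byte
    b"\xc0\x8a",          # over-long, 2 bytes
    b"\xe0\x80\x8a",      # over-long, 3 bytes
    b"\xf0\x80\x80\x8a",  # over-long, 4 bytes
]

# One boolean per candidate: accepted or rejected by the strict decoder.
results = [decodes_as_utf8(c) for c in candidates]
```

Under that assumption only the one-byte form is accepted; the over-long forms are rejected rather than silently mapped to U+000A.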

> The fact that Java abuses the 2-byte encoding of the U+0000 (0xc0 0x80)
> to get C string binary transparency for NUL has effectively established
> the practice of using overlong UTF-8 sequences as a hack. :-(

But only in their private protocol for reading and writing String
objects in *binary* files. The names readUTF() and writeUTF() are
misleading, in that UTF-8 is not read or written there. The
UTF-8 codec for InputStreamReader goes by the rules.
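
To make the difference concrete, here is a rough Python sketch (not part of the original message) of the byte layout that writeUTF() is documented to produce: a two-byte big-endian length followed by the string body, with U+0000 written as 0xc0 0x80. The function name write_utf_sketch is hypothetical, and the sketch deliberately handles only characters up to U+FFFF; it is an approximation for illustration, not a reimplementation of the Java method.

```python
import struct


def write_utf_sketch(text: str) -> bytes:
    """Approximate the writeUTF() layout for characters up to U+FFFF.

    Assumption: only the NUL special case is modelled here; anything
    above U+FFFF is rejected rather than guessed at.
    """
    body = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("characters above U+FFFF are outside this sketch")
        if ch == "\x00":
            body += b"\xc0\x80"       # two-byte form of NUL, never a raw 0x00
        else:
            body += ch.encode("utf-8")  # ordinary shortest-form UTF-8
    if len(body) > 0xFFFF:
        raise ValueError("body too long for a two-byte length prefix")
    # Two-byte big-endian byte count, then the encoded body.
    return struct.pack(">H", len(body)) + bytes(body)


encoded = write_utf_sketch("a\x00b")

# A strict UTF-8 decoder refuses the body because of the over-long NUL.
try:
    encoded[2:].decode("utf-8")
    body_is_valid_utf8 = True
except UnicodeDecodeError:
    body_is_valid_utf8 = False
```

The length prefix and the 0xc0 0x80 pair are what make this a private serialization format rather than UTF-8 proper.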

-- 
	John Cowan	http://www.ccil.org/~cowan	cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
			-- Coleridge / Politzer


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:50 EDT