L2/07-152

From: Doug Ewell
Date: 2007-05-09
Subject: Re: Unicode Security Exploit

In L2/07-116, Mark Davis proposes a deterministic mechanism for processes that interpret Unicode code unit sequences to handle ill-formed sequences. While I question whether implementers will uniformly adopt this mechanism, which requires the decoder to "push back" the first code point that identifies the sequence as invalid, it is a well-defined mechanism that resolves an ambiguity in the Standard.

L2/07-134, written by Kent Karlsson, proposes some changes to L2/07-116:

1. It reverses the proposal to exclude the first "good" code point from the invalid sequences, instead leaving the interpretation up to the implementation.

   Making the interpretation consistent is the main point of L2/07-116. Either of the possible interpretations (include or exclude), applied consistently, would be better than formally establishing this as an implementation dependency.

2. It requires, as an alternative to aborting the interpretation process, the replacement of invalid sequences by sequences of U+001A rather than U+FFFD.

   The character U+001A, originally defined in ASCII as SUBSTITUTE, has led a double life for three decades as an end-of-file character in CP/M and MS-DOS systems, and there are still text processes today that stop reading a text stream upon encountering this character. The use of U+FFFD is preferable in this situation and is specifically mentioned in the conformance requirements.

I recommend that these two provisions of L2/07-134 be rejected by the UTC and the corresponding provisions of L2/07-116 be approved.

--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages