L2/07-152

From: Doug Ewell
Date: 2007-05-09
Subject: Re: Unicode Security Exploit 

In L2/07-116, Mark Davis proposes a deterministic mechanism for 
processes that interpret Unicode code unit sequences to handle 
ill-formed sequences.  While I question whether implementers will 
uniformly adopt this mechanism, which requires the decoder to "push 
back" the first code point that identifies the sequence as invalid, it 
is a well-defined mechanism that resolves an ambiguity in the Standard.

L2/07-134, written by Kent Karlsson, proposes some changes to L2/07-116:

1.  It reverses the proposal to exclude the first "good" code point from 
the invalid sequence, instead leaving the interpretation up to the 
implementation.  This defeats the main point of L2/07-116, which is to 
make the interpretation consistent.  Either of the possible 
interpretations (include or exclude), applied consistently, would be 
better than formally establishing this as an implementation dependency.

2.  It requires, as an alternative to aborting the interpretation 
process, the replacement of invalid sequences by sequences of U+001A 
rather than U+FFFD.  The character U+001A, originally defined in ASCII 
as SUBSTITUTE, has led a double life for three decades as an end-of-file 
character in CP/M and MS-DOS systems, and there are still text processes 
today that stop reading a text stream upon encountering this character. 
The use of U+FFFD is preferable in this situation and is specifically 
mentioned in the conformance requirements.
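
To make the two points above concrete, here is a minimal Python sketch of 
a UTF-8 decoder.  It is my own illustration, not text from L2/07-116 or 
L2/07-134: the names expected_length and decode_sketch and the 
parameters push_back and replacement are hypothetical, and I am assuming 
that "push back" means re-examining the unit that revealed the error as 
the possible start of a new sequence.

```python
def expected_length(lead):
    """Total length implied by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte, or a byte never valid in UTF-8


def decode_sketch(data, push_back=True, replacement="\uFFFD"):
    """Decode bytes as UTF-8, emitting `replacement` once per ill-formed sequence.

    push_back=True  : the unit that revealed the error is excluded from the
                      invalid sequence and decoded again on its own.
    push_back=False : that unit is swallowed as part of the invalid sequence.
    """
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            out.append(replacement)
            i += 1
            continue
        # Collect the continuation bytes (0x80..0xBF) that the lead byte calls for.
        j = i + 1
        while j < i + n and j < len(data) and 0x80 <= data[j] <= 0xBF:
            j += 1
        if j == i + n:
            try:
                out.append(data[i:j].decode("utf-8"))
            except UnicodeDecodeError:
                # Right length, wrong content (overlong form, surrogate, etc.).
                out.append(replacement)
            i = j
        else:
            out.append(replacement)
            # data[j], if it exists, is the unit that revealed the error.
            if push_back or j >= len(data):
                i = j
            else:
                i = j + 1
    return "".join(out)


print(repr(decode_sketch(b'\xc2"')))                       # quote survives
print(repr(decode_sketch(b'\xc2"', push_back=False)))      # quote is swallowed
print(repr(decode_sketch(b'\xc2"', replacement="\x1a")))   # U+001A instead of U+FFFD
```

In this sketch, the input C2 22 yields U+FFFD followed by the quotation 
mark when the offending unit is pushed back, and U+FFFD alone when it is 
not.  Passing replacement="\x1a" shows the alternative substitution; a 
downstream process that treats U+001A as end-of-file would stop reading 
at that point.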

I recommend that these two provisions of L2/07-134 be rejected by the 
UTC and the corresponding provisions of L2/07-116 be approved.

--
Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages