Unicode Security Considerations

L2/07-186

From: Frank Ellermann
Date: Apr 24, 2007 4:19 PM
Subject: Re: Comments on Unicode Format for Network Interchange
To: discuss@apps.ietf.org

Markus Scherer wrote:

> *** Suggested change:
> 2. Line-endings MUST be indicated by the sequence Carriage-Return
> (U+000D) followed by Line-Feed (U+000A), or by a single
> Carriage-Return (U+000D), or by a single Line-Feed (U+000A).

-1F

> Justification: We believe that single CR and LF are common because of
> implementation practice on a variety of platforms, and that it is both
> unrealistic and unnecessary to try to legislate them away.

No, it causes havoc.

> Applications already commonly handle all of CR, LF and CR+LF, and some
> support even more characters according to the Unicode Newline
> Guidelines.

The draft isn't about arbitrary text or XML (where you'd also need NEL),
it's about telnet. It tries to extend ALPHA and DIGIT as used in some
syntax constructs for text in Internet protocols, it doesn't try to
introduce a new concept of "line" in these protocols.

> *** Suggested change:
> 4. The UTF-8 signature byte sequence (EF BB BF, UTF-8 encoding of
> U+FEFF, sometimes called Byte Order Mark ("BOM")), when it
> appears at the beginning of the text, SHOULD be deleted by the
> recipient.

I don't think that works. The draft isn't about local text or XML
files, it's about Internet protocols, especially telnet, over the wire.

> If a Word Joiner is needed in the text, U+2060 WORD JOINER SHOULD
> be used instead of U+FEFF ZERO WIDTH NO-BREAK SPACE.

Already covered by STD 63 (RFC 3629).

> *** Suggested change:
> 1. Control codes from both the "C0" (U+0000..U+001F, U+007F)
> and "C1" (U+0080..U+009F) ranges,
> with the exception of HT (09), LF (0A) and CR (0D),
> SHOULD NOT be used unless required by exceptional circumstances.
> Justification: The sets of C0 and C1 control codes that should and
> should not be used should be defined explicitly, and with code point
> values. Only HT, LF and CR are very widely used.

Makes sense, but HT can have surprising effects if it's "expanded" into
one or more spaces, that would need a "security consideration". Does
DEL really belong to the C0 set? Maybe avoiding these old terms is
clearer for readers today.

> *** Suggested change:
> Remove points 2. and 3.

See above. Try to edit a plain text file using LF as line-end with the
tool styling itself as "editor" on a Windows box, and you'll see what I
mean. IIRC there were some hot debates why the IETF ftp server sends
text files with (only) LF, in theory breaking the ABNF in these files.
This is a rathole, please just accept it as some IETF oddity. We're not
forced to use CRLF in local files if we hate it.

> *** Suggested change:
> Drop this second bullet and the following paragraph.

No, folks need to know that Unicode is a moving target to some degree,
that only small and different subsets are supported by most devices,
and that it's horribly complex in comparison with ASCII or many legacy
charsets. The advantage is obvious, some disadvantages are not.

> ***
> Suggested change: Please add a reference for [RFC3629] UTF-8, a
> transformation format of ISO 10646

There is an RFC 3629 (STD 63) reference, it's in the first part with
the normative references.

Frank
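
For readers who want to see the three quoted suggestions side by side, here is a minimal Python sketch of what a lenient recipient might do: accept CR+LF, lone CR, or lone LF as a line ending, drop a leading UTF-8 signature, and flag C0/C1 control codes other than HT, LF and CR. It illustrates the suggested changes only, not the draft's text, and the reply above objects to the first two. All function and constant names are hypothetical and do not come from the draft or the thread.

```python
import re

# Hypothetical names throughout; nothing here is taken from the draft.

UTF8_SIGNATURE = b"\xef\xbb\xbf"        # EF BB BF, the UTF-8 encoding of U+FEFF
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}   # HT, LF, CR


def strip_utf8_signature(data):
    """Drop EF BB BF only when it appears at the very beginning."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


def split_lines(text):
    """Treat CR+LF, a lone CR, and a lone LF each as one line ending."""
    # CR+LF is listed first so it is not counted as two line endings.
    return re.split(r"\r\n|\r|\n", text)


def is_discouraged_control(ch):
    """True for C0/C1 control codes other than HT, LF and CR."""
    cp = ord(ch)
    # The quoted text groups U+007F (DEL) with C0; the reply questions that.
    in_c0 = cp <= 0x1F or cp == 0x7F
    in_c1 = 0x80 <= cp <= 0x9F
    return (in_c0 or in_c1) and cp not in ALLOWED_CONTROLS


def discouraged_controls(text):
    """Return (index, 'U+XXXX') pairs for every discouraged control code."""
    return [(i, "U+%04X" % ord(ch))
            for i, ch in enumerate(text)
            if is_discouraged_control(ch)]


if __name__ == "__main__":
    raw = b"\xef\xbb\xbfone\r\ntwo\rthree\nfour\x07"
    text = strip_utf8_signature(raw).decode("utf-8")
    print(split_lines(text))           # ['one', 'two', 'three', 'four\x07']
    print(discouraged_controls(text))  # [(19, 'U+0007')]
```

Note that HT passes this check even though, as the reply points out, expanding it into spaces can have surprising effects; the sketch does not address that.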