L2/07-126

From: Markus Scherer
Date: Apr 23, 2007 10:48 AM
Subject: Comments on Unicode Format for Network Interchange
To: discuss@apps.ietf.org

Dear Mr. Klensin and Mr. Padlipsky et al.,

I have reviewed and discussed your draft-klensin-net-utf8-03 with some colleagues. We welcome the standardization on UTF-8 as the default internet charset.

We would like to make the following suggestions (each starting with *** and ending with *** *** among quotes from the internet-draft):

[...]

2. Net-Unicode

2.1. Definition

The Network Unicode (Net-Unicode) format is defined as follows:

1. Characters MUST be coded in UTF-8 as defined in [RFC3629].

2. Line-endings MUST be indicated by the sequence Carriage-Return (U+000D) followed by Line-Feed (U+000A).

***
Suggested change:

2. Line-endings MUST be indicated by the sequence Carriage-Return (U+000D) followed by Line-Feed (U+000A), or by a single Carriage-Return (U+000D), or by a single Line-Feed (U+000A).

Justification:
We believe that single CR and LF are common because of implementation practice on a variety of platforms, and that it is both unrealistic and unnecessary to try to legislate them away. Applications already commonly handle all of CR, LF and CR+LF, and some support even more characters according to the Unicode Newline Guidelines.
*** ***

3. Before transmission, all character sequences MUST be normalized according to Unicode method "NFC" (see Section 3).

***
Suggested change:

3. Before transmission, all character sequences SHOULD be normalized according to Unicode method "NFC" (see Section 3).

Justification:
With the MUST language in the draft, we see the following issues:

* The draft later says that recipients should not just assume that incoming text is normalized. Therefore, recipients must already be prepared to at least check for normalization.
  -> We believe that the MUST is not useful.
* The normalization requirement is the reason for the Unicode versioning and stability discussion below which complicates this internet-draft considerably.
  -> We believe that the MUST is not necessary.
* The normalization stability restricts this specification to Unicode versions 3.2 and above (see section 4).
  -> We believe that this is too restrictive. Unicode applications normally handle text from Unicode 2.0 and above.
* We believe that the MUST is unenforceable. Moreover, if recipients must check, it doesn't make any difference whether it is enforced.

(With this change, much of the following text of the internet-draft can be simplified significantly. In particular, the discussions of unassigned/unknown characters, stabilized forms, etc. can and should be dropped.)
*** ***

4. As suggested in Section 6 of RFC 3629, the Byte Order Mark ("BOM") signature MUST NOT appear at the beginning of these text strings.

***
Suggested change:

4. The UTF-8 signature byte sequence (EF BB BF, UTF-8 encoding of U+FEFF, sometimes called Byte Order Mark ("BOM")), when it appears at the beginning of the text, SHOULD be deleted by the recipient. If a Word Joiner is needed in the text, U+2060 WORD JOINER SHOULD be used instead of U+FEFF ZERO WIDTH NO-BREAK SPACE.

Justification:
We believe that the draft text is unnecessarily strong, and at the same time not sufficiently specific for implementers.
*** ***

[...]

2.2. The ASCII NVT Definition

[...]

1. The "defined but not required" codes -- BEL, BS, HT, VT, FF -- and the undefined control codes ("C0") SHOULD NOT be used unless required by exceptional circumstances.

***
Suggested change:

1. Control codes from both the "C0" (U+0000..U+001F, U+007F) and "C1" (U+0080..U+009F) ranges, with the exception of HT (09), LF (0A) and CR (0D), SHOULD NOT be used unless required by exceptional circumstances.

Justification:
The sets of C0 and C1 control codes that should and should not be used should be defined explicitly, and with code point values.
Only HT, LF and CR are very widely used.
*** ***

2. CR MUST NOT appear except when immediately followed by either NUL or LF, with the latter (CR LF) designating the "new line" function. Because page layout is better done in other ways and to avoid other types of confusion, CR NUL SHOULD preferably be avoided.

3. LF CR SHOULD NOT appear except as a side-effect of multiple CR LF sequences (e.g., CR LF CR LF).

***
Suggested change:
Remove points 2. and 3.

Justification:
The other suggested changes permit CR and LF.
*** ***

[...]

4. Versions of Unicode

In retrospect, one of the advantages of ASCII [X3.4-1978] when it was chosen was that the code space was full when the Standard was first published. There was no practical way to add characters or change code point assignments without being obviously incompatible. Unicode does not have that property: there are large blocks of space reserved for future expansion and new versions, with new characters and code point assignments, appear at regular intervals.

While there are some security issues if people deliberately try to trick the system (see Section 6), Unicode version changes should not have a significant impact on the text stream specification of this document for the following reasons:

o The transformation between Unicode code table positions and the corresponding UTF-8 code is algorithmic; it does not depend on whether a code point has been assigned or not.

o The normalization specified here, NFC (see Section 3), performs a very limited set of mappings, much more limited than those of the more extensive NFKC used in, e.g., nameprep [RFC3491].

***
Suggested change:
Drop this second bullet and the following paragraph.

Justification:
They are unnecessary with changing NFC from MUST to SHOULD.
*** ***

The NFC tables may be updated over time as new characters are added, but the Unicode Consortium has guaranteed the stability of all NFC strings.
That is, if a string does not contain any unassigned characters, and it is normalized according to NFC, it will always be normalized according to all future versions of the Unicode Standard. The stability of the Net-Unicode format is thus guaranteed when any implementation that converts text into Net-Unicode format does not permit unassigned characters.

Were Unicode to be changed in a way that violated these assumptions, i.e., that either invalidated the string order of RFC 3629 or that that changed the stability of NFC as stated above, this specification would not apply. Put differently, this specification applies only to versions of Unicode starting with version 3.2 and extending to, but not including, any version for which no changes are made in either the UTF-8 definition or to NFC stability.

***
Suggested change:
Modify the paragraph above, removing references to NFC.

Justification:
As a result, this specification will then apply to versions of Unicode starting with version 2.0.
*** ***

[...]

5.2. The Unicode Applicability Dilemma

[...]

***
Suggested change:
Add an item for a fifth way to get around the problem:
Strongly encourage use of normalization form NFC in interchanged text, but do not require it.

Justification:
This is the alternative discussed here.
*** ***

9.1. Normative References

***
Suggested change:
Please add a reference for
[RFC3629] UTF-8, a transformation format of ISO 10646

Justification:
Missing reference.
*** ***

Best regards,
Markus Scherer
Google Software Internationalization ICU Project Developer
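
Illustrative sketch: the recipient-side behavior implied by the suggested changes above (delete a leading UTF-8 signature, accept CR+LF, CR, and LF as line endings, check for NFC rather than assume it, and flag C0/C1 control codes other than HT, LF and CR) could look roughly like the following in Python. This is a minimal sketch under stated assumptions, not text from the internet-draft or from the comments; every function and constant name here is hypothetical.

```python
import unicodedata

# EF BB BF: the UTF-8 encoding of U+FEFF, as described in suggested item 4.
UTF8_SIGNATURE = b"\xef\xbb\xbf"

# HT, LF and CR: the only control codes exempted in the suggested 2.2 item 1.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def strip_signature(data):
    """Drop a leading UTF-8 signature, if present (hypothetical helper)."""
    if data.startswith(UTF8_SIGNATURE):
        return data[len(UTF8_SIGNATURE):]
    return data


def split_lines(text):
    """Treat CR+LF, lone CR, and lone LF all as line endings."""
    # Replace CR+LF first so that it is not counted as two line endings.
    return text.replace("\r\n", "\n").replace("\r", "\n").split("\n")


def is_nfc(text):
    """Check, rather than assume, that the text is in NFC."""
    return unicodedata.normalize("NFC", text) == text


def discouraged_controls(text):
    """Return C0 (U+0000..U+001F, U+007F) and C1 (U+0080..U+009F)
    control characters other than HT, LF and CR."""
    found = []
    for ch in text:
        cp = ord(ch)
        is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
        if is_control and cp not in ALLOWED_CONTROLS:
            found.append(ch)
    return found


def receive(data):
    """Decode incoming bytes leniently, in the spirit of the suggestions."""
    text = strip_signature(data).decode("utf-8")  # raises on invalid UTF-8
    return {
        "lines": split_lines(text),
        "is_nfc": is_nfc(text),
        "discouraged_controls": discouraged_controls(text),
    }
```

For example, `receive(b"\xef\xbb\xbfa\r\nb\rc\nd")` would report the four lines `a`, `b`, `c`, `d`. The sketch only reports whether the text is in NFC and which discouraged control codes it contains; whether to normalize or reject is left to the application, in keeping with the SHOULD wording proposed above.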