L2/07-351

Title: Modification of Text Regarding Ill-formed Code Unit Sequences
Source: Ken Whistler and Mark Davis
Date: October 11, 2007
Action: For consideration by the UTC

This proposal is the follow-up on our action item 111-A012, to draft proposed language to the effect of specifying the interpretation rules for UTF ill-formed sequences.

After considering various textual approaches, we have consensus that the following draft text can provide the clarification required, particularly regarding the behavior of UTF-8 conversion processes dealing with ill-formed sequences, while keeping the modification of existing text to a minimum. In particular, this proposed text leaves all *existing* conformance clauses and definitions unchanged, but adds two new definitions and a fair amount of clarifying text making use of those new definitions.

We propose that the UTC approve this draft specifically for addition to Unicode 5.1.

The draft text below includes editorial comments. We anticipate that after undergoing editing for inclusion in the actual documentation page for Unicode 5.1, the text change will simply be written in terms of replacing X..Y paragraphs on pp. 100 and 101 with the completely written out and formatted replacement text. The Unicode 5.1 documentation would also include a short paragraph explaining why the UTC has amended and extended this text -- presuming that the text change is approved.

The following is the detailed draft for the text change:

========================= draft text =======================

[[ Modify the existing text on p. 100, in Chapter 3, from D84 through D86, as follows: ]]

D84 Ill-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called ill-formed if and only if it does not follow the specification of that Unicode encoding form.

[[ Note: D84 is unchanged from the existing text ]]

[[ Keep the two existing bullets as is.
]]

D84a Ill-formed code unit subsequence: A non-empty subsequence of a Unicode code unit sequence X which does not contain any code units which also belong to any well-formed subsequence of X.

* In other words, an ill-formed code unit subsequence cannot overlap with a well-formed subsequence.

D85 Well-formed: A Unicode code unit sequence that purports to be in a Unicode encoding form is called well-formed if and only if it does follow the specification of that Unicode encoding form.

[[ Note: D85 is also unchanged from the existing text ]]

D85a Well-formed code unit subsequence: A non-empty, well-formed subsequence of a well-formed Unicode code unit sequence.

[[ Replace the existing bullet for D85 with the following text, unbulleted: ]]

Any Unicode code unit sequence can be partitioned into subsequences that are either well-formed or ill-formed. The sequence as a whole is well-formed if and only if it contains no ill-formed subsequence. The sequence as a whole is ill-formed if and only if it contains at least one ill-formed subsequence.

D86 Well-formed UTF-8 code unit sequence: A well-formed Unicode code unit sequence of UTF-8 code units.

[[ Note: D86 is unchanged from the existing text ]]

[[ Add the specific examples of well-formed and ill-formed UTF-8 here, as follows ]]

* The UTF-8 code unit sequence <41 C3 B1 42> is well-formed, because it can be partitioned into subsequences, all of which match the specification for UTF-8 in Table 3-7. It consists of the following well-formed subsequences: <41>, <C3 B1>, and <42>.
* The UTF-8 code unit sequence <41 C2 C3 B1 42> is ill-formed, because it contains one ill-formed subsequence. There is no subsequence for the C2 byte which matches the specification for UTF-8 in Table 3-7. The code unit sequence is partitioned into one well-formed code unit subsequence, <41>, followed by one ill-formed code unit subsequence, <C2>, followed by two well-formed code unit subsequences, <C3 B1> and <42>.
* In isolation, the UTF-8 code unit sequence <C3> would be ill-formed, but in the context of the UTF-8 code unit sequence <41 C2 C3 B1 42>, <C3> does not constitute an ill-formed code unit subsequence, because the C3 byte is actually the first byte of the well-formed UTF-8 code unit subsequence <C3 B1>. Ill-formed code unit subsequences do not overlap with well-formed code unit subsequences.

[[ Existing text continues unchanged from this point. ]]

===========================================================

[[ Replace the existing paragraph on p. 101 just above Table 3-4 with the following text: ]]

If a Unicode string purports to be in a Unicode encoding form, then it must not contain any ill-formed code unit subsequence.

If a process which verifies that a Unicode string is in a Unicode encoding form encounters an ill-formed code unit subsequence in that string, then it must not identify that string as being in that Unicode encoding form.

[[ Those two paragraphs are only minor modifications of the existing text, to make use of the ill-formed code unit subsequence definition. ]]

A process which interprets a Unicode string must not interpret any ill-formed code unit subsequences in the string as characters. (See conformance clause C10.) Furthermore, such a process must not treat any adjacent well-formed code unit sequences as being part of those ill-formed code unit sequences.

The most important consequence of this requirement on processes is illustrated by UTF-8 conversion processes, which interpret UTF-8 code unit sequences as Unicode character sequences. Suppose that a UTF-8 converter is iterating through an input UTF-8 code unit sequence.
If the converter encounters an ill-formed UTF-8 code unit sequence which starts with a valid first byte, but which does not continue with valid successor bytes (see Table 3-7), it must not consume the successor bytes as part of the ill-formed subsequence whenever those successor bytes themselves constitute part of a well-formed UTF-8 code unit subsequence.

If an implementation of a UTF-8 conversion process stops at the first error encountered, without reporting the end of any ill-formed UTF-8 code unit subsequence, then the requirement makes little practical difference. However, the requirement does introduce a significant constraint if the UTF-8 converter continues past the point of a detected error, perhaps by substituting one or more U+FFFD replacement characters for the uninterpretable, ill-formed UTF-8 code unit subsequence. For example, with the input UTF-8 code unit sequence <C2 41 42>, such a UTF-8 conversion process must not return <U+FFFD> or <U+FFFD, U+0042>, since either of those outputs would be the result of misinterpreting a well-formed subsequence as being part of the ill-formed subsequence. What is expected is that such a process should return <U+FFFD, U+0041, U+0042>.

For a UTF-8 conversion process to consume valid successor bytes is not only non-conformant, but also leaves the converter open to security exploits. See UTR #36, Unicode Security Considerations.

Although a UTF-8 conversion process is required to never consume well-formed subsequences as part of its error handling for ill-formed subsequences, such a process is not otherwise constrained in how it deals with any ill-formed subsequence itself. An ill-formed subsequence consisting of more than one code unit could be treated as a single error or as multiple errors. For example, in processing the UTF-8 code unit sequence <F0 80 80 41>, the only requirement on a converter is that the <41> be processed and correctly interpreted as <U+0041>.
The converter could return <U+FFFD, U+0041>, handling <F0 80 80> as a single error, or <U+FFFD, U+FFFD, U+FFFD, U+0041>, handling each byte of <F0 80 80> as a separate error, or could take other approaches to signalling <F0 80 80> as an ill-formed code unit subsequence.

======================= end of draft =======================
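
[[ Illustrative sketch, not part of the draft text. ]]

The partitioning and conversion rules in the draft can be sketched in a few lines of Python. This is only one possible reading, under stated assumptions: the names TABLE_3_7, well_formed_length, partition, and convert are hypothetical, the byte ranges are transcribed from Table 3-7 as published in the Unicode Standard, and each well-formed run is reported one character at a time. Grouping adjacent ill-formed bytes into a single run, and the per_byte_errors switch, are illustrative choices among the error-handling options the draft leaves open.

```python
# Byte ranges for well-formed UTF-8, transcribed from Table 3-7.
# Each entry: (range of the first byte, ranges of the successor bytes).
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def well_formed_length(data, i):
    """Length of the single-character well-formed subsequence at data[i], else 0."""
    for (lo, hi), successors in TABLE_3_7:
        if lo <= data[i] <= hi:
            if i + 1 + len(successors) > len(data):
                return 0  # truncated: not enough successor bytes
            for offset, (s_lo, s_hi) in enumerate(successors, start=1):
                if not s_lo <= data[i + offset] <= s_hi:
                    return 0  # invalid successor; do not consume it
            return 1 + len(successors)
    return 0  # byte cannot start a well-formed subsequence


def partition(data):
    """Split data into (is_well_formed, bytes) pieces, scanning left to right."""
    parts = []
    i = 0
    while i < len(data):
        n = well_formed_length(data, i)
        if n:
            parts.append((True, data[i:i + n]))
            i += n
        else:
            # Advance by exactly one byte so the next byte is re-examined
            # as a possible start of a well-formed subsequence.
            if parts and not parts[-1][0]:
                parts[-1] = (False, parts[-1][1] + data[i:i + 1])
            else:
                parts.append((False, data[i:i + 1]))
            i += 1
    return parts


def convert(data, per_byte_errors=False):
    """Decode data, substituting U+FFFD for each ill-formed run (or each byte)."""
    out = []
    for ok, chunk in partition(data):
        if ok:
            out.append(chunk.decode("utf-8"))
        else:
            out.append("\uFFFD" * (len(chunk) if per_byte_errors else 1))
    return "".join(out)
```

For the input <41 C2 C3 B1 42> discussed in the draft, partition(bytes.fromhex("41C2C3B142")) reports the C2 byte as the only ill-formed piece and leaves the adjacent well-formed subsequences intact. The key design point is that the scanner advances by a single byte after an error, so it never swallows a byte that begins, or belongs to, a well-formed subsequence.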