L2/00-371
Kenneth Whistler on 10/19/2000 03:04:43 PM
Re: UTC action on malformed/illegal UTF-8 sequences?

Ed,

First of all, I don't think this discussion should be cc'd to unicore, x3l2 *and* the unicode list. You are discussing the specifics of a (possible) proposed change to the Unicode Standard, and the best forum for that is unicore.

Regarding the specifics you are concerned about, I am becoming convinced that the security community has decided this is a security problem in UTF-8. I'm not convinced myself, yet, but in this area, the perception of a problem *is* a problem.

The main issue I see is in the note to D32, which seems to imply that, contrary to the statements elsewhere, irregular UTF-8 sequences *are* o.k., and can in fact be interpreted. I think we need to clean up the note to D32, and the text at the end of Section 3.8.

The statements involving "shall" at the end of Section 3.8, with clear implications for conformance, should be upgraded to explicit numbered conformance clauses under Transformations in Section 3.1. In particular, those requirements are:

"When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used."

"Irregular UTF-8 sequences shall not be used for encoding any other information."

These really belong as explicit conformance clauses (i.e., C12a, C12b), reworded appropriately.

Then all the hedging in D32 and at the end of Section 3.8 about how, well, maybe you can interpret non-shortest UTF-8 anyway, should be recast along these lines:

The Unicode Standard does not *require* a conformant process interpreting UTF-8 to *detect* that an irregular code value sequence has been used.
[[ Fill in here, blah, blah, blah, about how more efficient conversion algorithms can be written for UTF-8 if they don't have to special-case non-shortest, irregular sequence UTF-8... ]]

However, the Unicode Standard does *recommend* that any process concerned about security issues detect and flag (by raising exceptions or other appropriate means) any irregular code value sequence. This recommendation is to help minimize the risk that a security attack could be mounted by utilizing information stored in irregular UTF-8 sequences undetected by an interpreting process.

If we cast things this way, it will be clear to all the concerned security worrywarts (that's their job, man) that the Unicode Standard has considered the issue and has a position on it. It will also be clear in the conformance clauses that conformance to the Unicode Standard itself *requires* the non-production of irregular UTF-8 sequences. However, the standard isn't going to reach out and place a draconian *interpretation* requirement on a UTF-8 interpreting process (most often, we are talking about a UTF-8 --> UTF-16 conversion algorithm) that would force everybody to do the shortest value checking in order to be conformant.

For a reductio ad absurdum, take the C library function strcpy(). As it stands now, right out of the box, the strcpy() function is Unicode conformant for use with UTF-8. If you feed a null-terminated UTF-8 string to it, it will correctly copy the contents of that string into another buffer. But if we went for an overly strong conformance clause regarding irregular-sequence UTF-8, technically strcpy() would no longer be conformant for use in a Unicode application. You would have to rewrite it so that it parsed the UTF-8 stream, checked for irregular sequences, and raised an exception or returned an error if it ran into the sequence 0xC0 0x81, for example.
I know the nitpickers can pick nits on this example, since strcpy() really just copies code units, not characters, but it wouldn't be too hard to find APIs or processes that are concerned with characters per se and that would have similar problems if forced to detect and reject non-shortest UTF-8 in order to be conformant.

--Ken