C7 Fix

L2/09-241

Subject: Removing noncharacter code points in C7
Date: 2009-07-17
From: Mark Davis
To: UTC

I propose that we drop the phrase "or the deletion of noncharacter code points" in C7.

Deleting noncharacters creates a serious security problem. In response, we are making one addition to C7 in 5.2 (the last bullet below, the note on security problems), but we need to go further: we have seen more instances of this problem, and it has become clear that the deletion clause in C7 is very problematic.

Here is the current text:

C7 When a process purports not to modify the interpretation of a valid coded character sequence, it shall make no change to that coded character sequence other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

- Replacement of a character sequence by a compatibility-equivalent sequence does modify the interpretation of the text.
- Replacement or deletion of a character sequence that the process cannot or does not interpret does modify the interpretation of the text.
- Note that security problems can result if noncharacter code points are removed from text received from external sources. For more information, see Section 16.7, Noncharacters, and Unicode Technical Report #36, “Unicode Security Considerations.”
...

Fundamentally, C7 is about meaning. The principle is that "abc<cedilla>d" means the same as "ab<c-cedilla>d". C7 simply gives a more practical statement of that principle.
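
As an illustration of that principle, here is a minimal Python sketch using the standard unicodedata module. The helper name canonically_equivalent is hypothetical, and comparing NFD forms is just one common way to test canonical equivalence, not text from the standard.

```python
import unicodedata

# "abc<cedilla>d": a, b, c, U+0327 COMBINING CEDILLA, d
decomposed = "abc\u0327d"
# "ab<c-cedilla>d": a, b, U+00E7 LATIN SMALL LETTER C WITH CEDILLA, d
precomposed = "ab\u00e7d"


def canonically_equivalent(s: str, t: str) -> bool:
    """Hypothetical helper: two strings are treated as canonically
    equivalent when their NFD normalizations are identical."""
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# The code point sequences differ, but the strings mean the same thing,
# so C7 permits replacing one with the other.
print(decomposed == precomposed)
print(canonically_equivalent(decomposed, precomposed))
```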

So far, so good. But the last clause fails that test - nobody says that:

"abc/.<nonchar1>./d" means the same as "abc/../d".

Fixing C7 doesn't mean that you can't remove <nonchar1> -- it just means that when you remove it, you are changing the meaning, because, well, the strings don't mean the same thing. If they did mean the same thing, then the logical implication is that I could arbitrarily insert a <nonchar1> into an arbitrary string.
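
The following Python sketch illustrates why the two strings cannot mean the same thing. It assumes a hypothetical filter (looks_safe) that rejects ".." in a path, followed by a later stage (strip_noncharacters) that silently deletes noncharacters; both names are invented for illustration, and U+FDD0 stands in for <nonchar1>.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points: U+FDD0..U+FDEF and the
    last two code points of each plane (U+xxFFFE, U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def strip_noncharacters(s: str) -> str:
    """Hypothetical later stage that deletes noncharacters."""
    return "".join(ch for ch in s if not is_noncharacter(ord(ch)))


def looks_safe(path: str) -> bool:
    """Hypothetical security filter: reject parent-directory segments."""
    return ".." not in path


received = "abc/.\uFDD0./d"          # "abc/.<nonchar1>./d"
passed_filter = looks_safe(received)  # the filter sees no ".."

# Deletion after the check turns the string into "abc/../d",
# which the filter would have rejected.
stripped = strip_noncharacters(received)
```

The filter and the deletion step each look harmless in isolation; the problem arises only because the deletion is treated as not changing the meaning of the text.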

Note that we don't say that "abc/.<surrogate1>./d" means the same as "abc/../d" either, nor should we: deletion there is just as problematic (and of course insertion is awful). C7 says what makes a difference in the interpretation of text, and the presence or absence of noncharacters does make a difference. As with private use characters, you may not know what the noncharacter means, but its presence or absence does make a difference, especially for security.

We briefly discussed allowing replacement by FFFD in C7, but we can't say that "abc/.<nonchar1>./d" means the same as "abc/.<FFFD>./d", because that would imply the reverse as well.

The cleanest approach is to fix C7 and verify that we have sufficient warnings against the use of noncharacters or surrogates in open interchange, along with warnings that deleting them on input is a real problem; a good alternative technique is to map them to FFFD on input.
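
A minimal Python sketch of the map-to-FFFD technique follows. The function name sanitize_input is hypothetical, and treating lone surrogates the same way as noncharacters is an assumption made for this illustration.

```python
REPLACEMENT = "\uFFFD"  # U+FFFD REPLACEMENT CHARACTER


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_surrogate(cp: int) -> bool:
    """True for surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def sanitize_input(s: str) -> str:
    """Hypothetical input filter: replace, rather than delete, each
    noncharacter or surrogate code point with U+FFFD."""
    return "".join(
        REPLACEMENT if is_noncharacter(ord(ch)) or is_surrogate(ord(ch)) else ch
        for ch in s
    )


cleaned = sanitize_input("abc/.\uFDD0./d")
# The two dots stay separated, so no "abc/../d" appears.
```

Because the code point is replaced instead of removed, the characters on either side never become adjacent, and the process makes no claim that the interpretation of the text is unchanged.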

This would also require changes in the 3rd and 4th paragraphs in Section 16.7 for consistency.