L2/00-371
Kenneth Whistler on 10/19/2000 03:04:43 PM
Re: UTC action on malformed/illegal UTF-8 sequences?

Ed,

First of all, I don't think this discussion should be cc'd to unicore, x3l2 *and* the unicode list. You are discussing the specifics of a (possible) proposed change to the Unicode Standard, and the best forum for that is unicore.

Regarding the specifics you are concerned about, I am becoming convinced that the security community has decided this is a security problem in UTF-8. I'm not convinced myself, yet, but in this area, the perception of a problem *is* a problem.

The main issue I see is in the note to D32, which seems to imply that, contrary to the statements elsewhere, irregular UTF-8 sequences *are* o.k., and can in fact be interpreted. I think we need to clean up the note to D32, and the text at the end of Section 3.8.

The statements involving "shall" at the end of Section 3.8, with clear implications for conformance, should be upgraded to explicit numbered conformance clauses under Transformations in Section 3.1. In particular, those requirements are:

"When converting a Unicode scalar value to UTF-8, the shortest form that can represent those values shall be used."

"Irregular UTF-8 sequences shall not be used for encoding any other information."

These really belong as explicit conformance clauses (i.e., C12a, C12b), reworded appropriately.

Then all the hedging in D32 and at the end of Section 3.8 about how, well, maybe you can interpret non-shortest UTF-8 anyway, should be recast along these lines:

The Unicode Standard does not *require* a conformant process interpreting UTF-8 to *detect* that an irregular code value sequence has been used.
[[ Fill in here, blah, blah, blah, about how more efficient conversion algorithms can be written for UTF-8 if they don't have to special-case non-shortest, irregular sequence UTF-8... ]]

However, the Unicode Standard does *recommend* that any process concerned about security issues detect and flag (by raising exceptions or other appropriate means) any irregular code value sequence. This recommendation is to help minimize the risk that a security attack could be mounted by utilizing information stored in irregular UTF-8 sequences undetected by an interpreting process.

If we cast things this way, it will be clear to all the concerned security worrywarts (that's their job, man) that the Unicode Standard has considered the issue and has a position on it. It will also be clear in the conformance clauses that conformance to the Unicode Standard itself *requires* the non-production of irregular UTF-8 sequences. However, the standard isn't going to reach out and place a draconian *interpretation* requirement on a UTF-8 interpreting process (most often, we are talking about a UTF-8 --> UTF-16 conversion algorithm) that would force everybody to do the shortest value checking in order to be conformant.

For a reductio ad absurdum, take the C library function strcpy(). As it stands now, right out of the box, the strcpy() function is Unicode conformant for use with UTF-8. If you feed a null-terminated UTF-8 string to it, it will correctly copy the contents of that string into another buffer. But if we went for an overly strong conformance clause regarding irregular-sequence UTF-8, technically strcpy() would no longer be conformant for use in a Unicode application. You would have to rewrite it so that it parsed the UTF-8 stream, checked for irregular sequences, and raised an exception or returned an error if it ran into the sequence 0xC0 0x81, for example.
I know the nitpickers can pick nits on this example, since strcpy() really just copies code units, not characters, but it wouldn't be too hard to find APIs or processes that are concerned with characters per se and that would have similar problems if forced to detect and reject non-shortest UTF-8 in order to be conformant.

--Ken