Filtering and displaying untrusted UTF-8

From: - - (crossroads0000@googlemail.com)
Date: Sun Dec 27 2009 - 11:56:04 CST


    Hello.

    I'm currently trying to figure out which steps to take after receiving
    UTF-8 over a connection. I cannot trust the sender in any way, so
    input validation and filtering HAS to be done. The UTF-8 data is text,
    which is why I also want to filter out control characters which have
    nothing to do with proper text presentation (that is, directional
    markers may be allowed in the UTF-8 stream, but control characters
    like U+0001 may not).

    I want to present my steps here for comments and suggestions. Remember
    that security is paramount and I welcome every suggestion on which
    code points should also be filtered out. Filtering out means in this
    context that they are simply "cut out".

    Here's what I do right now:

    1) Validate that UTF-8 is well-formed with no overlong byte sequences
    or 5 to 6 byte sequences.

    2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
            * 0x0000 - 0x001F (1st bunch of control characters)
            * 0x007F - 0x009F (2nd bunch of control characters)
            * 0xD800 - 0xDFFF (surrogate code points, have no use in UTF-8)
            * 0xE000 - 0xF8FF (private use; since everyone can make up a
    different character for a code point in private use, filter them all)
            * 0xFEFF (byte order mark, no use in UTF-8 and may be
    potentially dangerous if converted later to UTF-16 without proper
    filtering)
            * 0xFFFE (byte order mark in wrong endian format, guaranteed
    never to be assigned as a Unicode character)
            * 0xFFFF (also guaranteed never to be assigned as a Unicode character).

        For the rest, allow all ***assigned*** code points, filter unassigned.

    3) For code points in planes 3 to 13 (unassigned planes) filter the
    complete range 0x30000 to 0xDFFFF.

    4) For code points in plane 14 (SSP) allow all ***assigned*** code
    points, filter unassigned.

    5) For code points in plane 15 and 16 (private use) filter the
    complete range 0xF0000 - 0x10FFFF. Same argument as before: since
    everyone can make up a different character for a code point in private
    use, filter them all.
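
    As a rough illustration, steps 1) to 5) could be sketched in Python
    as below. This is only a sketch under assumptions: the function
    names are made up for this example, "assigned" is taken to mean
    whatever the Unicode database bundled with the Python interpreter
    says, and the general categories Cc, Cs, Co and Cn stand in for the
    control, surrogate, private use and unassigned/noncharacter ranges
    listed above. Planes 3 to 13 and U+FEFF are checked explicitly.

```python
import unicodedata

# General categories that get cut out (assumption: these stand in for the
# explicit ranges in steps 2 to 5):
#   Cc = control, Cs = surrogate, Co = private use,
#   Cn = unassigned (this also covers noncharacters such as U+FFFE, U+FFFF)
DROPPED_CATEGORIES = {"Cc", "Cs", "Co", "Cn"}


def decode_strict(data: bytes) -> str:
    """Step 1: reject malformed UTF-8 instead of repairing it.

    Python's strict UTF-8 decoder raises UnicodeDecodeError on overlong
    forms, encoded surrogates, 5 and 6 byte sequences, and values above
    U+10FFFF.
    """
    return data.decode("utf-8", errors="strict")


def is_allowed(ch: str) -> bool:
    """Steps 2 to 5: decide whether a single code point is kept."""
    cp = ord(ch)
    if cp == 0xFEFF:                 # byte order mark (category Cf)
        return False
    if 0x30000 <= cp <= 0xDFFFF:     # planes 3 to 13, filtered wholesale
        return False
    return unicodedata.category(ch) not in DROPPED_CATEGORIES


def filter_untrusted(data: bytes) -> str:
    """Validate first, then cut out every disallowed code point."""
    return "".join(ch for ch in decode_strict(data) if is_allowed(ch))
```

    Note that this follows the list literally, so tab, line feed and
    carriage return (all in 0x0000 - 0x001F) are cut out as well, while
    directional markers such as U+200F pass through.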

    I'm looking forward to informed comments, especially on point 4). I'm
    not sure whether I should allow any code points from plane 14,
    especially since they seem to be mostly tags (what are they good
    for?). Also, are the steps taken in points 1) to 5) enough?
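
    To see what point 4) would actually let through, one option is to
    list the assigned plane 14 code points by general category. The
    sketch below does that with Python's unicodedata module; the
    function name is made up for this example, and the result depends
    on the Unicode version bundled with the interpreter.

```python
import unicodedata
from collections import Counter


def assigned_in_plane_14() -> Counter:
    """Count assigned plane 14 code points per general category."""
    counts = Counter()
    for cp in range(0xE0000, 0xF0000):
        category = unicodedata.category(chr(cp))
        if category != "Cn":  # Cn = unassigned or noncharacter
            counts[category] += 1
    return counts


if __name__ == "__main__":
    for category, count in sorted(assigned_in_plane_14().items()):
        print(category, count)
```

    Anything reported as Cf (format) or Mn (nonspacing mark) here is
    invisible on its own, which may be a reason to filter it too.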

    My final question is this: of the code points ***higher than*** 127
    that the previous steps allow, which do I have to "HTML encode" if I
    display them in an HTML page? None? Or is it possible that
    characters with code points outside the US-ASCII range may be
    interpreted by the browser in a similar way to <, & and > in the
    US-ASCII range, thereby allowing for an XSS attack?
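
    For reference, the two options I can think of look roughly like
    this in Python. Both are sketches with made-up function names. The
    first escapes only the ASCII markup characters and assumes the page
    declares UTF-8 as its charset; the second additionally turns every
    code point above 127 into a numeric character reference, so the
    output is pure US-ASCII.

```python
import html


def escape_markup_only(text: str) -> str:
    """Escape &, <, >, double and single quotes; leave non-ASCII untouched."""
    return html.escape(text, quote=True)


def escape_to_ascii(text: str) -> str:
    """Escape markup, then replace every non-ASCII code point with &#NNN;."""
    escaped = html.escape(text, quote=True)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

    The second variant does not depend on the browser guessing the
    right charset, at the cost of larger output.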

    Thanks for reading my lengthy post. :-)


