Re: Filtering and displaying untrusted UTF-8

From: Asmus Freytag
Date: Sun Dec 27 2009 - 21:29:33 CST

    First of all, hiding behind "--" as your "friendly" e-mail name isn't
    very friendly, and not really in keeping with how this list has operated
    so far. Especially not if you are expecting to get some very detailed

    As it is, I'm addressing the remainder of the comments to the list as a
    whole, since I have difficulties visualizing a real human behind "--"
    (is that emoticon shorthand for "eyes tightly shut"?).

    On 12/27/2009 9:56 AM, - - wrote:
    > Filtering out means in this context that they are simply "cut out".
    I seem to recall reading that removal of anything from an input stream
    can cause security problems of its own. For my own sake, I'd be curious
    to learn where in an architecture the best place is to "clean up" data.
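    To illustrate the concern about removal: deleting a filtered code point
    can splice the surrounding text into a token that an earlier check never
    saw. The sketch below is hypothetical (the function names and the choice
    of U+0000 as the filtered character are assumptions for illustration, not
    taken from the original post). It contrasts deletion with one commonly
    suggested alternative, substituting U+FFFD REPLACEMENT CHARACTER.

```python
FILTERED = {0x0000}      # assumed filter set, for illustration only
REPLACEMENT = "\ufffd"   # U+FFFD REPLACEMENT CHARACTER


def delete_filtered(text: str) -> str:
    """Drop filtered code points entirely."""
    return "".join(ch for ch in text if ord(ch) not in FILTERED)


def replace_filtered(text: str) -> str:
    """Substitute U+FFFD for each filtered code point."""
    return "".join(REPLACEMENT if ord(ch) in FILTERED else ch for ch in text)


untrusted = "<scr\x00ipt>"
# A check run before filtering does not see the token "<script>" ...
assert "<script>" not in untrusted
# ... but deletion joins the two halves and creates it.
assert "<script>" in delete_filtered(untrusted)
# Substitution keeps the halves apart.
assert "<script>" not in replace_filtered(untrusted)
```

    Which layer should do this clean-up remains the open question; the sketch
    only shows why "cut out" is not automatically the safe option.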
    > Here's what I do right now:
    > 1) Validate that UTF-8 is well-formed with no overlong byte sequences
    > or 5 to 6 byte sequences.
    > 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
    > * 0x0000 - 0x001F (1st bunch of control characters)
    This would eliminate the TAB character. That doesn't seem promising for
    > * 0x007F - 0x009F (2nd bunch of control characters)
    > * 0xD800 - 0xDFFF (surrogate pairs, have no use in UTF-8)
    These surrogates don't occur in well-formed UTF-8. (See
    > * 0xE000 - 0xF900 (private use; since everyone can make up a
    > different character for a code point in private use, filter them all)
    The private use range ends at F8FF, not F900.
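    The boundary can be checked against a Unicode character database. A
    minimal sketch using Python's standard unicodedata module follows; the
    helper name is_private_use is an assumption for illustration. Private-use
    code points carry General_Category Co, while U+F900 is an assigned CJK
    compatibility ideograph.

```python
import unicodedata


def is_private_use(cp: int) -> bool:
    """True if the code point has General_Category Co (private use)."""
    return unicodedata.category(chr(cp)) == "Co"


# BMP private use area: U+E000..U+F8FF inclusive.
assert is_private_use(0xE000)
assert is_private_use(0xF8FF)
# U+F900 is an assigned CJK compatibility ideograph, not private use.
assert not is_private_use(0xF900)
```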
    > * 0xFEFF (byte order mark, no use in UTF-8 and may be
    > potentially dangerous if converted later to UTF-16 without proper
    > filtering)
    > * 0xFFFE (byte order mark in wrong endian format, guaranteed
    > never to be assigned as a Unicode character)
    > * 0xFFFF (also guaranteed never to be assigned as a Unicode character).
    > For the rest, allow all ***assigned*** code points, filter unassigned.
    That's a fool's game, because assigned code points are version
    dependent. Even if one could adopt a "supported version" for one's own
    code, nothing guarantees that the codes were assigned at the time the
    originating software was written. If not, they could represent data that
    wasn't really text in the context it was created in. Further, the minute
    the next version of Unicode comes along, this will prevent the software
    from handling perfectly well-defined and standardized characters.
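    The version dependence is easy to observe. The sketch below assumes
    Python's standard unicodedata module; the helper name is_unassigned is
    hypothetical. It treats General_Category Cn as "unassigned" and reports
    which database version the answer comes from. The same code point can
    flip from rejected to accepted purely because the runtime was upgraded.

```python
import unicodedata


def is_unassigned(cp: int) -> bool:
    """True if this runtime's Unicode database gives the code point
    General_Category Cn (unassigned; this also covers noncharacters)."""
    return unicodedata.category(chr(cp)) == "Cn"


# The answer is tied to the database version bundled with the runtime.
print("Unicode database version:", unicodedata.unidata_version)

# U+20BF BITCOIN SIGN was added in Unicode 10.0, so a filter built on an
# older database rejects it and one built on a newer database accepts it.
for cp in (0x0041, 0x20BF, 0x40000):
    print(f"U+{cp:04X} unassigned here: {is_unassigned(cp)}")
```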

    At the same time, there's no attempt to filter the non-characters in the
    FDD0-FDEF range, which looks like a clear omission.
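    Unlike "assigned" status, the set of noncharacters is fixed, so it can be
    tested without a version-specific table. A minimal sketch follows (the
    function name is an assumption for illustration). Besides FDD0-FDEF, it
    also covers the last two code points of every plane, not only FFFE and
    FFFF in the BMP.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters."""
    # U+FDD0..U+FDEF: 32 contiguous noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


# 32 in the BMP block plus 2 per plane across 17 planes.
count = sum(is_noncharacter(cp) for cp in range(0x110000))
assert count == 66
```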
    > 3) For code points in planes 3 to 13 (unassigned planes) filter the
    > complete range 0x30000 to 0xDFFFF.
    > 4) For code points in plane 14 (SSP) allow all ***assigned*** code
    > points, filter unassigned.
    The "Tag characters" from E0000 to E007F are deprecated and have no
    business in ordinary text. They are a much more useful set of characters
    to consider for filtering than those that are merely "not yet assigned".
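    A range check for that block is short. The sketch below uses the E0000 to
    E007F range exactly as cited here; the names TAG_BLOCK and
    find_tag_characters are hypothetical. It only reports positions, on the
    assumption that rejecting, replacing, or logging is a policy decision for
    the caller.

```python
TAG_BLOCK = range(0xE0000, 0xE0080)  # U+E0000..U+E007F inclusive


def find_tag_characters(text: str) -> list:
    """Return (index, 'U+XXXXX') for each code point in the Tags block."""
    return [
        (i, f"U+{ord(ch):05X}")
        for i, ch in enumerate(text)
        if ord(ch) in TAG_BLOCK
    ]


sample = "a\U000E0041b"  # 'a', a tag character, 'b'
assert find_tag_characters(sample) == [(1, "U+E0041")]
```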
    > 5) For code points in plane 15 and 16 (private use) filter the
    > complete range 0xF0000 - 0x10FFFF. Same argument as before: since
    > everyone can make up a different character for a code point in private
    > use, filter them all.
    In principle, this might be a defensible choice, especially if there's a
    need to compare data from different sources against each other. But
    I'm afraid that this depends on the purposes for which the text is being
    accepted, and that is not clear enough in the context of this discussion.


    This archive was generated by hypermail 2.1.5 : Sun Dec 27 2009 - 21:32:04 CST