Re: Filtering and displaying untrusted UTF-8

From: Dominikus Scherkl (
Date: Mon Dec 28 2009 - 01:35:23 CST


    Asmus Freytag wrote:
    > On 12/27/2009 9:56 AM, - - wrote:
    >> 1) Validate that UTF-8 is well-formed with no overlong byte sequences
    >> or 5 to 6 byte sequences.
    >> 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
    >> * 0x0000 - 0x001F (1st bunch of control characters)
    > This would eliminate the TAB character. That doesn't seem promising for
    > "text".
    It would also filter CR and LF. At least these three should not be
    filtered. I personally would also allow VT (vertical tab).
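
A minimal Python sketch of the well-formedness check and the C0 control filter discussed so far, assuming the input arrives as raw bytes. Python's strict UTF-8 decoder rejects overlong forms, encoded surrogates and the old 5- and 6-byte sequences. The function name and the `allow_vt` switch are hypothetical, chosen only for illustration.

```python
# TAB, LF and CR are kept; everything else below U+0020 is dropped.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def decode_and_strip_c0(data: bytes, allow_vt: bool = False) -> str:
    """Decode strictly as UTF-8, then drop unwanted C0 control characters.

    Raises UnicodeDecodeError if the bytes are not well-formed UTF-8.
    """
    text = data.decode("utf-8", errors="strict")
    allowed = ALLOWED_C0 | ({0x0B} if allow_vt else set())  # 0x0B is VT
    return "".join(ch for ch in text if ord(ch) > 0x1F or ord(ch) in allowed)
```

Whether VT belongs in the allowed set is a policy choice, which is why it is left as a parameter here.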

    >> * 0x007F - 0x009F (2nd bunch of control characters)
    >> * 0xD800 - 0xDFFF (surrogate pairs, have no use in UTF-8)
    > These surrogates don't occur in well-formed UTF-8. (See
    >> * 0xE000 - 0xF900 (private use; since everyone can make up a
    >> different character for a code point in private use, filter them all)
    > The private use range ends at F8FF, not F900
    >> * 0xFEFF (byte order mark, no use in UTF-8 and may be
    >> potentially dangerous if converted later to UTF-16 without proper
    >> filtering)
    >> * 0xFFFE (byte order mark in wrong endian format, guaranteed
    >> never to be assigned as a Unicode character)
    >> * 0xFFFF (also guaranteed never to be assigned as a Unicode
    >> character).
    How about the other non-characters at 1FFFE, 1FFFF, 2FFFE, 2FFFF, ...?
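
The last two code points of every plane (U+FFFE and U+FFFF, U+1FFFE and U+1FFFF, and so on up to U+10FFFF) are noncharacters, so a single mask test covers all of them. A rough sketch; the function name is hypothetical.

```python
def is_plane_end_noncharacter(cp: int) -> bool:
    """True if cp is one of the two final code points of any plane."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    # The low 16 bits are FFFE or FFFF exactly when this mask test holds.
    return (cp & 0xFFFE) == 0xFFFE
```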

    >> For the rest, allow all ***assigned*** code points, filter
    >> unassigned.
    > That's a fool's game, because assigned code points are version
    > dependent. Even if one could adopt a "supported version" for one's own
    > code, nothing guarantees that the codes were assigned at the time the
    > originating software was written. If not, they could represent data that
    > wasn't really text in the context it was created in. Further, the minute
    > the next version of Unicode comes along, this will prevent the software
    > from handling perfectly well-defined and standardized characters.
    > At the same time, there's no attempt to filter the non-characters in the
    > FDD0-FDEF range, which looks like a clear omission.
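
To illustrate both points in Python: the answer to "is this code point assigned?" depends on the Unicode version the runtime was built against (exposed as `unicodedata.unidata_version`), whereas the FDD0-FDEF noncharacters are a fixed range that can be tested directly. The helper names are hypothetical; this is a sketch, not a recommendation to filter on assignment.

```python
import unicodedata


def is_unassigned(ch: str) -> bool:
    """True if this runtime's Unicode database has no assignment for ch.

    The result can change when the interpreter is upgraded to a newer
    Unicode version, which is the version dependence described above.
    """
    return unicodedata.category(ch) == "Cn"


def is_fdd0_noncharacter(cp: int) -> bool:
    """True for the 32 noncharacters U+FDD0..U+FDEF."""
    return 0xFDD0 <= cp <= 0xFDEF


# Record which Unicode version the "assigned" answers are based on.
UNICODE_VERSION = unicodedata.unidata_version
```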
    >> 3) For code points in planes 3 to 13 (unassigned planes) filter the
    >> complete range 0x30000 to 0xDFFFF.
    >> 4) For code points in plane 14 (SSP) allow all ***assigned*** code
    >> points, filter unassigned.
    > The "Tag characters" from E0000 to E007F are deprecated and have no
    > business in ordinary text. They are a much more useful set of characters to
    > consider for filtering than those that are merely "not yet assigned".
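
Stripping the tag characters is a simple range test. A sketch with a hypothetical function name; note that later Unicode versions reuse part of this range for emoji tag sequences, so a blanket filter is an assumption that has to be checked against the text being accepted.

```python
TAG_FIRST = 0xE0000
TAG_LAST = 0xE007F


def strip_tag_characters(text: str) -> str:
    """Remove every code point in the tag block U+E0000..U+E007F."""
    return "".join(ch for ch in text if not TAG_FIRST <= ord(ch) <= TAG_LAST)
```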
    >> 5) For code points in plane 15 and 16 (private use) filter the
    >> complete range 0xF0000 - 0x10FFFF. Same argument as before: since
    >> everyone can make up a different character for a code point in private
    >> use, filter them all.
    > In principle, this might be a defensible choice, especially if there's a
    > need to compare data from different sources against each other. But,
    > I'm afraid that this depends on the purposes for which the text is being
    > accepted, and that is not clear enough in the context of this discussion.
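
Since whether to drop private-use code points depends on the purpose, one option is to make it an explicit switch. A minimal sketch: the ranges are the BMP private use area (ending at F8FF) and planes 15 and 16 as proposed in the original list; the function and parameter names are hypothetical.

```python
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # BMP private use area
    (0xF0000, 0x10FFFF),   # planes 15 and 16, filtered as whole planes
)


def is_private_use(cp: int) -> bool:
    """True if cp falls in one of the ranges listed above."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def filter_private_use(text: str, keep_private_use: bool = False) -> str:
    """Drop private-use code points unless the caller opts to keep them."""
    if keep_private_use:
        return text
    return "".join(ch for ch in text if not is_private_use(ord(ch)))
```

An application that compares data from several sources would leave the switch off; one that only passes text through for a single known producer might turn it on.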

    Best Regards,

    --

    Dominikus Dittes Scherkl

    This archive was generated by hypermail 2.1.5 : Mon Dec 28 2009 - 01:37:10 CST