Re: Filtering and displaying untrusted UTF-8

From: verdy_p (
Date: Mon Dec 28 2009 - 02:16:01 CST


    "Dominikus Scherkl" wrote:
    > Asmus Freytag schrieb:
    > > On 12/27/2009 9:56 AM, - - wrote:
    > >> 1) Validate that UTF-8 is well-formed with no overlong byte sequences
    > >> or 5 to 6 byte sequences.
    > >>
    > >> 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
    > >> * 0x0000 - 0x001F (1st bunch of control characters)
    > >>
    > > This would eliminate the TAB character. That doesn't seem promising for
    > > "text".
    > It would also filter CR and LF. At least these three should not be
    > filtered. I personally would also allow VT (vertical tab).

    Simply for compatibility with many text editors, if I had to keep only one end-of-line control character (with all
    others normalized to it in plain text), I would keep just LF. It maps conveniently to the default "\n" character in
    C/C++ (but to CR on classic Mac OS platforms, where the mappings of \n and \r were historically swapped), and in
    Java and C#, where \n is always LF (you don't have this choice). VT is rarely used as an end-of-line mark; most
    editors will render it with some glyph or with some escaped meta-notation (e.g. in Emacs, vi, or vim with classic
    console charsets).

    But I would definitely not filter the newline controls: normalizing these controls (or the CR+LF sequence) on input
    from external sources will remain necessary, notably because CR+LF is normally mandatory in MIME plain-text formats
    and in many text-based Web protocols, including HTTP or FTP and their secure variants.
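
    A rough sketch of that normalization, in Python. The helper names `to_lf` and `to_crlf` are hypothetical (they
    come from no library); the assumption is that text is held internally with LF only, and that CR+LF is restored
    only when writing to a protocol that requires it.

```python
import re

# Matches CR+LF first, then a lone CR (classic Mac OS style line ends).
_NEWLINE = re.compile(r"\r\n|\r")


def to_lf(text: str) -> str:
    """Normalize CR+LF and lone CR to a single LF."""
    return _NEWLINE.sub("\n", text)


def to_crlf(text: str) -> str:
    """Re-encode text with CR+LF line ends, e.g. for MIME or HTTP output.

    The input is normalized to LF first so that existing CR+LF pairs
    are not doubled into CR+CR+LF.
    """
    return to_lf(text).replace("\n", "\r\n")
```

    Normalizing before converting back is what keeps mixed input (some lines already ending in CR+LF) from being
    corrupted on output.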

    And I would also include FF (mappable as the escape sequence "\f" in C/C++/Java/J#/C#) as another newline and as
    whitespace: it occurs quite frequently in many C/C++ sources, to specify a page break position when printing or
    rendering the source to a paged medium such as a PDF report. It occurs in fact much more frequently than VT, which
    I've never seen and which is probably rejected as an invalid source character in many computer languages, including
    by C/C++ compilers even when they support the "\v" escape for mapping it in string or character literals, or in
    character array initializers.
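
    Putting the points of this thread together, a filter could look roughly like the sketch below (Python). The
    function name `clean_untrusted` and the default set of kept controls (TAB, LF, FF) are assumptions made for
    illustration; VT is left out by default but can be allowed through the `keep` parameter. It relies on Python's
    strict UTF-8 decoder, which rejects overlong forms and 5- or 6-byte sequences.

```python
import re

# Assumption: the C0 controls worth keeping are TAB, LF and FF.
# CR is not listed because it is folded into LF before filtering.
KEEP = frozenset("\t\n\f")


def clean_untrusted(data: bytes, keep: frozenset = KEEP) -> str:
    """Decode untrusted UTF-8 and drop unwanted C0 control characters."""
    # Strict decoding raises UnicodeDecodeError on ill-formed input,
    # including overlong encodings and 5- or 6-byte sequences.
    text = data.decode("utf-8", errors="strict")
    # Normalize CR+LF and lone CR to LF before filtering.
    text = re.sub(r"\r\n|\r", "\n", text)
    # Drop every remaining control in U+0000..U+001F that is not in `keep`.
    return "".join(ch for ch in text if ch > "\x1f" or ch in keep)
```

    To let VT through as well, one would call `clean_untrusted(data, keep=KEEP | {"\v"})`. This sketch only covers
    the C0 range discussed here, not the other ranges from the original proposal.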

