Re: Filtering and displaying untrusted UTF-8

From: Jason Schauberger (crossroads0000@googlemail.com)
Date: Mon Dec 28 2009 - 17:05:23 CST


    Hello again.

    On Mon, Dec 28, 2009 at 7:50 PM, Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:
    > Therefore, regarding U+FEFF as not allowed in plain text datastream would be
    > a big mistake, even though filtering it out would normally result in
    > inferior typography at most.

    Then it's probably best to disallow it only when it is the first code
    point, and to let it through otherwise.
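
    A minimal sketch of that rule in Python, assuming the input has
    already been decoded to a string (the function name is made up for
    illustration):

```python
BOM = "\ufeff"  # U+FEFF ZERO WIDTH NO-BREAK SPACE, also used as byte order mark


def strip_leading_bom(text: str) -> str:
    """Drop U+FEFF only when it is the very first code point."""
    if text.startswith(BOM):
        return text[1:]
    # Any later U+FEFF is left alone, since it may be meaningful in the text.
    return text
```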

    On Mon, Dec 28, 2009 at 3:48 AM, verdy_p <verdy_p@wanadoo.fr> wrote:
    > May be the NEXT LINE (U+0085) character, in C1 controls, part of all ISO 8859 charsets (for MIME) at position 0x85,
    > which is valid as a line separator or as a blank in HTML?
    > You may want to replace it with CRLF sequences, or you may want to uniformize the various encodings of newlines (CR
    > not followed by LF, CR+LF, LF not following CR, NL) into a single one (such as LF, for compatibility with C language
    > standard I/O) on input (and generate CR+LF on output).
    >

    That's a good idea. I wonder whether there are any more code points
    which should be encoded in HTML?
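
    The newline unification suggested above could be sketched like this
    in Python. The function names are hypothetical, and the choice of LF
    internally and CR+LF on output just follows the suggestion quoted
    above:

```python
import re

# CR+LF is listed first so the pair is matched as one unit, not as CR then LF.
_NEWLINES = re.compile("\r\n|\r|\x85")


def normalize_newlines(text: str) -> str:
    """Map CR+LF, lone CR and NEL (U+0085) to a single LF."""
    return _NEWLINES.sub("\n", text)


def to_output(text: str) -> str:
    """Normalize first, then emit CR+LF for every line break."""
    return normalize_newlines(text).replace("\n", "\r\n")
```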

    On Mon, Dec 28, 2009 at 4:29 AM, Asmus Freytag <asmusf@ix.netcom.com> wrote:
    >>
    >> 2) For code points in planes 0 to 2 (BMP, SMP, SIP) filter the following:
    >>        * 0x0000 - 0x001F (1st bunch of control characters)
    >>
    >
    > This would eliminate the TAB character. That doesn't seem promising for
    > "text".

    Agreed. As others have pointed out, the newline character(s) and
    similar should be exempt from that filter as well.
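
    As a rough sketch, the 0x0000 - 0x001F rule with those exceptions
    could look like this in Python. The exact set of exceptions (TAB, LF,
    CR) is my assumption, not something settled in this thread:

```python
# Assumed exceptions: TAB, LF and CR stay, every other C0 control is filtered.
ALLOWED_C0 = {0x09, 0x0A, 0x0D}


def is_filtered_c0(cp: int) -> bool:
    """True for code points 0x0000-0x001F that are not in the exception set."""
    return 0x00 <= cp <= 0x1F and cp not in ALLOWED_C0


def drop_c0_controls(text: str) -> str:
    """Remove the filtered C0 controls and keep everything else."""
    return "".join(ch for ch in text if not is_filtered_c0(ord(ch)))
```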

    >>
    >>    For the rest, allow all ***assigned*** code points, filter unassigned.
    >>
    >
    > That's a fool's game, because assigned code points are version dependent.
    > Even if one could adopt a "supported version" for one's own code, nothing
    > guarantees that the codes were assigned at the time the originating software
    > was written. If not, they could represent data that wasn't really text in
    > the context it was created in. Further, the minute the next version of
    > Unicode comes along, this will prevent the software from handling perfectly
    > well-defined and standardized characters.

    I tend to disagree. Of course it's likely that currently unassigned
    code points will be assigned characters in future Unicode versions.
    However, it's also possible that some of them will be assigned non-characters.
    Then what's the point in filtering out any non-characters at all, if
    you're completely neglecting the possibility that new non-characters
    or control characters may be added in future versions and your
    algorithm is potentially leaving them unfiltered? This is not only
    inconsistent, but renders the current attempt at filtering completely
    moot. And if you argue that you could update the algorithm to be also
    aware of the new control characters, the same can be said about
    updating the algorithm to be aware of newly assigned text characters.

    I think it is much more consistent to offer an API call to get the
    current Unicode database version used and an easy way to update it
    when a new Unicode version is released, especially since AIUI most
    written and spoken languages are already represented in the current
    Unicode version. Hence, the possibility of interchanged text becoming
    illegible due to completely new characters being filtered is rather
    slim.
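
    Python's standard unicodedata module happens to work the way I have
    in mind, so here is a small sketch using it. The two wrapper names
    are hypothetical. Note that the general category "Cn" covers both
    unassigned code points and non-characters, so this check alone does
    not tell them apart:

```python
import unicodedata


def database_version() -> str:
    """Version of the Unicode Character Database bundled with this Python."""
    return unicodedata.unidata_version


def is_assigned(ch: str) -> bool:
    """False when the bundled database reports general category Cn.

    Surrogates (Cs) and private use (Co) count as assigned here, so they
    would need a separate rule.
    """
    return unicodedata.category(ch) != "Cn"
```

    The answer depends on the database version in use, which is exactly
    why the version should be queryable.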

    >
    > At the same time, there's no attempt to filter the non-characters in the
    > FDD0-FDEF range, which looks like a clear omission.

    I agree, FDD0-FDEF should be added to the list of characters to
    filter/replace. Same goes for 1FFFE, 1FFFF, 2FFFE, 2FFFF, and so on
    (the last two code points of every plane).
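
    A sketch of a non-character test in Python, covering FDD0-FDEF plus
    the last two code points of every plane (FFFE, FFFF, 1FFFE, 1FFFF,
    and so on up to 10FFFE, 10FFFF). The function name is made up:

```python
def is_noncharacter(cp: int) -> bool:
    """True for the non-character code points described above."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The low 16 bits are FFFE or FFFF for the last two code points of a plane.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```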

    >>
    >> 3) For code points in planes 3 to 13 (unassigned planes) filter the
    >> complete range 0x30000 to 0xDFFFF.
    >>
    >> 4) For code points in plane 14 (SSP) allow all ***assigned*** code
    >> points, filter unassigned.
    >>
    >
    > The "Tag characters" from E0000 to E007F are deprecated and have no business
    > in ordinary text. Much more useful set of characters to consider for
    > filtering than those that are merely "not yet assigned".

    I agree here, too.
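
    For illustration, rules 3) and the tag character range could be
    checked like this in Python. The helper names are hypothetical, and
    treating planes 3 to 13 as entirely unassigned is the assumption made
    in the proposal quoted above:

```python
def plane(cp: int) -> int:
    """Plane number of a code point (0 = BMP, 1 = SMP, 2 = SIP, 14 = SSP)."""
    return cp >> 16


def in_unassigned_plane(cp: int) -> bool:
    """True for 0x30000-0xDFFFF, assumed here to hold no assigned characters."""
    return 3 <= plane(cp) <= 13


def is_tag_character(cp: int) -> bool:
    """True for the tag character range E0000-E007F in plane 14."""
    return 0xE0000 <= cp <= 0xE007F
```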

    I wonder whether it would be better not to drop the filtered code
    points, but to replace each of them with a replacement code point
    like 0xFFFD--"used to replace an incoming character whose value is
    unknown or unrepresentable in Unicode". Any thoughts?
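
    In Python, replacing rather than dropping might look like the sketch
    below. The function name and the predicate parameter are made up for
    illustration; the predicate stands for whatever filter rules we end
    up agreeing on:

```python
from typing import Callable

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_filtered(text: str, is_filtered: Callable[[int], bool]) -> str:
    """Substitute U+FFFD for every code point the predicate rejects."""
    return "".join(
        REPLACEMENT if is_filtered(ord(ch)) else ch
        for ch in text
    )
```

    One point in favour of replacing: the result keeps the same number
    of code points, and removing a character can never join its two
    neighbours into something that was not in the input.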

    Kind regards.


