Re: HTML5 encodings (was: Re: BOCU patent)

From: Doug Ewell
Date: Mon Dec 21 2009 - 08:38:00 CST

    Peter Krefting <peter at opera dot com> wrote:

    >> "User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
    >> encodings."
    >> Amazing, isn't it? So thoughtful of the HTML 5 WG to protect
    >> developers' time by prohibiting a handful of selected encodings.
    > There are some security issues related to these, and they are very
    > rarely used on actual web pages, which is why they are on the
    > "prohibited" list. Full reasoning behind it can probably be found on
    > the HTML5 mailing list, although I do not have a link to share. One of
    > the problems is that they are not ASCII based, and theoretically
    > something like "<script>" can be encoded in such a way that a naïve
    > ASCII-based parser wouldn't find it and filter it away from
    > user-submitted input, making it easier to do cross-domain attacks.

    SCSU is completely ASCII-based, as long as the text is in single-byte
    mode, which would be the case for the entire HTML header, and usually
    the entire text when encoding small alphabets. In "Unicode mode," SCSU
    is essentially UTF-16BE (with a non-ASCII escape for some private-use
    characters), and UTF-16BE is not prohibited.
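
To make the two modes concrete, here is a rough Python sketch of the two trivial SCSU cases mentioned above. It is not a full SCSU encoder (no window switching, no compression), and the helper names are made up for illustration. It assumes the tag values from UTS #6: SCU (0x0F) switches to Unicode mode, and lead bytes 0xE0 through 0xF2 are reserved for tags in that mode, which is why some private-use characters need a quote.

```python
# Sketch only: covers the two trivial SCSU cases, not the full
# window-switching encoder defined in UTS #6.

SCU = 0x0F  # SCSU tag: switch from single-byte mode to Unicode mode


def scsu_single_byte_ascii(text: str) -> bytes:
    """Encode printable ASCII (plus tab, LF, CR) in SCSU single-byte mode.

    In the initial state these bytes pass through unchanged, so the
    result is byte-for-byte identical to the ASCII encoding.
    """
    for ch in text:
        if not (0x20 <= ord(ch) <= 0x7E or ch in "\t\n\r"):
            raise ValueError(f"outside the pass-through range of this sketch: {ch!r}")
    return text.encode("ascii")


def scsu_unicode_mode(text: str) -> bytes:
    """Encode text as one SCU tag followed by plain UTF-16BE code units.

    Code units whose lead byte is 0xE0..0xF2 collide with Unicode-mode
    tags and would need a UQU quote; this sketch simply rejects them.
    """
    payload = text.encode("utf-16-be")
    for i in range(0, len(payload), 2):
        if 0xE0 <= payload[i] <= 0xF2:
            raise ValueError("code unit needs a UQU quote, not handled here")
    return bytes([SCU]) + payload


header = "<html><head><title>x</title></head>"
assert scsu_single_byte_ascii(header) == header.encode("ascii")
assert scsu_unicode_mode("日本語")[1:] == "日本語".encode("utf-16-be")
```

An ASCII-based parser sees the header exactly as it would in ASCII or Latin-1, and everything after the SCU tag is ordinary UTF-16BE.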

    The security issue is largely a red herring. Security of HTML encodings
    is related to incorrect auto-discovery of encodings, not to using
    encodings that have been properly announced. Even UTF-7, while
    generally undesirable and unnecessary for Web pages, is "secure" if
    correctly identified.
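
A small Python illustration of the difference, using the standard "utf-7" codec. The function names and the sample payload are hypothetical; the point is only that a byte-level ASCII scan misses the tag, while a filter that first decodes with the announced encoding sees it like any other text.

```python
def naive_filter_finds_script(raw: bytes) -> bool:
    """Scan the raw bytes as if they were ASCII (the 'naive' filter)."""
    return b"<script" in raw.lower()


def declared_filter_finds_script(raw: bytes, encoding: str) -> bool:
    """Decode with the announced encoding first, then scan the text."""
    return "<script" in raw.decode(encoding).lower()


# "<" and ">" written as UTF-7 base64 runs; the rest is direct ASCII.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

assert payload.decode("utf-7") == "<script>alert(1)</script>"
assert not naive_filter_finds_script(payload)          # byte scan misses it
assert declared_filter_finds_script(payload, "utf-7")  # decoded scan finds it
```

The hole appears only when the filter and the browser disagree about the encoding, which is a detection problem rather than a property of UTF-7 itself.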

    Henri Sivonen stated that the main reason for prohibiting encodings was
    to avoid "wasting developer time" and to focus attention on supporting
    new features instead. Apparently he didn't feel developers were capable
    of both.

    Doug Ewell  |  Thornton, Colorado, USA  |
    RFC 5645, 4645, UTN #14  |  ietf-languages @
