RE: HTML5 encodings (was: Re: BOCU patent)

From: Phillips, Addison
Date: Thu Dec 24 2009 - 12:19:29 CST

    Many multibyte encodings are stateful, either because the leading and trailing byte values overlap or because the interpretation of the code units depends on the position of other values in the data stream. The UTF-x Unicode encodings are not stateful in these ways.

    UTF-8 and UTF-16 are, however, variable width encodings.
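
    To make the distinction concrete, here is a minimal Python sketch. The sample characters, the helper names (code_units, utf8_role), and the choice of Shift_JIS as the contrasting legacy encoding are my own illustrations, not something taken from the HTML5 discussion.

```python
# Count the code units each UTF needs for one character.
def code_units(ch: str) -> dict:
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# In UTF-8 the role of a byte follows from its own value alone,
# because the lead and trail byte ranges never overlap.
def utf8_role(byte: int) -> str:
    if byte < 0x80:
        return "single"
    if byte < 0xC0:
        return "trail"
    return "lead"

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))

# Contrast (illustrative choice): in Shift_JIS the trail byte of
# KATAKANA LETTER SO is 0x5C, the same value as an ASCII backslash,
# so a byte cannot be interpreted without knowing what preceded it.
sjis = "\u30bd".encode("shift_jis")
print(sjis.hex(), sjis[1] == ord("\\"))
```

    UTF-8 and UTF-16 need a varying number of code units per character and UTF-32 always needs one, but in none of them does a code unit's role depend on anything beyond the unit itself.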

    The problem with UTF-32 in HTML5, AFAICT from the discussion about it, had little to do with using it for a security exploit. It has to do with the complications that it adds to encoding detection, combined with the general feeling that UTF-32 ought to be avoided as a wire encoding anyway. See, for example, the discussion about it here:

    I don't personally think that banning UTF-32 is absolutely necessary, but I do agree that it doesn't make a lot of sense for anyone to be using it. Banning implementations of it is a pretty effective form of discouragement.
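
    As one illustration of the encoding-detection complication mentioned above: the UTF-32LE byte order mark begins with the same two bytes as the UTF-16LE one. The sketch below is a hypothetical BOM sniffer of my own (sniff_bom and its know_utf32 flag are invented names), not the HTML5 algorithm, and this may not be the only complication raised in that discussion.

```python
import codecs

# Hypothetical BOM sniffer. Longer BOMs must be tested first,
# because BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE).
def sniff_bom(data: bytes, know_utf32: bool = True):
    candidates = []
    if know_utf32:
        candidates += [
            ("utf-32-le", codecs.BOM_UTF32_LE),
            ("utf-32-be", codecs.BOM_UTF32_BE),
        ]
    candidates += [
        ("utf-8", codecs.BOM_UTF8),
        ("utf-16-le", codecs.BOM_UTF16_LE),
        ("utf-16-be", codecs.BOM_UTF16_BE),
    ]
    for name, bom in candidates:
        if data.startswith(bom):
            return name
    return None

utf32_doc = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")

print(sniff_bom(utf32_doc, know_utf32=True))   # sniffer that knows UTF-32
print(sniff_bom(utf32_doc, know_utf32=False))  # sniffer that knows only UTF-8/16
print(repr(utf32_doc.decode("utf-16")))        # what the second sniffer's guess yields
```

    A sniffer that supports UTF-32 and one that does not will label the same bytes differently, and the UTF-16 reading decodes without error into text interleaved with U+0000.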

    Addison Phillips
    Globalization Architect -- Lab126

    Internationalization is not a feature.
    It is an architecture.

    > -----Original Message-----
    > From: []
    > On Behalf Of Doug Ewell
    > Sent: Wednesday, December 23, 2009 10:38 PM
    > To: Unicode Mailing List
    > Cc: André Szabolcs Szelp; Peter Krefting
    > Subject: Re: HTML5 encodings (was: Re: BOCU patent)
    > André Szabolcs Szelp wrote:
    >
    > >> Well, here at Opera we had to disable support for two encodings
    > >> (UTF-7 and UTF-32) to become HTML5 conformant, if that isn't a
    > >> waste of developer time, I don't know what is :-)
    > >
    > > UTF-32 is stateful/poses a security risk?
    >
    > Only if someone thinks the existence of BE and LE variants poses a
    > security risk or constitutes statefulness in some way.
    >
    > Some people think "stateful" extends to multi-byte encodings, because
    > you have to keep track of where you are within the sequence (lead code
    > unit, first trailing code unit, etc.). By that measure, UTF-32 is
    > actually less stateful than -8 or -16.
    >
    > --
    > Doug Ewell | Thornton, Colorado, USA |
    > RFC 5645, 4645, UTN #14 | ietf-languages @

    This archive was generated by hypermail 2.1.5 : Thu Dec 24 2009 - 12:21:38 CST