Re: HTML5 encodings (was: Re: BOCU patent)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 26 2009 - 10:01:17 CST


    "Doug Ewell" wrote:
    > André Szabolcs Szelp wrote:
    >
    > >> Well, here at Opera we had to disable support for two encodings
    > >> (UTF-7 and UTF-32) to become HTML5 conformant, if that isn't a waste
    > >> of developer time, I don't know what is :-)
    > >
    > > UTF-32 is stateful/poses a security risk?
    >
    > Only if someone thinks the existence of BE and LE variants poses a
    > security risk or constitutes statefulness in some way.
    >
    > Some people think "stateful" extends to multi-byte encodings, because
    > you have to keep track of where you are within the sequence (lead code
    > unit, first trailing code unit, etc.). By that measure, UTF-32 is
    > actually less stateful than -8 or -16.

    How could you avoid a multibyte encoding for transporting
    characters of the UCS? Single-byte encodings are a dead end as soon
    as you have to manage many encodings, which means even more state
    just to maintain the long list of encodings to support (some of
    them with more ambiguities in their mapping to the UCS than
    characters actually encoded in the UCS). And most of these charsets
    are no longer registered internationally: the ISO working group on
    them has closed its work, even if an IANA registry remains open,
    mostly for a few large vendors (which have largely stopped
    developing new 8-bit charsets) and for national authorities.

    If I look at UTF-32BE or UTF-32LE alone, the decoder has only 4
    states (merging the final state with the initial state). Mixing the
    two byte orders and supporting the optional BOM requires adding 3
    more states, so you finally have 11 states for UTF-32. With UTF-8
    you only need 10 states (counting them for each possible sequence
    length, and merging the final states with the initial state), one
    less than UTF-32. So UTF-8 still wins: it is LESS stateful than
    UTF-32...
    Only UTF-32BE or UTF-32LE taken alone win in terms of number of
    states. UTF-16BE and UTF-16LE also win compared to UTF-8 (they have
    the same number of states as UTF-32BE and UTF-32LE).
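
    To make the state counting concrete, here is a rough sketch (my own
    illustration, not a normative decoder) of a UTF-8 decoder written
    as an explicit state machine in C. The state is simply the number
    of continuation bytes still expected, with the final state merged
    into the initial state; overlong forms and surrogates are not
    validated here, which is precisely why a fully validating decoder
    needs the extra states counted above.

        #include <stdint.h>

        typedef struct {
            int      pending;   /* continuation bytes still expected;
                                   0 is the merged initial/final state */
            uint32_t codepoint; /* code point accumulated so far */
        } Utf8Decoder;

        /* Feed one byte; returns the decoded code point, -1 while in
           the middle of a sequence, or -2 on an ill-formed byte. */
        static int32_t utf8_feed(Utf8Decoder *d, uint8_t b)
        {
            if (d->pending == 0) {          /* expect a lead byte */
                if (b < 0x80) return b;     /* ASCII: stay in state 0 */
                if ((b & 0xE0) == 0xC0) { d->pending = 1; d->codepoint = b & 0x1F; return -1; }
                if ((b & 0xF0) == 0xE0) { d->pending = 2; d->codepoint = b & 0x0F; return -1; }
                if ((b & 0xF8) == 0xF0) { d->pending = 3; d->codepoint = b & 0x07; return -1; }
                return -2;                  /* stray continuation or invalid lead */
            }
            if ((b & 0xC0) != 0x80) {       /* not a continuation byte */
                d->pending = 0;
                return -2;
            }
            d->codepoint = (d->codepoint << 6) | (b & 0x3F);
            if (--d->pending == 0)          /* sequence complete: back to state 0 */
                return (int32_t)d->codepoint;
            return -1;
        }

    A UTF-32LE decoder written the same way needs only the 4 states
    mentioned above: a byte counter from 0 to 3, with state 0 doubling
    as the final state.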

    Actually, when you have to manage invalid byte sequences, you need
    further states to handle resynchronization (to avoid generating
    multiple replacement characters in the decoder): you need more
    states to resynchronize UTF-8 than UTF-16BE/UTF-16LE or
    UTF-32BE/UTF-32LE. So in the end, even UTF-16BE and UTF-16LE win on
    resynchronization.
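
    Here is a sketch of that resynchronization step (again my own
    simplified illustration, without bounds handling at the very end of
    a buffer): after an ill-formed sequence, a UTF-8 decoder has to
    scan forward past continuation bytes to find the next possible
    lead byte, whereas a UTF-16 or UTF-32 decoder just steps to the
    next fixed-width code unit boundary, with no extra scanning state.

        #include <stddef.h>
        #include <stdint.h>

        /* UTF-8: skip to the next candidate lead byte so that only one
           U+FFFD is emitted for the whole ill-formed run. */
        static size_t utf8_resync(const uint8_t *buf, size_t i, size_t len)
        {
            while (i < len && (buf[i] & 0xC0) == 0x80)
                i++;            /* swallow continuation bytes 0x80..0xBF */
            return i;
        }

        /* UTF-16/UTF-32: resynchronization is just "advance one code
           unit"; no scanning loop and no extra states are needed. */
        static size_t utf16_resync(size_t i) { return i + 2; }
        static size_t utf32_resync(size_t i) { return i + 4; }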

    Clearly, UTF-16BE and UTF-16LE are the simplest encodings, with the
    fewest states; they will probably be more secure and definitely
    faster to process for very large volumes at high rates (such as in
    memory). As UTF-16 also roughly halves the memory footprint
    compared to UTF-32, it will not invalidate the data caches as often
    as UTF-32, where that matters. This is important for new massively
    parallel architectures with many cores and little local memory,
    because a fast cache is expensive and therefore small, and the
    communication buses are much slower (even more critical with
    storage I/O or network I/O).
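
    The factor of two is easy to check with some back-of-the-envelope
    arithmetic (a sketch, assuming text dominated by BMP characters):

        #include <stddef.h>
        #include <stdint.h>

        /* Encoded size in bytes of one code point under each UTF. */
        static size_t utf8_bytes(uint32_t cp)
        {
            return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        static size_t utf16_bytes(uint32_t cp)
        {
            return cp < 0x10000 ? 2 : 4;  /* surrogate pair above the BMP */
        }
        static size_t utf32_bytes(uint32_t cp)
        {
            (void)cp;
            return 4;                     /* always fixed width */
        }

    For any text made only of BMP characters, utf16_bytes is exactly
    half of utf32_bytes, which is the cache-footprint argument above.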

    If UTF-8 was chosen, it was because it was a good compromise with
    today's networking technologies. But when tomorrow we have
    gigabit/s transfer rates on the Internet, so that the network is
    almost as fast as local storage or processing memory, the
    difference in speed will shrink until it becomes insignificant
    (what will really count in networking is less the volume and the
    data rate than the number of exchanges needed between two nodes on
    the net: protocols will need to be smarter and include more caches
    to reduce the number of long-distance requests). In that case the
    encoding will really not matter, except for storage in the most
    local caches (think then about bus I/O contention, where UTF-8 will
    be quite bad compared to UTF-16BE/LE due to data alignment
    constraints, even if UTF-32BE/LE will probably remain poorer for
    quite a long time...)

    Now suppose that tomorrow 32-bit or 64-bit computing becomes
    universal, and processors only support 32-bit or 64-bit accesses
    (octet access will no longer be supported as a single operation,
    except through bitfield operations, because the addressable byte
    will be 32 or 64 bits wide, to reduce the complexity of bus I/O
    contention caused by misaligned data). In that case UTF-32BE/LE
    will probably win: such processors will probably be little-endian
    (this has become the standard, given that there are no longer any
    new BE processors), and the best encoding will be UTF-32LE.

    Note that the C and C++ standards only require that the char
    datatype be the smallest addressable unit of memory; they do not
    say that "char" MUST be exactly 8-bit.


