Re: HTML5 encodings (was: Re: BOCU patent)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 26 2009 - 10:01:17 CST


    "Doug Ewell" wrote:
    > André Szabolcs Szelp wrote:
    >
    > >> Well, here at Opera we had to disable support for two encodings
    > >> (UTF-7 and UTF-32) to become HTML5 conformant, if that isn't a waste
    > >> of developer time, I don't know what is :-)
    > >
    > > UTF-32 is stateful/poses a security risk?
    >
    > Only if someone thinks the existence of BE and LE variants poses a
    > security risk or constitutes statefulness in some way.
    >
    > Some people think "stateful" extends to multi-byte encodings, because
    > you have to keep track of where you are within the sequence (lead code
    > unit, first trailing code unit, etc.). By that measure, UTF-32 is
    > actually less stateful than -8 or -16.

    How could you avoid a multibyte encoding for transporting
    characters of the UCS? Single-byte encodings are a dead end as soon
    as you have to manage many encodings, which means even more state
    just to maintain the long list of encodings to support (some of
    them with more ambiguities in their mapping to the UCS than
    characters actually encoded in the UCS). And most of these charsets
    are no longer registered internationally: the ISO working group on
    them has closed its work, even if an IANA registry remains open,
    mostly for a few large vendors (which have largely stopped
    developing new 8-bit charsets) and for national authorities.

    If I look at UTF-32BE or UTF-32LE alone, the decoder has only 4
    states (merging the final state with the initial state). Mixing the
    two byte orders and supporting the optional BOM requires adding 3
    more states, so you finally have 11 states for UTF-32. With UTF-8
    you only need 10 states (counting them for each possible sequence
    length, and merging the final states with the initial state), one
    less than UTF-32. So UTF-8 still wins: it is LESS stateful than
    UTF-32...
    Only UTF-32BE or UTF-32LE taken alone win in terms of number of
    states. UTF-16BE and UTF-16LE also win compared to UTF-8 (they have
    the same number of states as UTF-32BE and UTF-32LE).
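
    To make the state counting concrete, here is a rough sketch (my own
    illustration, not a normative decoder) of a UTF-8 decoder written
    as an explicit state machine in C. The state is simply the number
    of continuation bytes still expected, with the final state merged
    into the initial state; overlong forms and surrogates are not
    validated here, which is precisely why a fully validating decoder
    needs the extra states counted above.

        #include <stdint.h>

        typedef struct {
            int      pending;   /* continuation bytes still expected;
                                   0 is the merged initial/final state */
            uint32_t codepoint; /* code point accumulated so far */
        } Utf8Decoder;

        /* Feed one byte; returns the decoded code point, -1 while in
           the middle of a sequence, or -2 on an ill-formed byte. */
        static int32_t utf8_feed(Utf8Decoder *d, uint8_t b)
        {
            if (d->pending == 0) {          /* expect a lead byte */
                if (b < 0x80) return b;     /* ASCII: stay in state 0 */
                if ((b & 0xE0) == 0xC0) { d->pending = 1; d->codepoint = b & 0x1F; return -1; }
                if ((b & 0xF0) == 0xE0) { d->pending = 2; d->codepoint = b & 0x0F; return -1; }
                if ((b & 0xF8) == 0xF0) { d->pending = 3; d->codepoint = b & 0x07; return -1; }
                return -2;                  /* stray continuation or invalid lead */
            }
            if ((b & 0xC0) != 0x80) {       /* not a continuation byte */
                d->pending = 0;
                return -2;
            }
            d->codepoint = (d->codepoint << 6) | (b & 0x3F);
            if (--d->pending == 0)          /* sequence complete: back to state 0 */
                return (int32_t)d->codepoint;
            return -1;
        }

    A UTF-32LE decoder written the same way needs only the 4 states
    mentioned above: a byte counter from 0 to 3, with state 0 doubling
    as the final state.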

    Actually, when you have to manage invalid byte sequences, you need
    further states to handle resynchronization (to avoid generating
    multiple replacement characters in the decoder): you need more
    states to resynchronize UTF-8 than UTF-16BE/UTF-16LE or
    UTF-32BE/UTF-32LE. So in the end, even UTF-16BE and UTF-16LE win on
    resynchronization.
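
    Here is a sketch of that resynchronization step (again my own
    simplified illustration, without bounds handling at the very end of
    a buffer): after an ill-formed sequence, a UTF-8 decoder has to
    scan forward past continuation bytes to find the next possible
    lead byte, whereas a UTF-16 or UTF-32 decoder just steps to the
    next fixed-width code unit boundary, with no extra scanning state.

        #include <stddef.h>
        #include <stdint.h>

        /* UTF-8: skip to the next candidate lead byte so that only one
           U+FFFD is emitted for the whole ill-formed run. */
        static size_t utf8_resync(const uint8_t *buf, size_t i, size_t len)
        {
            while (i < len && (buf[i] & 0xC0) == 0x80)
                i++;            /* swallow continuation bytes 0x80..0xBF */
            return i;
        }

        /* UTF-16/UTF-32: resynchronization is just "advance one code
           unit"; no scanning loop and no extra states are needed. */
        static size_t utf16_resync(size_t i) { return i + 2; }
        static size_t utf32_resync(size_t i) { return i + 4; }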

    Clearly, UTF-16BE and UTF-16LE are the simplest encodings, with the
    fewest states; they will probably be more secure and definitely
    faster to process for very large volumes at high rates (such as in
    memory). As UTF-16 also roughly halves the memory footprint
    compared to UTF-32, it will not invalidate the data caches as often
    as UTF-32, where that matters. This is important for new massively
    parallel architectures with many cores and little local memory,
    because a fast cache is expensive and therefore small, and the
    communication buses are much slower (even more critical with
    storage I/O or network I/O).
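
    The factor of two is easy to check with some back-of-the-envelope
    arithmetic (a sketch, assuming text dominated by BMP characters):

        #include <stddef.h>
        #include <stdint.h>

        /* Encoded size in bytes of one code point under each UTF. */
        static size_t utf8_bytes(uint32_t cp)
        {
            return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        static size_t utf16_bytes(uint32_t cp)
        {
            return cp < 0x10000 ? 2 : 4;  /* surrogate pair above the BMP */
        }
        static size_t utf32_bytes(uint32_t cp)
        {
            (void)cp;
            return 4;                     /* always fixed width */
        }

    For any text made only of BMP characters, utf16_bytes is exactly
    half of utf32_bytes, which is the cache-footprint argument above.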

    If UTF-8 was chosen, it was because it was a good compromise with
    today's networking technologies. But when tomorrow we have
    gigabit/s transfer rates on the Internet, so that the network is
    almost as fast as local storage or processing memory, the
    difference in speed will shrink until it becomes insignificant
    (what will really count in networking is less the volume and the
    data rate than the number of exchanges needed between two nodes on
    the net: protocols will need to be smarter and include more caches
    to reduce the number of long-distance requests). In that case the
    encoding will really not matter, except for storage in the most
    local caches (think then about bus I/O contention, where UTF-8 will
    be quite bad compared to UTF-16BE/LE due to data alignment
    constraints, even if UTF-32BE/LE will probably remain poorer for
    quite a long time...)

    Now suppose that tomorrow 32-bit or 64-bit computing becomes
    universal, and processors only support 32-bit or 64-bit accesses
    (octet access will no longer be supported as a single operation,
    except through bitfield operations, because the addressable byte
    will be 32 or 64 bits wide, to reduce the complexity of bus I/O
    contention caused by misaligned data). In that case UTF-32BE/LE
    will probably win: such processors will probably be little-endian
    (this has become the standard, given that there are no longer any
    new BE processors), and the best encoding will be UTF-32LE.

    Note that the C and C++ standards only require that the char
    datatype be the smallest addressable unit of memory; they do not
    say that "char" MUST be exactly 8-bit.


