Re: HTML5 encodings (was: Re: BOCU patent)

From: verdy_p (verdy_p@wanadoo.fr)
Date: Sun Dec 27 2009 - 17:26:56 CST

    "Doug Ewell" wrote:
    > "verdy_p" wrote:
    > > If I look at UTF-32BE or UTF-32LE, it has only 4 states (you have to
    > > merge the final states with the initial state). Mixing them and
    > > supporting the optional BOM requires adding 3 other states so you have
    > > finally 11 states for UTF-32. With UTF-8 you have only 10 states (if you
    > > count them for each possible length, and merge the final states with
    > > the initial state), one less than UTF-32. So UTF-8 still wins : it is
    > > LESS stateful than UTF-32...
    >
    > Usually, at least on this list, the transient information needed while
    > parsing multiple bytes into a single code point isn't thought of as
    > "state." When you parse multiple bytes into an integer value of some
    > sort, and still have to apply additional knowledge to turn THAT into a
    > code point (as in ISO 2022 or UTF-16), that is state.

    I disagree: I am not counting the additional integer state variables needed for storing the semantic values of
    the first bytes in a sequence, but purely the enumerated states of the finite-state automaton needed to
    parse the stream.
    Clearly UTF-16LE (or BE) wins. Unfortunately, HTML5 now absolutely WANTS the BOM to be encoded (meaning that a
    decoder has to handle its states and verify them), and this means that UTF-16BE and UTF-16LE are disqualified.
    You therefore have to handle the BOM and both the BE and LE alternatives of UTF-16 in the same automaton. Who
    said that HTML5 wanted to promote the simplest implementations?

    UTF-32 does not even need a BOM (because its byte order is revealed by the position of the NUL bytes). The bad
    thing is that it is clearly a waste of space, and UTF-32 does not work at all within null-terminated C/C++ strings
    (the same is true for UTF-16 in its three variants, because the code units are 16 bits wide, and either byte of a
    code unit can be null).
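
    As an illustration only (this sketch is not part of the original message), the byte order of BOM-less UTF-32 can
    often be inferred as follows in Python: no code point exceeds U+10FFFF, so the most significant byte of every
    32-bit code unit is 0x00. The function name guess_utf32_byte_order is hypothetical, and a code unit whose first
    and last bytes are both zero decides nothing on its own.

```python
def guess_utf32_byte_order(data: bytes) -> str:
    """Guess the byte order of BOM-less UTF-32 data (illustrative sketch).

    Returns "big", "little", or "unknown" when no code unit is decisive.
    """
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    for i in range(0, len(data), 4):
        first, last = data[i], data[i + 3]
        if first and last:
            # The most significant byte must be zero in either byte order.
            raise ValueError("not valid UTF-32 in either byte order")
        if last:
            return "big"      # byte 0 is the zero high-order byte
        if first:
            return "little"   # byte 3 is the zero high-order byte
    return "unknown"
```

    A code unit such as U+0100 in big-endian order (00 00 01 00) is ambiguous by itself, so the scan simply moves on
    to the next unit.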

    But null-terminated strings are really considered unsafe datatypes anyway: they add their own complication, since
    you must handle BOTH the null termination of strings AND the allocated length of buffers, when a single
    representation of the length would be enough. Pascal-style strings are safe (but limited in length, if the length
    is stored as a single byte). Java-style strings are safe and do not require any specific handling of the null byte
    (which can be eliminated very early, in the input stream decoder, since NUL bytes are already invalid in conforming
    documents, including all HTML4, HTML5, and XML, when using a multibyte decoder for encodings that do not use the
    NUL byte to represent code units larger than 8 bits).

    BOCU-1 is also compatible with the safe encoding requirement (where NUL bytes are rejected to avoid security
    issues related to unexpected string truncation, for example in SQL queries where the encoded string could be
    injected, if using embedded strings instead of the variable binding mechanism).
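
    A minimal sketch of both precautions, using Python's standard sqlite3 module (the function names and the
    "comments" table are hypothetical, chosen only for this illustration): NUL is rejected right after decoding, and
    the text reaches the SQL engine through variable binding rather than by being embedded in the query string.

```python
import sqlite3

def decode_rejecting_nul(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode input and refuse U+0000, as a conforming HTML/XML decoder may."""
    text = raw.decode(encoding)
    if "\x00" in text:
        raise ValueError("U+0000 is not allowed in this input")
    return text

def store_comment(conn: sqlite3.Connection, text: str) -> None:
    """Insert text with variable binding; it is never spliced into the SQL."""
    conn.execute("INSERT INTO comments(body) VALUES (?)", (text,))
```

    With binding, quote characters in the text are stored as data and cannot change the structure of the statement.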

    > > Clearly, UTF-16BE and UTF-16LE are the simplest encodings, with less
    > > states, it will probably be more secure and definitely faster to
    > > compute for very large volumes at high rates (such as in memory).
    >
    > Because of the surrogate mechanism, there is no way I personally would
    > consider UTF-16 to be "simpler" than UTF-32. In the best case, it is
    > "as simple as" UTF-32. It has other advantages, mostly related to size,
    > but simplicity over UTF-32 is not one of them.

    Really, you can't make a distinction between states like you do here. A state is a state (in terms of finite-state
    automata: it is NOT an integer value but an arbitrarily numbered enumerated value). And all multibyte encodings need
    at least one additional integer value to store the weighted values of the previous bytes composing a sequence, or
    at least a small buffer for storing these bytes, which will be decoded only at the end of the operation that
    separates the byte sequences.
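
    The distinction can be made concrete with a small Python sketch of a UTF-8 decoder (an illustration under
    simplifying assumptions, not the automaton counted above and not ICU's code): State is the enumerated automaton
    state, while acc is the separate integer that accumulates the bits of the bytes already seen. To keep the
    enumeration short, overlong forms and surrogates are checked on the accumulated value instead of being given
    dedicated states.

```python
from enum import Enum

class State(Enum):
    START = 0    # expecting a lead byte
    NEED_1 = 1   # expecting 1 more continuation byte
    NEED_2 = 2   # expecting 2 more continuation bytes
    NEED_3 = 3   # expecting 3 more continuation bytes

# Smallest code point that legitimately needs a sequence of each length.
_MIN_VALUE = {2: 0x80, 3: 0x800, 4: 0x10000}

def decode_utf8(data: bytes) -> list:
    """Decode UTF-8 into a list of code points, raising ValueError on errors."""
    state, acc, length, out = State.START, 0, 0, []
    for b in data:
        if state is State.START:
            if b < 0x80:
                out.append(b)
            elif 0xC2 <= b <= 0xDF:
                acc, length, state = b & 0x1F, 2, State.NEED_1
            elif 0xE0 <= b <= 0xEF:
                acc, length, state = b & 0x0F, 3, State.NEED_2
            elif 0xF0 <= b <= 0xF4:
                acc, length, state = b & 0x07, 4, State.NEED_3
            else:
                raise ValueError("invalid lead byte 0x%02X" % b)
        else:
            if b & 0xC0 != 0x80:
                raise ValueError("expected a continuation byte")
            acc = (acc << 6) | (b & 0x3F)
            state = State(state.value - 1)
            if state is State.START:
                if acc < _MIN_VALUE[length]:
                    raise ValueError("overlong sequence")
                if acc > 0x10FFFF or 0xD800 <= acc <= 0xDFFF:
                    raise ValueError("code point not allowed in UTF-8")
                out.append(acc)
    if state is not State.START:
        raise ValueError("truncated sequence")
    return out
```

    Moving the overlong and surrogate checks into the automaton itself is what raises the number of enumerated states.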

    Separating the byte sequences that make up a single character is definitely simpler for UTF-16 (even with the
    surrogates, which are easy to pair and count directly within the finite-state automaton) than for UTF-8. The
    relative source code size of the decoder for the three UTF-16 variants in ICU is even smaller than for UTF-8 (this
    is a good indicator as well for code correctness and its security, as longer code requires more complex tests to
    reach full code coverage). It is also much simpler to reject valid sequences representing forbidden characters with
    UTF-16 than with UTF-8 (considering the subset of UCS characters allowed in HTML4, HTML5, XHTML, XML, CSS,
    JavaScript...).
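
    For comparison, here is a Python sketch of surrogate pairing in a UTF-16 decoder (illustrative only; the function
    name decode_utf16 and its byteorder parameter are assumptions, and BOM handling is left out). The only thing to
    remember between code units is a pending high surrogate.

```python
def decode_utf16(data: bytes, byteorder: str = "little") -> list:
    """Decode BOM-less UTF-16 into code points, raising ValueError on errors."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must be a multiple of 2 bytes")
    out = []
    high = None  # pending high surrogate, if any
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], byteorder)
        if high is not None:
            if not 0xDC00 <= unit <= 0xDFFF:
                raise ValueError("high surrogate not followed by a low surrogate")
            out.append(0x10000 + ((high - 0xD800) << 10) + (unit - 0xDC00))
            high = None
        elif 0xD800 <= unit <= 0xDBFF:
            high = unit
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(unit)
    if high is not None:
        raise ValueError("truncated surrogate pair")
    return out
```

    A filter for characters forbidden by a given document format could be applied to each value just before it is
    appended to the output.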

    Using BOM-less UTF-32 (with the local native byte ordering, as it will just use 32-bit code units as a whole instead
    of multiple bytes, aligned in memory for performance reasons) will still remain less efficient than UTF-16.
    Note that applications cannot consider only isolated code units when processing characters: very often, they have
    to consider sequences of characters, for handling string normalization or just because a single character is not
    meaningful enough linguistically.

    Defective sequences are of no use in actual data, since applications need to work at least at the level of
    grapheme clusters. At this level, everything has variable length (independently of the encoding chosen). What
    is chosen at the single-character level is not relevant for application design and its security. We are always
    speaking about how to handle variable-length strings, and it is most often at this level that security issues
    appear: the fact that the text uses UTF-16 (for compactness, and so for faster processing with higher efficiency
    of data caches, as long as memory remains addressable at least at every 16-bit boundary) or UTF-32 will not change
    this.
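
    A short Python sketch of the point (illustrative; approx_clusters is a hypothetical helper that only counts
    non-combining code points and is NOT a UAX #29 grapheme cluster segmenter): one user-perceived character can span
    several code units in every UTF, including UTF-32.

```python
import unicodedata

def code_units(text: str) -> dict:
    """Number of code units needed for text in each UTF."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")) // 2,
        "utf-32": len(text.encode("utf-32-le")) // 4,
    }

def approx_clusters(text: str) -> int:
    """Crude estimate: count code points whose combining class is 0."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# "q" followed by COMBINING DOT ABOVE and COMBINING DOT BELOW:
# a single cluster for the reader, with no precomposed form.
sample = "q\u0307\u0323"
```

    Here code_units(sample) reports 5, 3 and 3 code units for UTF-8, UTF-16 and UTF-32, while approx_clusters(sample)
    reports 1, so fixed-width code units do not give fixed-width text elements.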

    UTF-8 will still keep its advantages for transmission in heterogeneous environments like networks and storage
    (including storage on network services like database servers, or on removable media and mounted filesystems, both
    of which can be used directly by external applications, including alternate tools for purely local administration),
    but only because it is independent of byte ordering (it is really poor for Asian texts), and only when it is
    used for storing relatively small documents. But for massive storage of many texts or very large texts, UTF-8 will
    remain quite poor: you'll still need an external compressor. For their transmission over a relatively slow network
    like an Internet link, you'll still use classic binary compressors (either within an archive file format, or within
    the transmission protocol).
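
    The size argument can be checked with a few lines of Python (a sketch only: encoded_sizes is a hypothetical helper,
    zlib stands in for the "classic binary compressor", and BOCU-1 and SCSU are not covered because the Python standard
    library has no codecs for them).

```python
import zlib

def encoded_sizes(text: str) -> dict:
    """Map each encoding name to (raw size, zlib-compressed size) in bytes."""
    sizes = {}
    for name in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(name)
        sizes[name] = (len(raw), len(zlib.compress(raw)))
    return sizes

# Three CJK ideographs repeated; such a repetitive sample compresses far
# better than real text would, so only the raw sizes are representative.
sample_cjk = "\u7d71\u4e00\u78bc" * 100
```

    For text made of BMP ideographs such as sample_cjk, the raw UTF-8 form is 3 bytes per character against 2 bytes in
    UTF-16, that is, 50% larger before any compression is applied.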

    Asian users will hate HTML5 if it forbids them from using BOCU-1 or SCSU and forces them to use the costly UTF-8
    encoding (it's possibly not a problem in Japan or South Korea, where the Internet speed is much higher than in the
    rest of the world, but billions of users in China and India will hate HTML5: one third of all of humanity, isn't
    that important enough?). Now consider users in Russia, in South-Eastern Europe and in the Middle East. Their
    presence on the Internet is also very developed, but HTML5 will not be for them. Clearly HTML5 is extremely
    biased in favor of countries that mostly speak English only (or some languages that use a Latin alphabet with
    a relatively low usage of non-ASCII characters) and that are still quite late at delivering decent Internet speed
    to their whole territory, because they have very large rural areas with low population density (this includes the
    USA and Canada, but also Brazil, and in fact many European countries as well, except the smallest ones without
    complicated geographies, like the Benelux).

    Most small island countries that also have very slow or costly Internet use English or French. They won't be
    impacted much by the encoding bias chosen in HTML5. But they represent a very tiny market and a small population.

    Ignoring the Middle East, China, India, Russia, Thailand, Indonesia (at least) is a severe error. HTML5 is just
    saying to them: simply don't use any Unicode-based encoding, keep your existing national encodings (or use one of
    the legacy Windows encodings). I really think it is stupid to forbid any Unicode-based encoding in HTML5.

    In fact I would have much preferred to see HTML5 forbid all encodings that are not Unicode-based, or that are not
    free of patents, or that may be mapped ambiguously: this would have meant forbidding the use of Windows encodings,
    US-ASCII, ISO 8859 encodings, BOCU-1, ISCII, VISCII, GB2312 and GB18030, TIS-620, and all EBCDIC variants, as well
    as various PC codepages made by IBM, Microsoft, Apple, Adobe, ... This would have really saved a lot of
    programmers' time.

    There's absolutely no time lost when accepting all the standard UTFs or SCSU, as this effort will benefit ALL
    and will allow interesting alternatives for specialized environments or computing architectures where alternate
    encodings could be better. The rejection of SCSU in HTML5 is completely stupid and counter-productive.


