Re: HTML5 encodings

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Dec 27 2009 - 18:10:55 CST

  • Next message: David Starner: "Re: HTML5 encodings"

    It has been suggested that the number of states needed in a finite state
    machine is a useful metrics in comparing encoding schemes. I suggest a
    different point of differentiation that has more widespread practical
    importance.

    Legacy double byte encodings require a very small number of machine
    states to decode, but anyone who has experience with these encodings
    knows how difficult they can be to work with in practice. Clearly, it is
    then not the size of the finite state machine that matters.

    What distinguishes the UTFs from encodings such as DBCS and ISO-2022 are
    two other metrics:
    1. the number of input bytes needed from random access to determine the
    state
    2. the number of input bytes for which state information is to be maintained

    These two metrics are similar, but not identical.

    The first speaks directly to the topic of resynchronization. In the
    legacy DBCS encodings, certain byte values satisfied the conditions to
    be either a leading byte or a trailing byte. The encoding scheme imposes
    no limit on the length of runs of such bytes, making resynchronization,
    in the worst case, the same as re-reading the data stream from the
    start. Compare that to the UTFs where the worst case requires examining
    4 bytes to resynchronize. (4 bytes allows error checking and
    identification of the character, while three bytes is enough to
    calculate where the adjacent character boundaries must be in well-formed
    input).

    The second metric refers to encodings like ISO-2022 or SCSU which use
    control bytes or sequences switch among character sets. There are cases,
    where such as scheme could be set up to allow easy resynchronization in
    terms of character boundaries, yet still require that state information
    be maintained for very long (unbounded) stretches of data. Assume 2022
    style combination of several single byte character sets. If that
    restriction is known (by announcement), then resynchronizing to any
    character boundary is trivial (as long as you recognize and avoid the
    escape codes). However, interpreting (or correctly converting) any given
    character is impossible without going back to the most recent character
    set switching escape code.

    Except in the limiting case of a straight-through converter, that is, in
    most practical architectures, there are many situations where these two
    metrics describe quite well how "difficult" an encoding is.

    A./



    This archive was generated by hypermail 2.1.5 : Sun Dec 27 2009 - 18:13:47 CST