RE: Stateful?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue May 27 2008 - 09:38:34 CDT


    Jeroen Ruigrok van der Werven wrote:
    > Say you have defined a stateful object; that object can tell
    > you about its datatype, probably memory use, and so on. Of
    > course, this all depends on what is defined to be the 'state'.

    Of course yes! UTF-8 (and also UTF-16) is a stateful encoding because you
    have to remember the state of the previous leading bytes (or leading high
    surrogate in UTF-16) when a non-leading byte (or non-leading low surrogate
    in UTF-16) occurs.
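
    A minimal Python sketch of that decoder state, assuming well-formed input
    and the standard library's incremental UTF-8 decoder. The helper name
    max_pending_bytes is hypothetical, chosen for illustration only. It feeds
    the text one byte at a time and reports how many bytes the decoder had to
    remember at worst:

```python
import codecs


def max_pending_bytes(data: bytes) -> int:
    """Feed data byte by byte; return the most bytes the decoder ever buffered."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    worst = 0
    for i in range(len(data)):
        # A lead or continuation byte of an unfinished sequence yields no
        # output yet; the decoder keeps it as pending state.
        decoder.decode(data[i:i + 1])
        pending, _flags = decoder.getstate()
        worst = max(worst, len(pending))
    return worst


if __name__ == "__main__":
    # "a" is 1 byte, the euro sign is 3 bytes, the emoji is 4 bytes.
    print(max_pending_bytes("a\u20ac\U0001F600".encode("utf-8")))
```

    For well-formed UTF-8 the result never exceeds 3, since the longest
    sequence is 4 bytes and the fourth byte completes the character.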

    However, this state is bounded in UTF-8, unlike in ISO 2022, where you may
    need to remember the state over an unlimited distance from where it was set
    by a prior code.

    "Stateful" is not a particularly useful distinction for encodings. In fact,
    almost everything we handle is stateful (starting at least at the bit level:
    you need to keep the state of some other prior bits to recognize distinct
    codes for distinct characters).

    What is more productive, when speaking about encodings, is the minimum
    distance (in terms of volume, or time of transmission...) at which the state
    is fully defined, because it also conditions other things, notably:
      - the resistance to transmission errors, or the recoverability from such
    errors,
      - the searchability from arbitrary position in the middle of text: how
    much do you have to read backward from an arbitrary position in order to be
    sure to decode the rest of the text correctly with all the needed decoding
    state variables correctly defined unambiguously?
      - can you predict this backward distance in a limited set of read
    operations?
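
    For UTF-8 and UTF-16 the backward distance asked about in the list above
    can be sketched as follows. This is only an illustration assuming
    well-formed input; the function names utf8_char_start and utf16_unit_start
    are hypothetical. From any position, at most 3 bytes (UTF-8) or 1 code
    unit (UTF-16) have to be reread:

```python
def utf8_char_start(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the UTF-8 sequence containing pos."""
    start = pos
    # Continuation bytes have the form 0b10xxxxxx. A well-formed sequence has
    # at most 3 of them, so the loop is bounded no matter how long buf is.
    while start > 0 and pos - start < 3 and (buf[start] & 0xC0) == 0x80:
        start -= 1
    return start


def utf16_unit_start(units, pos: int) -> int:
    """Same idea for a sequence of UTF-16 code units: at most one step back."""
    is_low = 0xDC00 <= units[pos] <= 0xDFFF
    if pos > 0 and is_low and 0xD800 <= units[pos - 1] <= 0xDBFF:
        return pos - 1
    return pos


if __name__ == "__main__":
    data = "a\u20ac\U0001F600z".encode("utf-8")
    start = utf8_char_start(data, 6)   # position 6 is inside the 4-byte emoji
    print(start, data[start:].decode("utf-8"))
```

    Decoding can then resume at the returned index with no other state needed.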

    UTF-8 and UTF-16 satisfy the three conditions above with a bounded, small,
    and fully predictable number of operations (requiring a fully predetermined
    finite set of state variables), whereas ISO 2022 does not offer the same
    features. Even though it requires only a finite set of state variables, it
    does not offer full predictability for searches from an arbitrary position
    in large texts, so its processing is "almost necessarily" sequential only,
    unless you use some heuristic "guessing", similar to the one web browsers
    use to guess the encoding of a web page that lacks explicit metadata
    specifying it, and you are prepared to accept false guesses, errors, or the
    need to redecode the same text starting from several other positions to see
    which makes the most "sense" for your application.
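
    The contrast can be illustrated with ISO-2022-JP, one profile of ISO 2022,
    using Python's "iso2022_jp" codec. This is a rough sketch under stated
    assumptions: only the 3-byte escape sequences of plain ISO-2022-JP occur,
    and the helper names last_escape_before and decode_from are hypothetical.
    Finding the active character set means scanning backward for the last ESC
    byte, and nothing bounds how far away that byte is:

```python
ESC = 0x1B


def last_escape_before(data: bytes, pos: int) -> int:
    """Scan backward from pos for the most recent ESC byte; return -1 if none."""
    i = min(pos, len(data) - 1)
    while i >= 0 and data[i] != ESC:
        i -= 1          # the number of steps depends on the text, not on a constant
    return i


def decode_from(data: bytes, pos: int) -> str:
    """Decode from about pos after recovering the shift state by a backward scan."""
    esc = last_escape_before(data, pos)
    if esc < 0:
        return data[pos:].decode("iso2022_jp")   # initial state is ASCII
    designation = data[esc:esc + 3]              # assumes a 3-byte escape sequence
    body = esc + 3
    pos = max(pos, body)
    if designation[1:2] == b"$":                 # double-byte set: realign to a pair
        pos -= (pos - body) % 2
    return (designation + data[pos:]).decode("iso2022_jp")


if __name__ == "__main__":
    text = "\u65e5\u672c\u8a9e" * 1000           # 3000 double-byte characters
    data = text.encode("iso2022_jp")
    naive = data[3000:].decode("iso2022_jp")     # no state: bytes are read as ASCII
    print(naive == text[-len(naive):])           # the naive decode is wrong
    print(last_escape_before(data, 3000))        # scan went all the way back
    print(decode_from(data, 3000) == text[1498:])
```

    Here the only escape sequence before the chosen position sits at the very
    start of the data, so the backward scan grows with the length of the text.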


