RE: Stateful?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed May 28 2008 - 07:03:09 CDT


    Kenneth Whistler
    > John Jenkins said:
    > > > UTF-16, after all, is stateful: if you lose the BOM, things can look
    > > very different.
    >
    > That is true of the UTF-16 encoding *scheme*. (See TUS 5.0,
    > D98, p. 106.)

    No, I included the encoding *forms* as well, without reference to the
    byte order, but just the relative order of code units.

    The BOM is another case where you need *another* state variable. But even in
    encoding schemes without any BOM, you need a state variable to parse the
    encoded text. This is true for all encodings that are handling streams of
    code units or streams of bytes, and in fact any data stream where there's
    always a relative order needed to interpret them (at least up to the level
    of bits, or discrete numeric symbols in communication and transport
    devices).
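
    As a minimal sketch of this point for the UTF-16 encoding *form* (no bytes
    and no BOM involved, only 16-bit code units in order), the Python decoder
    below has to remember a pending high surrogate between two code units. The
    function name and the choice to emit U+FFFD for unpaired surrogates are
    assumptions made for this illustration, not something the standard or the
    thread prescribes.

```python
def decode_utf16_units(units):
    """Decode an iterable of 16-bit code units into a list of code points.

    The only state kept between units is `pending_high`: either None or a
    high surrogate that is still waiting for its low surrogate.
    """
    code_points = []
    pending_high = None  # the decoder's state variable
    for unit in units:
        if pending_high is not None:
            if 0xDC00 <= unit <= 0xDFFF:
                # Combine the surrogate pair into one supplementary code point.
                code_points.append(
                    0x10000 + ((pending_high - 0xD800) << 10) + (unit - 0xDC00)
                )
                pending_high = None
                continue
            # Unpaired high surrogate (assumed policy): emit U+FFFD, then
            # fall through and process the current unit normally.
            code_points.append(0xFFFD)
            pending_high = None
        if 0xD800 <= unit <= 0xDBFF:
            pending_high = unit  # remember it until the next unit arrives
        elif 0xDC00 <= unit <= 0xDFFF:
            code_points.append(0xFFFD)  # lone low surrogate (assumed policy)
        else:
            code_points.append(unit)
    if pending_high is not None:
        code_points.append(0xFFFD)  # input ended in the middle of a pair
    return code_points
```

    The same low surrogate unit yields a different result depending on what
    came before it, which is exactly the dependence on relative order
    described above.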

    The *only* level that is stateless in Unicode is the level of the stream of
    *code points*, which have their own distinctive identity and their own
    properties independently of their context, but may still be given some
    additional inferred properties from the context (such as the effective
    directionality for characters with weak or neutral directionality). Code
    points exist as indivisible single points in a well-defined discrete space
    (i.e. as elements in a finite set), and they don't have any "length" or
    "current state": their cardinality is always 1, since each one encodes only
    its own existence.

    Their effective representation (as an integer or as a boolean bitset
    with just one bit set to one) is not relevant, and neither is their relative
    order (the encoding space itself has no dimension, even if it is enumerable
    and effectively defined along with a normative enumeration that maps code
    points to integers, without saying that they are integers themselves, as
    code points have *no* defined arithmetic behavior except in small subsets of
    the space for some applications).

    Of course, to handle code points in computers, you need at least an encoding
    form (at the interface level) or scheme (for the effective storage or
    transmission). Such a mapping (encoding and decoding) always requires a
    stateful operation, even for the simplest UTF-32BE or UTF-32LE encoding
    schemes. The right question to ask is where the state variable resides: in
    the encoder/decoder themselves, not anywhere in the data stream of code
    units, bytes or bits. Such a state variable gets set in operations known as
    "I/O" operations: all these operations are ordered (in processing time, or
    storage address, or relative position in the stream).
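
    To illustrate where that state lives, here is a rough Python sketch of a
    byte-at-a-time UTF-32BE decoder. The class and function names are
    hypothetical, and no range or surrogate validation is attempted. The byte
    stream itself carries no marker; the position counter exists only inside
    the decoder object.

```python
class Utf32BeDecoder:
    """Byte-at-a-time UTF-32BE decoder (hypothetical helper, no validation)."""

    def __init__(self):
        self.position = 0  # bytes of the current code unit seen so far: 0..3
        self.value = 0     # partial code unit accumulated so far

    def feed(self, byte):
        """Consume one byte; return a code point when a unit completes, else None."""
        self.value = (self.value << 8) | byte
        self.position += 1
        if self.position < 4:
            return None
        # Fourth byte received: emit the code point and reset the state.
        code_point, self.value, self.position = self.value, 0, 0
        return code_point


def decode_utf32be(data):
    """Decode a bytes object by feeding it to the decoder one byte at a time."""
    decoder = Utf32BeDecoder()
    results = (decoder.feed(b) for b in data)
    return [cp for cp in results if cp is not None]
```

    Dropping a single leading byte from the input shifts every later code
    unit, because the decoder's position counter no longer lines up with the
    sender's.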

    Saying that any encoding scheme or form is stateless is completely false:
    all you can say is that some representations require *fewer* free state
    variables than others, but you absolutely cannot exclude *all* state
    variables. As a consequence, *all* Unicode encoding schemes or forms are
    stateful (the only exception is the UTF-32 encoding form when working with
    it at the interface level, because this is the only standardized
    representation that uses a bijective one-to-one mapping between code points
    and numeric code units).

    You can also compare the various schemes or forms by the amount of space
    needed to store these state variables:
    * for UTF-8 streams of bytes without BOM, this space is a number from 0 to 3
    (so it requires two bits);
    * for UTF-8 streams of bytes with a possible BOM, you need another bit of
    state to represent the presence or absence of the leading BOM;
    * when working at the bit level, you need three more bits of state to
    represent the bit position and order in bytes;
    * you can do the same kind of analysis for the UTF-16 and UTF-32 encoding
    forms and schemes, but you'll also need a few more bits of state to
    represent the byte order and the relative position of bytes in streams in
    that order.
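
    The first item in the list can be sketched in Python as follows. This is
    only an illustration under stated assumptions: the input is taken to be
    well-formed UTF-8 without BOM (overlong forms and out-of-range values are
    not rejected), and the function name is made up. The counter `remaining`
    is the 0-to-3 state variable; `value` is extra working storage that this
    particular sketch uses to assemble the code point.

```python
def decode_utf8(data):
    """Decode well-formed UTF-8 bytes into a list of code points.

    Assumes valid input apart from the basic checks below; a real decoder
    would also have to reject overlong forms and surrogate code points.
    """
    code_points = []
    remaining = 0  # continuation bytes still expected: 0, 1, 2 or 3
    value = 0      # bits of the code point gathered so far
    for byte in data:
        if remaining == 0:
            if byte < 0x80:
                code_points.append(byte)  # single-byte sequence
            elif byte >> 5 == 0b110:
                remaining, value = 1, byte & 0x1F
            elif byte >> 4 == 0b1110:
                remaining, value = 2, byte & 0x0F
            elif byte >> 3 == 0b11110:
                remaining, value = 3, byte & 0x07
            else:
                raise ValueError("unexpected byte 0x%02X" % byte)
        else:
            if byte >> 6 != 0b10:
                raise ValueError("expected a continuation byte")
            value = (value << 6) | (byte & 0x3F)
            remaining -= 1
            if remaining == 0:
                code_points.append(value)
    if remaining:
        raise ValueError("truncated sequence")
    return code_points
```

    For example, `decode_utf8("A\u20ac".encode("utf-8"))` walks the counter
    through 0, 2, 1, 0 while consuming the four bytes.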

    The number of state variables needed is zero *only* for the standardized
    UTF-32 encoding form (or for any non-standard encoding forms that represent
    code points in any numeric representation capable of storing about 21 bits of
    information with at least 17×2^16 distinct values). Some state variables are
    implied by the hardware architecture handling the representation and cannot
    be changed easily at the software level without costly conversion operations
    (such as bit reordering), but they do not "disappear": they are effectively
    implemented by the computing host. When working at the level of encoding
    *schemes* (not *forms*), these variables are always present and must be
    supported by the software handling them, in order to have or rebuild easily
    usable code units.
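
    A small Python sketch of that software support for the UTF-16 encoding
    *scheme*: the byte order is a state variable set (or not) by the leading
    BOM, and it decides how every following pair of bytes is rebuilt into a
    code unit. The function name is hypothetical, and the big-endian default
    when no BOM is present is an assumption of this sketch.

```python
def utf16_scheme_to_units(data):
    """Turn UTF-16 encoding-scheme bytes into a list of 16-bit code units.

    Assumption for this sketch: big-endian is used when no BOM is present.
    """
    byte_order = "big"  # state variable: byte order in effect
    if data[:2] == b"\xff\xfe":
        byte_order, data = "little", data[2:]  # BOM says little-endian
    elif data[:2] == b"\xfe\xff":
        data = data[2:]                        # BOM says big-endian
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    return [
        int.from_bytes(data[i:i + 2], byte_order)
        for i in range(0, len(data), 2)
    ]
```

    With the BOM, `b"\xff\xfeA\x00"` gives the single code unit 0x0041;
    with the BOM lost, the remaining bytes `b"A\x00"` give 0x4100 instead.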

    So there exist absolutely NO "stateless" encoding schemes. Another way to
    say it: ALL encoding *schemes* are "stateful", even if you don't immediately
    perceive the effective need of these state variables, and even if these
    variables are very few, extremely small and simple to handle in software
    (but sometimes handling them can be quite costly in terms of application
    performance or needed computing resource, especially when it requires bit
    reordering).


