Re: Stateful encoding mechanisms

From: Dean Snyder
Date: Fri May 20 2005 - 13:53:37 CDT

    Tim Greenwood wrote at 1:24 PM on Friday, May 20, 2005:

    >On 5/19/05, Dean Snyder wrote:
    >> Well that, of course, depends on how you define state, acknowledgment of
    >> which, I presume, is related to both your qualified dissension and your
    >> use of quotes around the word "state" here.
    >While I do not agree that your definition of state matches that
    >commonly accepted, it is a coherent argument.

    The surrogate mechanism and its UTF-8 analog are SELF-BOUNDING state
    mechanisms; whereas ones like the bidi mechanism are OTHER-BOUNDING.
    They are both stateful in that they exhibit co-dependency across atoms
    (code units).

    >However if you make that
    >argument then you must address Ken's other point. You criticise the
    >use of 'stateful' code units in UTF-16, yet do not do the same for
    >UTF-8. Why not? The structure of both is very similar.

    No particular reason, other than I consider it a side-stepping
    distraction from the discussion of surrogates.

    But, of course, the UTF-8 mechanism makes the same point I am making for
    UTF-16; in fact, it makes it even more strongly. The fact that you may
    have to backtrack anywhere from one to three code units in order to
    interpret code unit sequences in UTF-8 makes it more fragile under
    fragmentation than UTF-16: the stateful mechanism is spread over up to
    twice as many code units (four rather than two).
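
    A minimal Python sketch of the backtracking described above. The
    function names and the sample string are illustrative assumptions, not
    anything from the Standard; the code assumes well-formed input and
    only locates the start of a sequence, it does not validate it.

```python
def utf8_char_start(buf, i):
    """Index of the lead byte of the UTF-8 sequence that contains buf[i]."""
    steps = 0
    # Continuation bytes look like 10xxxxxx; step back over at most three.
    while i > 0 and steps < 3 and (buf[i] & 0xC0) == 0x80:
        i -= 1
        steps += 1
    return i


def utf16_char_start(units, i):
    """Index of the first code unit of the UTF-16 sequence containing units[i]."""
    is_low = 0xDC00 <= units[i] <= 0xDFFF
    # A low surrogate belongs to the high surrogate just before it, if any.
    if is_low and i > 0 and 0xD800 <= units[i - 1] <= 0xDBFF:
        return i - 1
    return i


text = "a\U00010000b"                 # U+10000 needs 4 UTF-8 bytes, 2 UTF-16 units
utf8 = text.encode("utf-8")           # bytes 61 F0 90 80 80 62
raw16 = text.encode("utf-16-le")
utf16 = [int.from_bytes(raw16[j:j + 2], "little") for j in range(0, len(raw16), 2)]

print(utf8_char_start(utf8, 4))       # from the last byte of U+10000: three steps back
print(utf16_char_start(utf16, 2))     # from the low surrogate: one step back
```

    Starting from an arbitrary offset, the UTF-8 routine may look at up to
    three earlier code units, the UTF-16 routine at most one.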

    As the Unicode Standard (section 2.5) says regarding multiple code units
    for single characters - "This property [self-synchronization] has
    another very important implication: corruption of a single code unit
    corrupts only a single character; none of the surrounding characters are
    affected."

    That, of course, is the ingenuous sheep's clothing; the wolf inside the
    sheep's clothing, however, is the complexity and its concomitant
    fragility.
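
    The self-synchronization property quoted above can be checked with a
    short Python sketch. The helper name, the sample string and the choice
    of 0xFF as the damaged value are assumptions made for illustration; the
    decoder's "replace" error handler substitutes U+FFFD for bytes it
    cannot interpret.

```python
def corrupt_and_decode(text, index, bad_byte=0xFF):
    """Encode text as UTF-8, overwrite one byte, and decode leniently."""
    data = bytearray(text.encode("utf-8"))
    data[index] = bad_byte
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return bytes(data).decode("utf-8", errors="replace")


original = "a\u00f1\u20ac\U00010000z"        # 1-, 2-, 3-, 4- and 1-byte characters
damaged = corrupt_and_decode(original, 4)    # byte 4 is the middle byte of U+20AC

print(damaged.startswith("a\u00f1"))         # characters before the damage survive
print(damaged.endswith("\U00010000z"))       # characters after the damage survive
print("\u20ac" in damaged)                   # the damaged character itself is lost
```

    Only the character containing the damaged code unit is lost, though the
    decoder needs the lead-byte and continuation-byte logic to establish
    that.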

    But, to return to one of my main points: when, in the future, we move
    to a monolithic 4-byte text encoding architecture, all of this becomes
    needless complexity, and none of this statefulness between code units
    and code points will exist.
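
    As a rough illustration of that point, here is a Python sketch using
    the existing fixed-width form, UTF-32, as a stand-in for such a 4-byte
    architecture (that substitution, and the function name, are
    assumptions). Every code point occupies exactly one code unit, so
    finding the n-th one is plain arithmetic with no backtracking.

```python
def code_point_at(utf32_le, n):
    """Return the n-th code point of little-endian UTF-32 data (no BOM)."""
    # Each code point is exactly four bytes, so its offset is simply 4 * n.
    return int.from_bytes(utf32_le[4 * n:4 * n + 4], "little")


data = "a\U00010000b".encode("utf-32-le")

print(len(data))                     # three code points, four bytes each
print(hex(code_point_at(data, 1)))   # the supplementary character, read directly
```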

    In such an era I suggest we refer to the text encoding atom as a
    "gulp" (as opposed to the current "byte" ;-)


    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
