Re: Stateful encoding mechanisms

From: Dean Snyder
Date: Thu May 19 2005 - 20:15:19 CDT

    Ken offers a qualified dissent, stating:

    >Surrogate pairs are *not* a stateful mechanism in the sense
    >that that term is generally applied to character encodings.

    then proceeds:

    >In UTF-16, 0xD800 does not set a "state" which then alters the
    >interpretation of a subsequent code unit. 0xDF02 has its own, unique
    >status, regardless of what precedes or follows it.

    Well, that of course depends on how you define state. I presume your
    acknowledgment of that lies behind both your qualified dissent and your
    use of quotes around the word "state" here.

    Let me make my case for the statefulness of surrogates more explicitly.

    If <0xD800 0xDF02> is interpreted differently than <0xD801 0xDF02>, then
    the high surrogate is altering the interpretation of 0xDF02, the low
    surrogate. I assert that that is stateful in the context of discussing
    fragment fragility. The issue is that the surrogate mechanism
    establishes a state: it takes two code units to determine a single
    supplementary code point, and if either code unit is missing, the
    remaining one cannot be interpreted as a character. This co-dependency
    crosses code unit boundaries, and that fact, from a fragment fragility
    perspective, makes the whole surrogate mechanism stateful.
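
    To make the two sequences above concrete, here is a minimal sketch of
    the UTF-16 pair arithmetic in Python. The function names are
    hypothetical, chosen only for illustration; the ranges and the formula
    are the usual UTF-16 surrogate decoding rules. It shows the same low
    surrogate, 0xDF02, yielding different code points depending on the
    high surrogate that precedes it.

```python
def is_high_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit lies in the high-surrogate range."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True if the 16-bit code unit lies in the low-surrogate range."""
    return 0xDC00 <= unit <= 0xDFFF


def decode_pair(high: int, low: int) -> int:
    """Combine a high and a low surrogate into one supplementary code point."""
    if not is_high_surrogate(high):
        raise ValueError(f"0x{high:04X} is not a high surrogate")
    if not is_low_surrogate(low):
        raise ValueError(f"0x{low:04X} is not a low surrogate")
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


first = decode_pair(0xD800, 0xDF02)
second = decode_pair(0xD801, 0xDF02)
print(f"<0xD800 0xDF02> -> U+{first:04X}")   # U+10302
print(f"<0xD801 0xDF02> -> U+{second:04X}")  # U+10702
```

    Note that is_low_surrogate(0xDF02) is true whatever precedes the code
    unit, so its role can be recognized in isolation, but decode_pair
    cannot produce a code point from it without the high surrogate.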

    By the way, can you tell us what the "unique status" of the code unit
    0xDF02 actually is? And if it has one, why is it not spelled out in the
    standard?

    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
