Re: Stateful encoding mechanisms

From: Dean Snyder (
Date: Fri May 20 2005 - 10:32:58 CDT

    Philippe Verdy wrote at 5:04 PM on Friday, May 20, 2005:

    >From: "Dean Snyder" <>
    >> By the way, can you indeed tell us what the "unique status" of the code
    >> unit 0xDF02 is? And if it has one, why it is not spelled out in the
    >> standard?
    >It is in the standard:
    >* the code unit 0xDF02 is a surrogate.
    >* the codepoint U+DF02 is permanently a non-character:
    >* there's no assigned character on U+DF02, it will never be assigned to
    >character by ISO/IEC 10646-1 or Unicode, because it is already bound to a

    This does not define any unique status for 0xDF02; instead it defines a
    status that 0xDF02 shares with all the other 1023 low surrogates. A
    strange definition of "unique" indeed.

    The interpretation of 0xDF02 is context-bound and that, by definition,
    makes its "status" multiple, and therefore non-unique. Contrary to what
    Ken has implied ["In UTF-16, 0xD800 does not set a 'state' which then
    alters the interpretation of a subsequent code unit"], the
    interpretation of 0xDF02 IS directly influenced by its preceding high
    surrogate. To put it another way, it is only the COMBINATIONS of high
    and low surrogates that yield unique results.
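
    The point about combinations can be made concrete with the standard
    UTF-16 surrogate-pair arithmetic. The sketch below is only an
    illustration; the function name combine_surrogates is hypothetical,
    chosen for this example. It shows the same low surrogate 0xDF02
    yielding a different code point for each preceding high surrogate.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point encoded by a UTF-16 surrogate pair."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    # Each surrogate contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The same low surrogate, 0xDF02, paired with three different high surrogates.
for high in (0xD800, 0xD801, 0xDBFF):
    print(f"{high:#06x} {0xDF02:#06x} -> U+{combine_surrogates(high, 0xDF02):04X}")
```

    Taken alone, 0xDF02 fixes only the low 10 bits of the result; the
    remaining bits come from whichever high surrogate precedes it.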

    Leaving out the BOM, the interpretations of all non-surrogate code units
    in a UTF-16 text stream are context-free; the interpretations of all
    surrogate code units in the same stream are context-bound. That is why I
    refer to surrogates as a stateful encoding mechanism, and as subject to
    fragment fragility.
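
    A minimal sketch of what a fragment boundary does to a surrogate
    pair, assuming a stream of 16-bit code units held as a list of
    integers with the BOM already removed. The function name
    decode_utf16_units and the choice to substitute U+FFFD for an
    unpaired surrogate are assumptions made for this illustration, not
    behavior prescribed in this thread.

```python
def decode_utf16_units(units):
    """Decode a list of UTF-16 code units into a list of code points.

    Unpaired surrogates are replaced with U+FFFD (an illustrative choice).
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Context-bound: the two units are interpreted together.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # A surrogate without its partner cannot be interpreted alone.
            out.append(0xFFFD)
            i += 1
        else:
            # Context-free: a non-surrogate unit is its own code point.
            out.append(u)
            i += 1
    return out


stream = [0x0041, 0xD800, 0xDF02, 0x0042]
whole = decode_utf16_units(stream)         # pair decoded as one code point
fragment = decode_utf16_units(stream[2:])  # fragment starts inside the pair
print([f"U+{cp:04X}" for cp in whole])
print([f"U+{cp:04X}" for cp in fragment])
```

    The non-surrogate units 0x0041 and 0x0042 decode identically in the
    whole stream and in the fragment; only the surrogate changes meaning
    when the fragment boundary separates it from its partner.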

    Dean A. Snyder

    Assistant Research Scholar
    Manager, Digital Hammurabi Project
    Computer Science Department
    Whiting School of Engineering
    218C New Engineering Building
    3400 North Charles Street
    Johns Hopkins University
    Baltimore, Maryland, USA 21218

    office: 410 516-6850
    cell: 717 817-4897
