RE: Stateful?

From: Kenneth Whistler
Date: Tue May 27 2008 - 17:08:41 CDT

    Philippe Verdy said:

    > Of course yes! UTF-8 (and also UTF-16) is a stateful encoding because you
    > have to remember the state of the previous leading bytes (or leading high
    > surrogate in UTF-16) when a non-leading byte (or non-leading low surrogate
    > in UTF-16) occurs.
    > However, this state is bounded in UTF-8 (not in ISO 2022), where you may
    > need to remember the state for unlimited distance from where it was set by
    > another prior code.
    > "Stateful" is not a particularly useful distinction for encodings. In fact,
    > almost everything we handle is stateful (starting at least at the bit level:
    > you need to keep the state of some other prior bits to recognize distinct
    > codes for distinct characters).

    O.k., I'm not going to let that one stand unchallenged, even though
    Philippe does go on to provide appropriate analysis as to why the unbounded
    statefulness of ISO 2022 causes problems.

    The Unicode character encoding is *not* stateful in any
    meaningful CS sense of the term "stateful". Each character's
    identity is unambiguously determined by its code point --
    and unlike true stateful encodings such as EBCDIC SI/SO schemes
    or ISO 2022, there are no *encoding* states built in which
    change the interpretation of following characters.
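
To make the contrast concrete, here is a minimal sketch, assuming Python 3 and its built-in `iso2022_jp` codec. The choice of ISO-2022-JP and of the character U+65E5 is an assumption made for illustration and does not come from the thread. In a truly stateful encoding, the same two bytes are read differently depending on whether an earlier escape sequence has switched the active character set.

```python
# Illustration only: ISO-2022-JP as one example of an ISO 2022-style stateful encoding.
ch = "\u65e5"  # arbitrary kanji chosen for the example

encoded = ch.encode("iso2022_jp")

# Layout produced by the codec: ESC $ B (switch to JIS X 0208), two payload bytes,
# then ESC ( B (switch back to ASCII).
shift_in, payload, shift_out = encoded[:3], encoded[3:5], encoded[5:]

# The payload read after the escape sequence, i.e. in the JIS X 0208 state.
with_state = encoded.decode("iso2022_jp")

# The very same two bytes read in the initial ASCII state.
without_state = payload.decode("iso2022_jp")

print(repr(with_state), repr(without_state))
```

The two results differ even though the payload bytes are identical, which is the sense of "stateful" at issue here.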

    To claim that UTF-8 or UTF-16 are "stateful" because
    "you have to remember the state of the previous leading bytes"
    is just bogus.

    <E4 BA 8C> simply *is* the UTF-8 encoding form of U+4E8C --
    there is nothing "stateful" about it. To claim that it
    is stateful is as wrong as claiming, for example, that
    the UTF-32 encoding form of U+4E8C, i.e. 00004E8C, is
    "stateful" because for the ...E8C to be part of the
    encoded character U+4E8C, the first 20 bits have
    to be 0x00004. Well, yes they do, but that doesn't
    make 00004E8C "stateful", either.
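
The point can be checked with a minimal sketch, assuming Python 3 and its built-in `utf-8` and `utf-32-be` codecs. The prefixes below are arbitrary examples chosen for illustration. Whatever well-formed text precedes it, the sequence <E4 BA 8C> always decodes to U+4E8C.

```python
# Illustration only: the encoding forms of U+4E8C, using Python's built-in codecs.
ch = "\u4e8c"

utf8 = ch.encode("utf-8")        # the three-byte UTF-8 encoding form
utf32 = ch.encode("utf-32-be")   # the four-byte UTF-32 encoding form, big-endian

# Arbitrary well-formed prefixes: empty, ASCII, a 2-byte sequence, a 4-byte sequence.
prefixes = [b"", b"abc", "\u00e9".encode("utf-8"), "\U0001F600".encode("utf-8")]

# Decode each prefix followed by <E4 BA 8C> and keep the last character.
decoded_last = [(p + utf8).decode("utf-8")[-1] for p in prefixes]

# The leading 20 bits of the UTF-32 code unit, as discussed above.
leading_20_bits = int.from_bytes(utf32, "big") >> 12

print(utf8.hex(), utf32.hex(), decoded_last, hex(leading_20_bits))
```

No earlier byte changes how the sequence is interpreted. The lead byte only marks where the sequence starts and how long it is.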

    There *are* parts of the Unicode Standard that involve the
    use of Unicode characters in stateful representations.
    Some of those have been mentioned here: Plane 14 Language Tags,
    and U+FFF9..U+FFFB interlinear annotation characters.
    To those you can add the U+2FF0..U+2FFB ideographic
    description characters. For that matter, the Unicode
    characters U+003C LESS-THAN SIGN and U+003E GREATER-THAN SIGN
    are used statefully in HTML. But all of those are *meta*
    phenomena. The encoding of the characters themselves in
    Unicode is not stateful. Nor are the UTF-8 and UTF-16
    encoding forms.

    John Jenkins said:

    > UTF-16, after all, is stateful: if you lose the BOM,
    > things can look very different.

    That is true of the UTF-16 encoding *scheme*. (See TUS 5.0,
    D98, p. 106.) That is because in the UTF-16 encoding scheme,
    an initial BOM is itself a stateful switch for byte order.
    UTF-16BE and UTF-16LE, on the other hand, are not stateful.
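
A minimal sketch of that difference, assuming Python 3, where the `utf-16` codec behaves like the UTF-16 encoding scheme (it reads an initial BOM to pick the byte order) and `utf-16-be` / `utf-16-le` use a fixed byte order. The sample character U+4E8C is reused from the example above.

```python
# Illustration only: BOM-driven byte order versus fixed byte order.
payload = "\u4e8c".encode("utf-16-be")   # the two bytes 4E 8C

with_be_bom = b"\xfe\xff" + payload      # BOM announcing big-endian
with_le_bom = b"\xff\xfe" + payload      # BOM announcing little-endian

# "utf-16": the initial BOM selects how every following code unit is read.
read_after_be_bom = with_be_bom.decode("utf-16")
read_after_le_bom = with_le_bom.decode("utf-16")

# "utf-16-be" / "utf-16-le": the byte order is fixed by the codec itself,
# so no byte in the stream switches the interpretation.
read_as_be = payload.decode("utf-16-be")
read_as_le = payload.decode("utf-16-le")

print(repr(read_after_be_bom), repr(read_after_le_bom))
print(repr(read_as_be), repr(read_as_le))
```

The same payload bytes come out as different characters depending on the BOM in front of them, whereas the explicitly labeled variants give a fixed answer.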

