Re: Stateful encoding mechanisms

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 19 2005 - 19:08:02 CDT


    In addition to clarifications provided by Peter and Philippe,
    which I won't repeat, ...

    > Surely you are not denying that surrogates, ... are stateful mechanisms?
    > It is irrelevant for the discussion
    > of stateful mechanisms in encoding and the problems they pose for
    > fragment interpretability whether or not those mechanisms are in the
    > text content; they are in the text stream and must be dealt with.

    Surrogate pairs are *not* a stateful mechanism in the sense
    that that term is generally applied to character encodings.

    Dean quoted:

    > SURROGATES:
    >
    > The Unicode Standard 4.1, section 3.9
    > "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
    > represented as
    > <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."

    He failed to quote the parallel text nearby:

    "In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is
    represented as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where ...
    <F0 90 8C 82> corresponds to U+10302."

    This is not "stateful" -- in both cases it is simply an encoding
    scheme that has a non-one-to-one mapping of code units to
    encoded characters.
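
    As a quick check of the two quoted representations, here is a
    minimal Python sketch. The helper names utf16_units and utf8_bytes
    are made up for illustration; only the standard str.encode codecs
    are relied on.

```python
# Code points from the quoted passage: U+004D, U+0430, U+4E8C, U+10302.
text = "\u004D\u0430\u4E8C\U00010302"

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def utf8_bytes(s):
    """Return the UTF-8 bytes of s as a list of integers."""
    return list(s.encode("utf-8"))

units = utf16_units(text)
octets = utf8_bytes(text)

print(" ".join(f"{u:04X}" for u in units))   # 004D 0430 4E8C D800 DF02
print(" ".join(f"{b:02X}" for b in octets))  # 4D D0 B0 E4 BA 8C F0 90 8C 82
```

    Four code points become five 16-bit code units in one case and ten
    bytes in the other; the only difference is how many code units each
    character takes.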

    In UTF-16, 0xD800 does not set a "state" which then alters the
    interpretation of a subsequent code unit. 0xDF02 has its own, unique
    status, regardless of what precedes or follows it. Some sequences
    are valid, some are not -- that's all.

    In UTF-8, 0xF0 does not set a "state" which then alters the
    interpretation of a subsequent byte. 0x90 has its own, unique
    status, regardless of what precedes or follows it. Some sequences
    are valid, some are not -- that's all.
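
    To illustrate that point, the sketch below classifies a single
    UTF-16 code unit or UTF-8 byte purely from its own value, with no
    reference to anything before or after it. The function names and
    the label strings are my own, chosen for illustration; the value
    ranges are those of the UTF-16 and UTF-8 definitions.

```python
def classify_utf16_unit(u):
    """Classify one 16-bit code unit by its value alone."""
    if 0xD800 <= u <= 0xDBFF:
        return "lead surrogate"
    if 0xDC00 <= u <= 0xDFFF:
        return "trail surrogate"
    return "BMP code unit"

def classify_utf8_byte(b):
    """Classify one byte by its value alone."""
    if b <= 0x7F:
        return "single-byte (ASCII)"
    if 0x80 <= b <= 0xBF:
        return "continuation byte"
    if 0xC2 <= b <= 0xDF:
        return "lead byte of 2-byte sequence"
    if 0xE0 <= b <= 0xEF:
        return "lead byte of 3-byte sequence"
    if 0xF0 <= b <= 0xF4:
        return "lead byte of 4-byte sequence"
    return "never valid in UTF-8"  # C0, C1, F5..FF

print(classify_utf16_unit(0xD800))  # lead surrogate
print(classify_utf16_unit(0xDF02))  # trail surrogate
print(classify_utf8_byte(0xF0))     # lead byte of 4-byte sequence
print(classify_utf8_byte(0x90))     # continuation byte
```

    Neither function takes a "current state" argument, and none is
    needed: 0xDF02 is a trail surrogate and 0x90 is a continuation
    byte wherever they occur. Whether the surrounding sequence is
    valid is a separate question.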

    The ISO 2022 framework, on the other hand, *is* generally acknowledged
    to be a stateful approach to character encoding. See the example
    shown in Figure 1-2, p. 4 of TUS 4.0:

    The presence of the byte sequence <1B 2D> in an ISO 2022 text stream
    *alters* the interpretation of an immediately following 0x46 byte
    from being LATIN CAPITAL LETTER F to being a code set shifter picking
    the character set ISO 8859-7, which sets a further state changing
    the interpretation of all subsequent bytes in the stream (until
    the next escape sequence).

    The presence of the byte sequence <1B 24 42> in an ISO 2022 text stream
    *alters* the interpretation of an immediately following 0x46 byte
    from being LATIN CAPITAL LETTER F to being the initial byte of
    a two-byte JIS X 0208 encoding of the Japanese ideograph for
    'hi' "day", and sets a state changing the interpretation of all
    subsequent bytes in the stream (until the next escape sequence).

    *That* is stateful character encoding.
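
    The <1B 24 42> case can be tried with Python's iso2022_jp codec
    (ISO-2022-JP being one profile of ISO 2022). This is only a sketch:
    the second byte 0x7C is supplied here to complete the two-byte
    character and is not given above, and the variable names are
    invented for illustration.

```python
# Escape sequences used by ISO-2022-JP.
ESC_TO_TWO_BYTE = b"\x1b$B"  # <1B 24 42>: switch to a two-byte character set
ESC_TO_ASCII = b"\x1b(B"     # <1B 28 42>: switch back to ASCII

# 0x46 followed by 0x7C (the 0x7C is an assumption added to complete the pair).
payload = b"\x46\x7c"

# Without a preceding escape sequence the two bytes are plain ASCII.
plain = payload.decode("iso2022_jp")

# After <1B 24 42> the very same two bytes form one ideograph.
shifted = (ESC_TO_TWO_BYTE + payload + ESC_TO_ASCII).decode("iso2022_jp")

print(repr(plain))                  # 'F|'
print(shifted, hex(ord(shifted)))   # 日 0x65e5
```

    The bytes 46 7C cannot be interpreted from the fragment alone; the
    answer depends on which escape sequence, if any, was seen earlier
    in the stream.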

    --Ken


