Re: Stateful encoding mechanisms

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 19 2005 - 19:08:02 CDT

Next message: Kenneth Whistler: "Re: Stateful encoding mechanisms"

Previous message: Mark Davis: "Re: ASCII and Unicode lifespan"
Maybe in reply to: Dean Snyder: "Stateful encoding mechanisms"
Reply: Dean Snyder: "Re: Stateful encoding mechanisms"

In addition to clarifications provided by Peter and Philippe,
which I won't repeat, ...

> Surely you are not denying that surrogates, ... are stateful mechanisms?
> It is irrelevant for the discussion
> of stateful mechanisms in encoding and the problems they pose for
> fragment interpretability whether or not those mechanisms are in the
> text content; they are in the text stream and must be dealt with.

Surrogate pairs are *not* a stateful mechanism in the sense
that that term is generally applied to character encodings.

Dean quoted:

> SURROGATES:
>
> The Unicode Standard 4.1, section 3.9
> "In UTF-16, the code point sequence <004D, 0430, 4E8C, 10302> is
> represented as
> <004D 0430 4E8C D800 DF02>, where <D800 DF02> corresponds to U+10302."

He failed to quote the parallel text nearby:

"In UTF-8, the code point sequence <004D, 0430, 4E8C, 10302> is
represented as <4D D0 B0 E4 BA 8C F0 90 8C 82>, where ...
<F0 90 8C 82> corresponds to U+10302."

This is not "stateful" -- in both cases it is simply an encoding
scheme that has a non-one-to-one mapping of code units to
encoded character.
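
The two quoted byte and code unit sequences can be reproduced with a
short Python sketch. The variable names are illustrative only, and
big-endian UTF-16 without a BOM is assumed so that the code units read
in the same order as in the quoted text.

```python
# The code point sequence <004D, 0430, 4E8C, 10302> from section 3.9.
text = "\u004D\u0430\u4E8C\U00010302"

# Encode once per encoding form (big-endian, no BOM, is an assumption).
utf16 = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

# Split the UTF-16 bytes into 16-bit code units, and UTF-8 into bytes.
units = [utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)]
octets = [f"{b:02X}" for b in utf8]

print(units)   # four code points, five code units
print(octets)  # four code points, ten bytes
```

In both forms the count of code units differs from the count of
encoded characters; only the size of the code unit differs.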

In UTF-16, 0xD800 does not set a "state" which then alters the
interpretation of a subsequent code unit. 0xDF02 has its own, unique
status, regardless of what precedes or follows it. Some sequences
are valid, some are not -- that's all.

In UTF-8, 0xF0 does not set a "state" which then alters the
interpretation of a subsequent byte. 0x90 has its own, unique
status, regardless of what precedes or follows it. Some sequences
are valid, some are not -- that's all.
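
One way to see the point of the two paragraphs above: the role of any
single UTF-16 code unit or UTF-8 byte can be read off its numeric value
alone, with no memory of what came before. The following is a minimal
sketch; the function names and the label strings are made up for
illustration and are not terminology from the standard.

```python
def utf16_unit_kind(unit: int) -> str:
    """Classify one UTF-16 code unit from its value alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP code point"


def utf8_byte_kind(byte: int) -> str:
    """Classify one UTF-8 byte from its value alone."""
    if byte <= 0x7F:
        return "ASCII"
    if byte <= 0xBF:
        return "continuation byte"
    if 0xC2 <= byte <= 0xDF:
        return "lead byte of 2"
    if 0xE0 <= byte <= 0xEF:
        return "lead byte of 3"
    if 0xF0 <= byte <= 0xF4:
        return "lead byte of 4"
    return "never valid in UTF-8"  # C0, C1, F5..FF


# No argument for "current state" is needed by either function.
print(utf16_unit_kind(0xD800), "/", utf16_unit_kind(0xDF02))
print(utf8_byte_kind(0xF0), "/", utf8_byte_kind(0x90))
```

Whether a given sequence of such units is well-formed is a separate
question, but the classification of each unit never changes.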

The ISO 2022 framework, on the other hand, *is* generally acknowledged
to be a stateful approach to character encoding. See the example
shown in Figure 1-2, p. 4 of TUS 4.0:

The presence of the byte sequence <1B 2D> in an ISO 2022 text stream
*alters* the interpretation of an immediately following 0x46 byte
from being LATIN CAPITAL LETTER F to being a code set shifter picking
the character set ISO 8859-7, which sets a further state changing
the interpretation of all subsequent bytes in the stream (until
the next escape sequence).

The presence of the byte sequence <1B 24 42> in an ISO 2022 text stream
*alters* the interpretation of an immediately following 0x46 byte
from being LATIN CAPITAL LETTER F to being the initial byte of
a two-byte JIS X 0208 encoding of the Japanese ideograph
'hi' ("day"), and sets a state changing the interpretation of all
subsequent bytes in the stream (until the next escape sequence).

*That* is stateful character encoding.
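
The <1B 24 42> case can be tried out with Python's built-in
"iso2022_jp" codec. This is only a sketch: the choice of that codec,
the trailing byte 0x7C that completes the two-byte code, and the
constant names are assumptions made for the illustration, not part of
the figure cited above.

```python
import codecs

ESC_TWO_BYTE = b"\x1b$B"  # <1B 24 42>: switch to the two-byte Japanese set
ESC_ASCII = b"\x1b(B"     # <1B 28 42>: switch back to ASCII
PAIR = b"\x46\x7c"        # the same two bytes in both experiments

# Without a preceding escape sequence the bytes are plain ASCII.
plain = PAIR.decode("iso2022_jp")

# After the escape sequence the very same bytes form one ideograph.
shifted = (ESC_TWO_BYTE + PAIR + ESC_ASCII).decode("iso2022_jp")

# An incremental decoder has to remember the state between chunks.
decoder = codecs.getincrementaldecoder("iso2022_jp")()
first = decoder.decode(ESC_TWO_BYTE)  # produces no text, only sets state
second = decoder.decode(PAIR)         # interpreted under that state

print(repr(plain), repr(shifted), repr(first), repr(second))
```

A fragment consisting of the bytes <46 7C> alone cannot be interpreted
correctly unless the decoder knows which escape sequence preceded it.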

--Ken

This archive was generated by hypermail 2.1.5 : Thu May 19 2005 - 19:09:13 CDT