RE: The future of UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jul 22 1999 - 16:57:21 EDT


Gianni,

> If you need to process BOM's (10646 signatures) it is then stateful.

How so?

The Unicode character encoding itself is not stateful.

The UTF-16 encoding form is not stateful.

The UTF-16BE and UTF-16LE UTF's (serializations) are not stateful.

UTF-16 as a UTF (serialization) is ambiguous as to the byte order
of the serialization. That ambiguity is resolved in one of several
ways:
   1. A higher order protocol. At which point, the data processing
          is not stateful.
   2. By detection of a BOM. When the BOM is detected and interpreted,
          the data processing of the textual content is not stateful.
   3. By heuristics. And while the heuristic processing itself might
          be stateful, once the outcome of the heuristic provides
          an answer for the byte order, subsequent processing is
          not stateful. And this is in effect no different than any
          heuristic applied to detect character set, whether that
          character set itself is a stateful encoding or not.
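
To make case 2 concrete, here is a minimal sketch in Python. The
function names and the big-endian fallback are assumptions made for
illustration, not anything prescribed by the standard. The byte order
is determined once, from the first two bytes, and the rest of the data
is then decoded with a fixed rule that never changes again:

```python
import codecs

def detect_utf16_byte_order(data: bytes, default: str = "utf-16-be"):
    """Return (codec_name, payload), consuming a leading BOM if present.

    `default` stands in for whatever a higher-level protocol would
    supply when there is no BOM (an assumption for this sketch).
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", data[2:]
    return default, data

def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    # One-time decision up front; after this the decoder carries no
    # mode that later bytes could switch.
    codec, payload = detect_utf16_byte_order(data, default)
    return payload.decode(codec)
```

Both decode_utf16(b"\xff\xfeh\x00i\x00") and
decode_utf16(b"\xfe\xff\x00h\x00i") yield the same two-character
string; nothing after the first two bytes can alter how later code
units are read.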

The term "stateful", as applied to character encodings, usually
refers to architectures like ISO 2022, where the state
induced by an escape sequence must be retained to interpret all
subsequent bytes, until another escape sequence changes
the state, and thus the interpretation of the next run of bytes.
That is quite different from determining the byte polarity "state"
of a data type before processing it. If that counted as stateful, then you
could equally well claim that processing any integral datatype
larger than a byte is "stateful" in a cross-platform environment.
But that is diluting the term "stateful" in the character encoding
context down to the point where it has nothing in common with
its intended applicability.
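
The contrast can be sketched in a few lines of Python, using its
built-in iso2022_jp codec as one example of an ISO 2022 style
encoding (the choice of codec and of sample bytes is purely
illustrative). The same two bytes mean different things depending on
which escape sequence preceded them, whereas a UTF-16BE code unit, or
a 16-bit integer of known byte order, reads the same wherever it
appears:

```python
import struct

raw = b"\x24\x22"

# ISO 2022 style: interpretation depends on the state left behind by
# the most recent escape sequence.
in_ascii_state = raw.decode("iso2022_jp")                        # initial state is ASCII
in_kanji_state = (b"\x1b$B" + raw + b"\x1b(B").decode("iso2022_jp")  # after ESC $ B

# UTF-16BE: a code unit is read the same way no matter what precedes it.
alone = b"\x00\x41".decode("utf-16-be")
after_other_text = (b"\x30\x42" + b"\x00\x41").decode("utf-16-be")[-1]

# Byte order of a plain 16-bit integer: one fact known up front,
# with no running state to track.
big_endian_value = struct.unpack(">H", b"\x00\x41")[0]
little_endian_value = struct.unpack("<H", b"\x00\x41")[0]
```

Here in_ascii_state is the two ASCII characters '$"' while
in_kanji_state is the single character U+3042; alone and
after_other_text are both 'A'.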

--Ken


