Re: HTML5 encodings

From: Doug Ewell
Date: Sun Dec 27 2009 - 22:09:59 CST

    Asmus Freytag <asmusf at ix dot netcom dot com> wrote:

    > The second metric refers to encodings like ISO-2022 or SCSU which use
    > control bytes or sequences to switch among character sets. There are
    > cases where such a scheme could be set up to allow easy
    > resynchronization in terms of character boundaries, yet still require
    > that state information be maintained for very long (unbounded)
    > stretches of data. Assume a 2022-style combination of several
    > single-byte character sets. If that restriction is known (by
    > announcement), then resynchronizing to any character boundary is
    > trivial (as long as you recognize and avoid the escape codes).
    > However, interpreting (or correctly converting) any given character is
    > impossible without going back to the most recent character set
    > switching escape code.
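
    The scenario described above can be sketched with a toy model. This
    is not real ISO 2022: the two-byte escape format (ESC plus one
    designator byte), the designator letters, and the function names are
    all assumptions made for illustration. Every text byte is its own
    character boundary, yet interpreting a byte means scanning backward
    an unbounded distance for the most recent escape.

```python
# Toy model only, NOT real ISO 2022. Assumed format: ESC followed by one
# designator byte selects a single-byte charset for all following text bytes.
ESC = 0x1B
CHARSETS = {                 # hypothetical designator bytes
    ord("A"): "latin-1",
    ord("G"): "iso8859-7",   # Greek
    ord("C"): "iso8859-5",   # Cyrillic
}

def charset_at(data: bytes, pos: int, default: str = "latin-1"):
    """Return (charset, bytes_scanned_back) for the text byte at pos.

    Assumes pos points at a text byte, not at ESC or a designator byte.
    """
    for i in range(pos - 1, -1, -1):
        if data[i] == ESC:
            return CHARSETS[data[i + 1]], pos - i
    return default, pos      # no escape seen: the default charset applies

def decode_byte_at(data: bytes, pos: int) -> str:
    charset, _ = charset_at(data, pos)
    return data[pos:pos + 1].decode(charset)

# The same byte value 0xE1 means different things depending on state.
stream = (bytes([ESC, ord("G")]) + b"\xe1" * 1000
          + bytes([ESC, ord("C")]) + b"\xe1")

print(charset_at(stream, 1001))             # must scan back ~1000 bytes
print(charset_at(stream, len(stream) - 1))  # escape is right behind
```

    In this sketch the cost of interpreting one byte grows with the
    distance to the preceding escape, which is the unbounded-state
    problem, even though finding a character boundary costs nothing.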

    BOCU-1 has a handy "reset" mechanism, in which the byte 0xFF doesn't
    participate in the encoding of any character, but simply resets the
    state of the encoder or decoder. If desired, reset bytes could be
    inserted at certain intervals within a stream to ensure the
    availability of a synchronization point, solving the problem above.
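
    To illustrate the idea of periodic reset points, here is a minimal
    sketch of a toy difference encoder. It is not BOCU-1: the zigzag
    varint layout, the reset_every parameter, and the function names are
    assumptions for illustration. It borrows only two things from the
    description above, a state carried from one character to the next
    and a byte 0xFF that encodes no character and resets that state.

```python
# Toy stateful difference encoder, NOT BOCU-1. Each code point is stored
# as its difference from the previous one; 0xFF is reserved as a reset.
RESET = 0xFF
INITIAL = 0x40   # assumed initial state for this sketch

def _put_varint(z: int, out: bytearray) -> None:
    # 6 payload bits per byte, 0x40 = "more bytes follow"; never emits 0xFF.
    while True:
        low = z & 0x3F
        z >>= 6
        if z:
            out.append(low | 0x40)
        else:
            out.append(low)
            return

def encode(text: str, reset_every: int = 0) -> bytes:
    out = bytearray()
    prev = INITIAL
    for n, ch in enumerate(text):
        if reset_every and n and n % reset_every == 0:
            out.append(RESET)          # synchronization point
            prev = INITIAL
        d = ord(ch) - prev
        _put_varint(2 * d if d >= 0 else -2 * d - 1, out)  # zigzag
        prev = ord(ch)
    return bytes(out)

def decode(data: bytes) -> str:
    chars = []
    prev, z, shift = INITIAL, 0, 0
    for b in data:
        if b == RESET:
            prev, z, shift = INITIAL, 0, 0
            continue
        z |= (b & 0x3F) << shift
        shift += 6
        if b & 0x40:
            continue                   # more bytes of this difference
        prev += (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)
        chars.append(chr(prev))
        z = shift = 0
    return "".join(chars)

def decode_from(data: bytes, start: int) -> str:
    """Decode from an arbitrary offset by skipping to the next reset."""
    nxt = data.find(RESET, start)
    return decode(data[nxt:]) if nxt != -1 else ""

text = "Ελληνικά κείμενο " * 20
encoded = encode(text, reset_every=50)
tail = decode_from(encoded, 123)       # start somewhere in the middle
print(text.endswith(tail), len(tail))
```

    With a reset every 50 characters, a reader dropped at an arbitrary
    offset loses at most the characters before the next 0xFF. Without
    resets, decode_from finds no synchronization point in this sketch
    and returns an empty string.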

    However, such a mechanism naturally means a code point sequence could be
    encoded in BOCU-1 in more than one way, and it could interfere with the
    seemingly all-important binary-ordering property of BOCU-1, so the
    authors apparently felt compelled to invoke the Principle of

    "Using FF to reset the state breaks the ordering! The use of FF resets
    is discouraged."
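
    The ordering problem can be shown without implementing BOCU-1 at
    all. The sketch below uses UTF-8 purely as a stand-in for a
    binary-ordered encoding (its bytewise order matches code point
    order, and 0xFF never occurs in it) and treats 0xFF as a
    hypothetical no-op reset byte. The helper names are assumptions.

```python
# UTF-8 is used here only as a stand-in for a binary-ordered encoding;
# the 0xFF "reset" byte is a hypothetical addition, not part of UTF-8.
RESET = b"\xff"

def encode_with_reset(text: str, reset_at=None) -> bytes:
    """Encode text, optionally inserting a reset byte before index reset_at."""
    if reset_at is None:
        return text.encode("utf-8")
    return text[:reset_at].encode("utf-8") + RESET + text[reset_at:].encode("utf-8")

def decode_ignoring_resets(data: bytes) -> str:
    # The reset byte carries no character, so the decoder drops it.
    return data.replace(RESET, b"").decode("utf-8")

plain_ab = encode_with_reset("ab")
reset_ab = encode_with_reset("ab", reset_at=1)   # b"a\xffb"
plain_ac = encode_with_reset("ac")

# Two different byte sequences for the same text:
print(plain_ab != reset_ab,
      decode_ignoring_resets(reset_ab) == decode_ignoring_resets(plain_ab))

# Code point order says "ab" < "ac"; the byte order agrees only
# when no reset byte has been inserted.
print("ab" < "ac", plain_ab < plain_ac, reset_ab < plain_ac)
```

    Because 0xFF compares higher than any other byte, a string carrying
    a reset sorts after strings it should precede, and the same text no
    longer has a single encoded form.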

    The reset mechanism doesn't seem to be mentioned in the BOCU patent.

    Doug Ewell  |  Thornton, Colorado, USA  |
    RFC 5645, 4645, UTN #14  |  ietf-languages @

    This archive was generated by hypermail 2.1.5 : Sun Dec 27 2009 - 22:12:11 CST