Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 03 2007 - 18:14:54 CST


    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
    >
    > This is correct if one sees the leading BOM as if it was encoding a
    > significant codepoint which is part of the text, and then encoded with
    > the normal BOCU-1 algorithm; however this paragraph does not state the
    > effect of this encoded sequence on the current state of the encoder
    > and of the decoder.
    >
    > This may cause a difference when interpreting the bytes that follow
    > this BOM, if the next one is not an ASCII byte, because the initial state is
    > normally prev=0x40; but according to the BOCU-1 profile, this sequence
    > should change the state to prev=0xFEC0 (according to rule R5
    > Adjustment: "d. Otherwise, set prev to the middle of a 128-block:
    > prev=(c&~0x7F)+0x40").

    I covered all of this three years ago in Unicode Technical Note #14.
    Look for the paragraph in the BOCU-1 section that begins "Because each
    character..."

    It's possible to encode a signature safely in BOCU-1 by following it
    with an FF reset byte, as you and Frank observed, but the spec
    discourages FF resets.
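
To make the state question concrete, here is a minimal sketch in Python. It is
not a BOCU-1 encoder: it implements only the default "middle of a 128-block"
adjustment quoted above and ignores the special cases for Hiragana, CJK
ideographs, and Hangul, none of which apply to U+FEFF. The names
INITIAL_PREV, simple_prev, and delta are made up for this illustration.

```python
INITIAL_PREV = 0x40  # BOCU-1 initial state; also the state after an FF reset

def simple_prev(c: int) -> int:
    """Default adjustment only: the middle of the 128-block containing c.

    Assumption: c is outside the ranges that BOCU-1 treats specially
    (Hiragana, CJK ideographs, Hangul), which this sketch does not handle.
    """
    return (c & ~0x7F) + 0x40

def delta(c: int, prev: int) -> int:
    """Signed difference that the encoder would go on to turn into bytes."""
    return c - prev

BOM = 0xFEFF
NEXT = 0x00E9  # an arbitrary non-ASCII character following the signature

# State left behind by encoding U+FEFF as an ordinary character.
prev_after_bom = simple_prev(BOM)

# Difference for the next character, with and without a reset in between.
delta_without_reset = delta(NEXT, prev_after_bom)
delta_with_reset = delta(NEXT, INITIAL_PREV)

print(hex(prev_after_bom), delta_without_reset, delta_with_reset)
```

The two differences are nowhere near each other, so a decoder that skips the
signature without carrying over its state (or without an FF reset after it)
would be working from the wrong prev for whatever comes next.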

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

