From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 03 2007 - 18:14:54 CST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
>
> This is correct if one sees the leading BOM as if it was encoding a
> significant codepoint which is part of the text, and then encoded with
> the normal BOCU-1 algorithm; however this paragraph does not state the
> effect of this encoded sequence on the current state of the encoder
> and of the decoder.
>
> This may cause a difference when interpreting the next bytes after
> this BOM, if this is not an ASCII byte, because the initial state is
> normally prev=0x40; but according to the BOCU-1 profile, this sequence
> should change the state to prev=0xFEC0 (according to rule R5
> Adjustment: "d. Otherwise, set prev to the middle of a 128-block:
> prev=(c&~0x7F)+0x40").
I covered all of this three years ago in Unicode Technical Note #14.
Look for the paragraph in the BOCU-1 section that begins "Because each
character..."
It's possible to encode a signature safely in BOCU-1 by following it
with an FF reset byte, as you and Frank observed, but the spec
discourages FF resets.
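
To make the state change concrete, here is a rough Python sketch of a
BOCU-1 encoder. It is written from the constants in the BOCU-1
description (UTN #6) as I understand them and is not ICU's code; the
function names (next_prev, pack_diff, encode) are made up for this
illustration. It reproduces the FB EE 28 signature, shows that prev is
0xFEC0 afterwards, and shows that an FF reset puts the encoder back in
the initial state.

```python
# Illustrative sketch only; names are hypothetical, constants are from UTN #6.
ASCII_PREV, MIDDLE, RESET, TRAIL_COUNT = 0x40, 0x90, 0xFF, 243
REACH_POS_1, REACH_NEG_1 = 63, -64
REACH_POS_2, REACH_NEG_2 = 10512, -10513
REACH_POS_3, REACH_NEG_3 = 187659, -187660
# Trail values 0..19 map to the C0 bytes that BOCU-1 allows as trail bytes.
TRAIL_CONTROLS = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x10, 0x11, 0x12, 0x13,
                  0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1C, 0x1D, 0x1E, 0x1F]

def next_prev(c):
    """State after encoding code point c (the prev adjustment rule)."""
    if 0x3040 <= c <= 0x309F:      # Hiragana: middle of the block
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:      # CJK Unihan
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:      # Hangul syllables
        return 0xC1D1
    return (c & ~0x7F) + 0x40      # otherwise: middle of the 128-block

def pack_diff(diff):
    """Encode a signed difference as 1 to 4 bytes."""
    if REACH_NEG_1 <= diff <= REACH_POS_1:
        return bytes([MIDDLE + diff])
    if diff > 0:
        if diff <= REACH_POS_2:
            diff, lead, count = diff - (REACH_POS_1 + 1), 0xD0, 1
        elif diff <= REACH_POS_3:
            diff, lead, count = diff - (REACH_POS_2 + 1), 0xFB, 2
        else:
            diff, lead, count = diff - (REACH_POS_3 + 1), 0xFE, 3
    else:
        if diff >= REACH_NEG_2:
            diff, lead, count = diff - REACH_NEG_1, 0x50, 1
        elif diff >= REACH_NEG_3:
            diff, lead, count = diff - REACH_NEG_2, 0x25, 2
        else:
            diff, lead, count = diff - REACH_NEG_3, 0x22, 3
    trail = []
    for _ in range(count):
        diff, m = divmod(diff, TRAIL_COUNT)   # floor division, as the spec needs
        trail.append(m + 13 if m >= 20 else TRAIL_CONTROLS[m])
    return bytes([lead + diff] + trail[::-1])

def encode(text, prev=ASCII_PREV):
    """Return (encoded bytes, final prev state) for a Python str."""
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            out.append(c)              # C0 controls and space are sent as-is
            if c != 0x20:
                prev = ASCII_PREV      # controls reset the state, space does not
        else:
            out += pack_diff(c - prev)
            prev = next_prev(c)
    return bytes(out), prev

bom, state = encode("\ufeff")
print(bom.hex(" "), hex(state))        # fb ee 28 0xfec0

plain, _ = encode("\u00e9")                  # no signature, initial state
after_bom, _ = encode("\u00e9", prev=state)  # same character right after the BOM
print(plain.hex(" "), "vs", after_bom.hex(" "))

# Signature followed by an FF reset: the rest encodes as if from the start.
safe = bom + bytes([RESET]) + encode("\u00e9")[0]
print(safe.hex(" "))
```

The point is only that the bytes for the first non-ASCII character
depend on whether the signature was run through the encoder, unless an
FF reset comes between them.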
--
Doug Ewell * Fullerton, California, USA * RFC 4645 * UTN #14
http://users.adelphia.net/~dewell/
http://www1.ietf.org/html.charters/ltru-charter.html
http://www.alvestrand.no/mailman/listinfo/ietf-languages