Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Feb 04 2007 - 15:46:29 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    >> I'm sure it would not be difficult to edit Section 2.5 to explain
    >> this, something like:
    >>
    >> "An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE
    >> 28. Note that adding or stripping an initial U+FEFF generally
    >> requires the next code point above U+0020 to be re-encoded."
    >
    > ... unless a C0 control character (below U+0020) occurs before that
    > code point (above U+0020). There is no re-encoding if the first
    > non-SPACE character after the leading BOM is a control such as an
    > end-of-line sequence or a tab, or if it is a character in the
    > U+FE80..U+FEFF range.

    Correct. Phrasing this in a clear and succinct way is left as an
    exercise.
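
    For readers who want to see the effect rather than take it on
    faith, here is a minimal Python sketch of a BOCU-1 encoder. It is
    not taken from UTS #40 or from any reference implementation; the
    function and constant names (bocu1_encode, pack_diff, next_prev,
    and so on) are my own, and the numeric constants are my reading of
    the published BOCU-1 design, so treat it as an illustration under
    those assumptions. It shows the three-byte form of an initial
    U+FEFF, that a following letter is encoded differently once the
    U+FEFF is present, and that an intervening C0 control resets the
    state so nothing after it changes.

```python
# Illustrative BOCU-1 encoder sketch. All names are hypothetical, and the
# constants are assumed from the published BOCU-1 design.
MIDDLE = 0x90          # lead byte for a difference of zero
ASCII_PREV = 0x40      # initial state, and state after a C0 control
TRAIL_COUNT = 243      # number of distinct trail-byte values
TRAIL_OFFSET = 13      # trail values 20..242 map to bytes 0x21..0xFF

REACH_POS_1, REACH_NEG_1 = 63, -64
REACH_POS_2, REACH_NEG_2 = 10512, -10513
REACH_POS_3, REACH_NEG_3 = 187659, -187660
START_POS_2, START_POS_3, START_POS_4 = 0xD0, 0xFB, 0xFE
START_NEG_2, START_NEG_3, START_NEG_4 = 0x50, 0x25, 0x22

# Trail values 0..19 map to the C0 bytes that BOCU-1 allows as trail bytes.
TRAIL_TO_BYTE = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x10, 0x11, 0x12, 0x13,
                 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1C, 0x1D, 0x1E, 0x1F]


def trail_byte(t):
    return t + TRAIL_OFFSET if t >= len(TRAIL_TO_BYTE) else TRAIL_TO_BYTE[t]


def pack_diff(diff):
    """Encode one signed code point difference as 1 to 4 bytes."""
    if REACH_NEG_1 <= diff <= REACH_POS_1:
        return bytes([MIDDLE + diff])
    if diff > 0:
        if diff <= REACH_POS_2:
            diff, lead, count = diff - (REACH_POS_1 + 1), START_POS_2, 1
        elif diff <= REACH_POS_3:
            diff, lead, count = diff - (REACH_POS_2 + 1), START_POS_3, 2
        else:
            diff, lead, count = diff - (REACH_POS_3 + 1), START_POS_4, 3
    else:
        if diff >= REACH_NEG_2:
            diff, lead, count = diff - REACH_NEG_1, START_NEG_2, 1
        elif diff >= REACH_NEG_3:
            diff, lead, count = diff - REACH_NEG_2, START_NEG_3, 2
        else:
            diff, lead, count = diff - REACH_NEG_3, START_NEG_4, 3
    trail = []
    for _ in range(count):
        # Python's divmod floors, so the remainder is non-negative even
        # when diff is negative, which is what the trail bytes need.
        diff, m = divmod(diff, TRAIL_COUNT)
        trail.append(trail_byte(m))
    return bytes([lead + diff] + trail[::-1])


def next_prev(c):
    """State value used as the base for the next difference."""
    if 0x3040 <= c <= 0x309F:          # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:          # CJK Unified Ideographs
        return 0x4E00 - REACH_NEG_2
    if 0xAC00 <= c <= 0xD7A3:          # Hangul syllables
        return (0xD7A3 + 0xAC00) // 2
    return (c & ~0x7F) + ASCII_PREV    # middle of the 128-code-point block


def bocu1_encode(text):
    out = bytearray()
    prev = ASCII_PREV
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            if c != 0x20:              # C0 controls reset the state; SPACE does not
                prev = ASCII_PREV
            out.append(c)
        else:
            out += pack_diff(c - prev)
            prev = next_prev(c)
    return bytes(out)


if __name__ == "__main__":
    print(bocu1_encode("\ufeff").hex(" "))        # fb ee 28
    print(bocu1_encode("A").hex(" "))             # "A" with no leading U+FEFF
    print(bocu1_encode("\ufeffA").hex(" "))       # same "A", now re-encoded
    print(bocu1_encode("\ufeff\nA").hex(" "))     # LF resets state; "A" unchanged
```

    In this sketch the state after U+FEFF is U+FEC0 instead of the
    initial U+0040, which is why the bytes for the next code point
    above U+0020 differ, and why a line feed or tab in between makes
    the difference disappear.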

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

