Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Philippe Verdy
Date: Sun Feb 04 2007 - 14:55:54 CST

    From: "Doug Ewell"
    > Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
    >> The main risk is caused by the ambiguity of the sentence which does
    >> not indicate that it really encodes the codepoint U+FEFF normally
    >> (i.e. it changes the current state), and that does not specify if the
    >> leading BOM is required or optional.
    > I'm sure it would not be difficult to edit Section 2.5 to explain this,
    > something like:
    > "An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
    > Note that adding or stripping an initial U+FEFF generally requires the
    > next code point above U+0020 to be re-encoded."

    ... unless a C0 control character (below U+0020) occurs before that code point (above U+0020), because a C0 control resets the encoder state to its initial value. No re-encoding is needed if the first non-SPACE character after the leading BOM is a control character, such as an end-of-line sequence or a tab, or if it is a character in the U+FE80..U+FEFF range.
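
To make the state dependency concrete, here is a minimal Python sketch of a BOCU-1 encoder. It is an illustration, not a reference implementation: the constants and the state-update rule are written from my reading of the published BOCU-1 algorithm and should be checked against the specification before being relied on, and all function and constant names are my own.

```python
# Minimal BOCU-1 encoder sketch (names are illustrative assumptions).
ASCII_PREV = 0x40      # initial state, also the state after any C0 control
MIDDLE = 0x90          # single-byte lead for a difference of 0
TRAIL_COUNT = 243      # number of usable trail-byte values
REACH_POS_1, REACH_NEG_1 = 63, -64
REACH_POS_2, REACH_NEG_2 = 10512, -10513
REACH_POS_3, REACH_NEG_3 = 187659, -187660
# Trail values 0..19 map onto the C0 bytes that BOCU-1 allows as trail bytes.
TRAIL_CONTROLS = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x10, 0x11, 0x12, 0x13,
                  0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1C, 0x1D, 0x1E, 0x1F]


def trail_to_byte(m):
    return TRAIL_CONTROLS[m] if m < 20 else m + 13


def pack_diff(diff):
    """Encode one signed difference as 1 to 4 bytes."""
    if REACH_NEG_1 <= diff <= REACH_POS_1:
        return bytes([MIDDLE + diff])
    if diff > 0:
        if diff <= REACH_POS_2:
            diff, lead, n = diff - (REACH_POS_1 + 1), 0xD0, 1
        elif diff <= REACH_POS_3:
            diff, lead, n = diff - (REACH_POS_2 + 1), 0xFB, 2
        else:
            diff, lead, n = diff - (REACH_POS_3 + 1), 0xFE, 3
    else:
        if diff >= REACH_NEG_2:
            diff, lead, n = diff - REACH_NEG_1, 0x50, 1
        elif diff >= REACH_NEG_3:
            diff, lead, n = diff - REACH_NEG_2, 0x25, 2
        else:
            diff, lead, n = diff - REACH_NEG_3, 0x22, 3
    trail = []
    for _ in range(n):
        # divmod floors, so the remainder is non-negative for negative diffs too
        diff, m = divmod(diff, TRAIL_COUNT)
        trail.append(trail_to_byte(m))
    return bytes([lead + diff] + trail[::-1])


def next_prev(c):
    """State after encoding code point c (c > U+0020)."""
    if 0x3040 <= c <= 0x309F:          # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:          # CJK Unihan
        return 0x4E00 - REACH_NEG_2
    if 0xAC00 <= c <= 0xD7A3:          # Hangul syllables
        return (0xD7A3 + 0xAC00) // 2
    return (c & ~0x7F) + ASCII_PREV    # middle of the 128-code-point block


def encode(text):
    prev, out = ASCII_PREV, bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            out.append(c)              # C0 controls and SPACE are written as-is
            if c != 0x20:
                prev = ASCII_PREV      # controls reset the state; SPACE does not
        else:
            out += pack_diff(c - prev)
            prev = next_prev(c)
    return bytes(out)


BOM = encode("\uFEFF")
print(BOM.hex(" "))                                 # fb ee 28
print(encode("\uFEFFA")[3:] == encode("A"))         # False: "A" must be re-encoded
print(encode("\uFEFF\nA") == BOM + encode("\nA"))   # True: the newline resets the state
```

In this sketch, prepending the BOM changes the bytes of a following "A", while a text that starts with a line break after the BOM keeps its remaining bytes unchanged, which is the control-character case described above.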
