Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Frank Ellermann
Date: Sat Feb 03 2007 - 11:47:32 CST


    Philippe Verdy wrote:
    > There's another editorial(?) error in UTS#40, in rule RD6 which states:
    > RD6 Reset Byte: If lb is equal to 0x20, then do not output any code point
    > but set prev=0x40. Continue with the next byte sequence
    > Of course it should not be 0x20 but 0xFF! otherwise it conflicts with
    > rule RD3 (space).

    Yes, obvious typo, the next and last line of RD6 says:
    | * FF is a "reset-state-only" byte.

    In another article you asked what it's good for. You could use it to
    concatenate unknown (but otherwise valid) BOCU-1 strings. You could
    also use it if a source contains no other (or not enough) code points
    causing a reset to prev=0x40, i.e. for strictly non-ASCII sources
    (ignoring SP, which doesn't change the state).
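
A minimal sketch of the concatenation use, assuming each input is already a valid BOCU-1 byte string that was encoded starting from the initial state prev=0x40. The helper name `concat_bocu1` is made up for this illustration; the only fact it relies on is the one quoted above, that FF outputs no code point and resets prev to 0x40.

```python
# Hypothetical helper, not part of any BOCU-1 library.
RESET = b"\xff"  # "reset-state-only" byte: no output, sets prev=0x40


def concat_bocu1(*parts: bytes) -> bytes:
    """Join independently encoded BOCU-1 strings.

    Each part was encoded from the initial state, so the decoder must be
    back in that state when the next part begins. Inserting FF between
    parts forces this regardless of how the previous part ended.
    """
    return RESET.join(parts)
```

The parts are treated as opaque bytes, so nothing needs to be known about the state each one ends in.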

    A single damaged bit can destroy a complete "line" of anything not in
    state prev=0x40. FF makes it possible to limit the "line length" at
    risk. The disadvantages of FF are clearly stated in the last paragraph
    of 2.4.
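
The damage argument can be illustrated with a toy difference coder. This is not BOCU-1: it stores each value as a plain delta from the previous value, and `RESET` stands in for FF. All names here are invented for the illustration; the point is only how far one damaged symbol spreads with and without periodic resets.

```python
RESET = None      # stand-in for the FF reset byte
INITIAL = 0x40    # initial state, mirroring prev=0x40


def toy_encode(values, reset_every=None):
    """Encode ints as deltas, optionally inserting RESET every N values."""
    out, prev = [], INITIAL
    for i, v in enumerate(values):
        if reset_every and i and i % reset_every == 0:
            out.append(RESET)
            prev = INITIAL
        out.append(v - prev)
        prev = v
    return out


def toy_decode(symbols):
    """Decode the delta stream, returning to INITIAL at each RESET."""
    out, prev = [], INITIAL
    for s in symbols:
        if s is RESET:
            prev = INITIAL
            continue
        prev += s
        out.append(prev)
    return out


values = [0x430, 0x431, 0x432, 0x433, 0x434, 0x435]

plain = toy_encode(values)
plain[1] += 1                      # damage one symbol
damaged_plain = toy_decode(plain)  # every value after the damage is wrong

guarded = toy_encode(values, reset_every=3)
guarded[1] += 1                        # same damage
damaged_guarded = toy_decode(guarded)  # values after the RESET are intact
```

In the unguarded stream the error persists to the end; in the guarded stream it stops at the next reset, at the cost of a larger delta after each reset.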

    In a third article you noted that a signature FB EE 28 can't simply
    be removed, because it has a side effect on the state. That's true,
    and it could be noted in chapter 2.5. Using FB EE 28 FF (without side
    effect) is also possible, but I think that's a dubious kludge. Nobody
    promised that removing signatures is always possible without other
    effects.
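
A cautious signature stripper along these lines might look as follows. It is a sketch under the assumptions stated in this thread only: FB EE 28 is the signature, it leaves the decoder in a state other than prev=0x40, and a following FF resets that state. The function name `strip_signature` is hypothetical, and the sketch deliberately does not try to re-encode the text.

```python
SIGNATURE = b"\xfb\xee\x28"  # BOCU-1 signature bytes
RESET = b"\xff"              # resets prev to 0x40


def strip_signature(data: bytes) -> bytes:
    """Remove a leading signature only when that cannot change the decoding."""
    if data == SIGNATURE:
        # Nothing follows, so there is no later byte whose meaning could shift.
        return b""
    if data.startswith(SIGNATURE + RESET):
        # FB EE 28 FF: the reset undoes the state change, so the remainder
        # decodes the same from the initial state.
        return data[len(SIGNATURE) + len(RESET):]
    # Either there is no signature, or removing it would alter the state
    # seen by the following bytes; leave the data untouched.
    return data
```

Anything that falls through to the last branch with a signature present would need a real decode and re-encode to drop the signature safely.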

    I don't think it's a "serious bug", it's only a potential trap, and if
    that's explicitly noted in chapter 2.5 it's a (harmless) "feature".

