Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Doug Ewell (dewell@adelphia.net)
Date: Sun Feb 04 2007 - 14:18:42 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > I wanted to signal this because I noted that what was a technical note
    > is now displayed as a draft for becoming a UTS (with differences
    > emphasized, notably for the conformance requirement). I had read this
    > doc a long time ago, and there was no such "draft" status so it was
    > not really a problem. Also the licensing issue is still not resolved.
    > So the final draft should be listed in the Public Review page to fix
    > the exact wording.
    >
    > The main risk is caused by the ambiguity of the sentence, which does
    > not indicate that it really encodes the code point U+FEFF normally
    > (i.e. it changes the current state), and which does not specify
    > whether the leading BOM is required or optional.

    I'm sure it would not be difficult to edit Section 2.5 to explain this,
    something like:

    "An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.
    Note that adding or stripping an initial U+FEFF generally requires the
    next code point above U+0020 to be re-encoded."
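
The byte sequence and the re-encoding effect described in that proposed wording can be checked with a small encoder. The Python below is a minimal sketch, not the reference implementation: its constants are transcribed from the published BOCU-1 design and should be verified against the specification, and the names `bocu1_prev`, `pack_diff`, `bocu1_encode`, and the `reset_after_bom` flag are hypothetical, chosen for illustration only.

```python
# Illustrative BOCU-1 encoder sketch (assumption: constants match the spec).
ASCII_PREV = 0x40      # initial state; also restored by FF and by C0 controls
MIDDLE = 0x90          # lead byte for a difference of zero
RESET = 0xFF
TRAIL_COUNT = 243      # number of distinct trail-byte values
TRAIL_OFFSET = 13      # trail value t >= 20 maps to byte t + 13
# Trail values 0..19 map onto the C0 control bytes that are not reserved.
TRAIL_CONTROLS = bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x10, 0x11, 0x12, 0x13,
                        0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1C, 0x1D, 0x1E, 0x1F])
REACH_POS_1, REACH_NEG_1 = 63, -64
REACH_POS_2, REACH_NEG_2 = 10512, -10513
REACH_POS_3, REACH_NEG_3 = 187659, -187660
START_POS_2, START_POS_3, START_POS_4 = 0xD0, 0xFB, 0xFE
START_NEG_2, START_NEG_3, START_NEG_4 = 0x50, 0x25, 0x22


def bocu1_prev(c):
    """Return the encoder state after code point c has been written."""
    if 0x3040 <= c <= 0x309F:          # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:          # CJK ideographs
        return 0x4E00 - REACH_NEG_2
    if 0xAC00 <= c <= 0xD7A3:          # Hangul syllables
        return (0xD7A3 + 0xAC00) // 2
    return (c & ~0x7F) + ASCII_PREV    # middle of the 128-code-point block


def pack_diff(diff):
    """Encode a signed code point difference as one to four bytes."""
    if REACH_NEG_1 <= diff <= REACH_POS_1:
        return bytes([MIDDLE + diff])
    if diff > 0:
        if diff <= REACH_POS_2:
            diff, lead, count = diff - (REACH_POS_1 + 1), START_POS_2, 1
        elif diff <= REACH_POS_3:
            diff, lead, count = diff - (REACH_POS_2 + 1), START_POS_3, 2
        else:
            diff, lead, count = diff - (REACH_POS_3 + 1), START_POS_4, 3
    else:
        if diff >= REACH_NEG_2:
            diff, lead, count = diff - REACH_NEG_1, START_NEG_2, 1
        elif diff >= REACH_NEG_3:
            diff, lead, count = diff - REACH_NEG_2, START_NEG_3, 2
        else:
            diff, lead, count = diff - REACH_NEG_3, START_NEG_4, 3
    trail = []
    for _ in range(count):
        # Floor division keeps the remainder in 0..242 for negative values too.
        diff, m = divmod(diff, TRAIL_COUNT)
        trail.append(m + TRAIL_OFFSET if m >= 20 else TRAIL_CONTROLS[m])
    return bytes([lead + diff] + trail[::-1])


def bocu1_encode(text, reset_after_bom=False):
    """Encode text; optionally emit an FF reset right after an initial U+FEFF."""
    out = bytearray()
    prev = ASCII_PREV
    for i, ch in enumerate(text):
        c = ord(ch)
        if c <= 0x20:
            if c != 0x20:              # space leaves the state unchanged
                prev = ASCII_PREV
            out.append(c)
        else:
            out += pack_diff(c - prev)
            prev = bocu1_prev(c)
        if reset_after_bom and i == 0 and c == 0xFEFF:
            out.append(RESET)
            prev = ASCII_PREV
    return bytes(out)


print(bocu1_encode("\ufeff").hex(" "))    # fb ee 28
print(bocu1_encode("A").hex(" "))         # 91
print(bocu1_encode("\ufeffA").hex(" "))   # the bytes for "A" are no longer 91
print(bocu1_encode("\ufeffA", reset_after_bom=True).hex(" "))  # fb ee 28 ff 91
```

In this sketch the bytes for "A" differ depending on whether U+FEFF precedes it, which is why adding or stripping the signature is not a plain byte-level prefix operation unless an FF reset follows it.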

    > If encoding the reset byte FF is not recommended, then the leading BOM
    > should not be recommended either, because this is a concatenation of
    > an unrelated substring to the text. That is why I think that, in that
    > case, the BOM, if used, should be followed by a RESET byte,
    > even if the rest of the document does not use any RESET byte.

    FF resets can also improve compression, particularly when a character
    beyond U+2980 is followed by a Basic Latin character. If I were a legal
    stakeholder in the BOCU project, I would have taken the italicized
    passage in Section 2.4:

    "Using FF to reset the state breaks the ordering and the deterministic
    encoding! The use of FF resets is discouraged."

    and added:

    "... in applications where these features are more important than
    optimum compression."
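
The compression point can be illustrated by counting bytes only. The Python below is a rough sketch under the assumption that the difference ranges are as published for BOCU-1 (one byte for -64..63, two bytes for -10513..10512, three bytes for -187660..187659, four otherwise); the helper names `bocu1_prev`, `diff_length`, and `cost_after` are hypothetical.

```python
# Sketch: cost of a Basic Latin character after a distant character,
# with and without an FF reset (assumption: ranges match the BOCU-1 spec).
ASCII_PREV = 0x40
REACHES = ((-64, 63), (-10513, 10512), (-187660, 187659))


def bocu1_prev(c):
    """Return the encoder state after code point c has been written."""
    if 0x3040 <= c <= 0x309F:          # Hiragana
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:          # CJK ideographs
        return 0x4E00 + 10513
    if 0xAC00 <= c <= 0xD7A3:          # Hangul syllables
        return (0xD7A3 + 0xAC00) // 2
    return (c & ~0x7F) + ASCII_PREV    # middle of the 128-code-point block


def diff_length(diff):
    """Number of bytes needed to encode a signed difference."""
    for n, (lo, hi) in enumerate(REACHES, start=1):
        if lo <= diff <= hi:
            return n
    return 4


def cost_after(prev_char, next_char):
    """Return (bytes without reset, bytes with an FF reset first)."""
    c = ord(next_char)
    plain = diff_length(c - bocu1_prev(ord(prev_char)))
    with_reset = 1 + diff_length(c - ASCII_PREV)   # FF itself costs one byte
    return plain, with_reset


print(cost_after("\u0416", "A"))   # (2, 2)  Cyrillic, then Latin: no gain
print(cost_after("\u297f", "~"))   # (2, 2)  just below U+2980: no gain
print(cost_after("\u2980", "~"))   # (3, 2)  at U+2980: the reset saves a byte
print(cost_after("\u3042", "A"))   # (3, 2)  Hiragana, then Latin
```

Under these assumptions, U+2980 starts the first block after which even U+007E needs three bytes without a reset, so FF plus a single byte is shorter from there on.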

    To me these are all implementation details and can be easily worked out,
    whereas the patent encumbrance is a showstopper.

    --
    Doug Ewell  *  Fullerton, California, USA  *  RFC 4645  *  UTN #14
    http://users.adelphia.net/~dewell/
    http://www1.ietf.org/html.charters/ltru-charter.html
    http://www.alvestrand.no/mailman/listinfo/ietf-languages
    

