UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 03 2007 - 07:23:13 CST

    UTS#40 (BOCU-1) contains an ambiguity about the effect of the Signature Byte Sequence (paragraph 2.5). It says only this:

    An initial U+FEFF is encoded in BOCU-1 with the three bytes FB EE 28.

    This is correct if the leading BOM is treated as a significant code point that is part of the text and is encoded with the normal BOCU-1 algorithm. However, the paragraph does not state what effect this encoded sequence has on the current state of the encoder and of the decoder.

    This may change how the bytes after the BOM are interpreted, unless the next byte is a C0 control (00 to 1F), which is encoded as itself and resets the state. The initial state is normally prev=0x40, but under the normal BOCU-1 algorithm this sequence changes the state to prev=0xFEC0, according to rule R5 Adjustment: "d. Otherwise, set prev to the middle of a 128-block: prev=(c&~0x7F)+0x40."
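
    A minimal sketch of the state adjustment, to make the 0xFEC0 value concrete. The function name is hypothetical, and the special-case ranges and values (Hiragana, CJK ideographs, Hangul) are written as recalled from the BOCU-1 specification, so they should be checked against it; only the last line (rule d) matters for U+FEFF.

```python
INITIAL_PREV = 0x40  # BOCU-1 initial state

def bocu1_next_prev(prev: int, c: int) -> int:
    """Return the state 'prev' after code point c has been encoded or decoded."""
    if c < 0x20:
        return INITIAL_PREV          # C0 controls reset the state
    if c == 0x20:
        return prev                  # SPACE leaves the state unchanged
    if 0x3040 <= c <= 0x309F:
        return 0x3070                # Hiragana (assumed value, see lead-in)
    if 0x4E00 <= c <= 0x9FA5:
        return 0x7711                # CJK ideographs (assumed value)
    if 0xAC00 <= c <= 0xD7A3:
        return 0xC1D1                # Hangul syllables (assumed value)
    return (c & ~0x7F) + 0x40        # rule d: middle of the 128-block

after_bom = bocu1_next_prev(INITIAL_PREV, 0xFEFF)
print(hex(after_bom))  # 0xfec0
```

    So an encoder or decoder that processes the signature with the normal algorithm continues from prev=0xFEC0, not from prev=0x40.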

    A warning should be added so that the leading BOM is not blindly removed or added without also correcting, where necessary, the bytes that follow it (possibly a large number!) when they do not begin with a C0 control character.

    A difference is possible, for example, if the first codepoint (after the BOM) to encode is:
    * in the range U+FE80 to U+FEFF, because it will be encoded as a single byte (from state prev=0xFEC0) instead of 3 bytes (from the initial state prev=0x40).
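
    The difference can be reproduced with a small encoder. This is only a sketch, not a reference implementation: the function names are hypothetical, the constants are written from memory of the BOCU-1 specification and of ICU's converter, and it assumes that the signature is encoded with the normal algorithm (so that it changes prev).

```python
# Control bytes that may be used as trail bytes (values 0..19 of a trail "digit").
TRAIL_CONTROLS = [0x01, 0x02, 0x03, 0x04, 0x05, 0x06,
                  0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19,
                  0x1C, 0x1D, 0x1E, 0x1F]
TRAIL_COUNT = 243  # number of distinct trail byte values

def trail_byte(t):
    return TRAIL_CONTROLS[t] if t < 20 else t + 13

def pack_diff(diff):
    """Encode one signed difference as 1 to 4 bytes."""
    if -64 <= diff <= 63:
        return bytes([0x90 + diff])
    if diff > 0:
        if diff <= 10512:
            diff, lead, count = diff - 64, 0xD0, 1
        elif diff <= 187659:
            diff, lead, count = diff - 10513, 0xFB, 2
        else:
            diff, lead, count = diff - 187660, 0xFE, 3
    else:
        if diff >= -10513:
            diff, lead, count = diff + 64, 0x50, 1
        elif diff >= -187660:
            diff, lead, count = diff + 10513, 0x25, 2
        else:
            diff, lead, count = diff + 187660, 0x22, 3
    trails = []
    for _ in range(count):
        diff, m = divmod(diff, TRAIL_COUNT)  # floor division, so 0 <= m < 243
        trails.append(trail_byte(m))
    return bytes([lead + diff] + trails[::-1])

def next_prev(prev, c):
    """State adjustment; special-case values are assumptions to be checked."""
    if c < 0x20:
        return 0x40
    if c == 0x20:
        return prev
    if 0x3040 <= c <= 0x309F:
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:
        return 0xC1D1
    return (c & ~0x7F) + 0x40

def bocu1_encode(code_points):
    prev, out = 0x40, bytearray()
    for c in code_points:
        if c <= 0x20:
            out.append(c)                # controls and SPACE are encoded as themselves
        else:
            out += pack_diff(c - prev)
        prev = next_prev(prev, c)
    return bytes(out)

with_bom = bocu1_encode([0xFEFF, 0xFE80])
without_bom = bocu1_encode([0xFE80])
print(with_bom[:3].hex())    # fbee28
print(len(with_bom) - 3)     # 1
print(len(without_bom))      # 3
```

    With these assumptions, U+FE80 takes one byte after the signature and three bytes without it, so the signature bytes cannot simply be cut off or prepended. The reverse happens for a letter such as U+0041: one byte from the initial state, three bytes after the signature.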

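    In fact, the only case where blindly removing an encoded leading BOM from a BOCU-1 encoded stream, or adding one to it, is safe is when the first significant code point is a C0 control (U+0000 to U+001F), since these are the only code points that are encoded as themselves and reset the state to prev=0x40. Other ASCII characters are encoded as differences from prev, so their bytes change. If the first character is a SPACE (U+0020), you have to look up the next code point and test it in the same way, because the space does not alter the current state and so does not reset it to prev=0x40!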
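
    A minimal sketch of that look-ahead test at the byte level. It assumes that the signature was produced by the normal algorithm (so the state after it is prev=0xFEC0), that only the bytes 00 to 1F (C0 controls, encoded as themselves) and the reset byte FF return the state to prev=0x40, and that 20 (SPACE) leaves it unchanged. All names are hypothetical.

```python
BOCU1_SIGNATURE = b"\xFB\xEE\x28"  # U+FEFF encoded from the initial state
BOCU1_RESET = 0xFF                 # reset byte: back to prev=0x40 (assumption, see lead-in)

def payload_ignores_initial_state(payload: bytes) -> bool:
    """True if the bytes after the signature decode the same from any 'prev'."""
    for b in payload:
        if b == 0x20:
            continue               # SPACE: state unchanged, test the next byte
        # The first non-space byte is a lead byte: it is state-independent
        # only if it is a C0 control or the reset byte.
        return b < 0x20 or b == BOCU1_RESET
    return True                    # empty, or nothing but spaces

def strip_signature_if_safe(data: bytes) -> bytes:
    """Remove the signature only when the following bytes need no correction."""
    if not data.startswith(BOCU1_SIGNATURE):
        return data
    payload = data[len(BOCU1_SIGNATURE):]
    if payload_ignores_initial_state(payload):
        return payload
    raise ValueError("the bytes after the signature must be re-encoded")

print(strip_signature_if_safe(b"\xFB\xEE\x28\x20\x0A\x91"))  # b' \n\x91'
```

    The same test applies before adding a signature to a stream that has none: if it fails, the first code point that is not a space has to be re-encoded relative to prev=0xFEC0.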

    This paragraph also does not state clearly whether the leading BOM is:
    * mandatory (I think it is not, as in the other UTFs)
    * optional but recommended (as with UTF-16)
    * optional but not recommended (as with UTF-8 or UTF-32)
    * forbidden (as with UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE), so that an initial U+FEFF is always preserved as a significant character (I think this is not the case, otherwise paragraph 2.5 would not be present)

    This bug has been reported using the report form.


