Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Feb 03 2007 - 08:08:28 CST

    I see absolutely no practical use or need for the RESET byte anywhere in a BOCU-1 encoded stream, except possibly to provide resynchronization points at regular intervals within the stream (to limit how far a reader must scan backward when accessing a file at a random position). For example, you could emit a RESET byte each time the stream crosses a 512-byte block boundary, i.e. a single sector on disk, to avoid extra hardware I/O delays.

    Besides this special use, the RESET code allows no better compression. It should not be recommended in interchanged files (such as MIME), because it breaks the design goal of allowing code points to be compared in binary order without decoding the stream. Its use should be limited to large plain-text documents containing very few or no ASCII characters, not even general-purpose punctuation (some Han or Tibetan texts, for example). It should not be used in XML or HTML files (except after the leading BOM).

    What do you think about it? Are there other things that I have missed?

    Note that this is a more serious issue because it affects a UTS, i.e. a standard document, not a simple technical note, so the conformance requirement is an important part of this document.

    ----- Original Message -----
    From: "Philippe Verdy" <verdy_p@wanadoo.fr>
    To: "Philippe Verdy" <verdy_p@wanadoo.fr>; <unicode@unicode.org>
    Sent: Saturday, February 03, 2007 2:48 PM
    Subject: Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

    Note that because of this ambiguity, it should be recommended to encode the special RESET byte (FF) after the leading BOM bytes, i.e. to write the leading BOM as: FB EE 28 FF
    This way, the state after the BOM is reset to the initial value prev=0x0040!

    I also think that the sentence describing the 3-byte sequence should explicitly say that it encodes the difference 0xFEFF - 0x0040 = 0xFEBF using the base-243 encoding.
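
The arithmetic behind FB EE 28 can be checked with a short Python sketch. It is not the normative algorithm: the constants (lead byte 0xFB for positive three-byte differences, 243 trail values, trail offset 13, two-byte reach 10512) and the trail-control table are quoted from memory of the BOCU-1 sample code and should be verified against UTS #40. The function name `encode_pos3` is hypothetical, and it covers only positive differences in the three-byte range.

```python
# Assumed BOCU-1 constants (recalled from the sample code; verify against UTS #40).
TRAIL_COUNT = 243            # number of distinct trail-byte values ("base 243")
TRAIL_BYTE_OFFSET = 13       # trail digits 20..242 map to bytes 0x21..0xFF
REACH_POS_2 = 63 + 43 * 243  # largest difference that fits in two bytes: 10512
REACH_POS_3 = REACH_POS_2 + 3 * 243 * 243  # largest three-byte difference
START_POS_3 = 0xFB           # first lead byte for positive three-byte differences

# Trail digits 0..19 map to the C0 byte values that BOCU-1 allows as trail bytes.
TRAIL_CONTROLS = [
    0x01, 0x02, 0x03, 0x04, 0x05, 0x06,
    0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19,
    0x1C, 0x1D, 0x1E, 0x1F,
]


def trail_to_byte(digit):
    """Map a base-243 digit to its trail byte value."""
    if digit >= 20:
        return digit + TRAIL_BYTE_OFFSET
    return TRAIL_CONTROLS[digit]


def encode_pos3(diff):
    """Encode a positive difference that needs exactly three bytes."""
    if not (REACH_POS_2 < diff <= REACH_POS_3):
        raise ValueError("difference is outside the positive three-byte range")
    diff -= REACH_POS_2 + 1                  # rebase to the start of the range
    diff, low = divmod(diff, TRAIL_COUNT)    # least significant base-243 digit
    lead_offset, high = divmod(diff, TRAIL_COUNT)
    return bytes([START_POS_3 + lead_offset,
                  trail_to_byte(high),
                  trail_to_byte(low)])


# U+FEFF encoded from the initial state prev = 0x0040.
signature = encode_pos3(0xFEFF - 0x0040)
print(signature.hex(" "))  # fb ee 28
```

Under these assumptions the difference 0xFEBF (65215) is rebased to 54702, which is 225 * 243 + 27; the digits 225 and 27 give the trail bytes EE and 28.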

    Note that there's no provision elsewhere in the specification that indicates that U+FEFF resets the state (only the ASCII characters except SPACE, and the RESET code FF have this effect).
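
To make the state question concrete, here is a hedged Python sketch of the state-update rule as I recall it from the BOCU-1 sample code. The function name `next_prev` is hypothetical, and the script-specific ranges and values should be checked against UTS #40. It shows that, absent a special rule for the signature, encoding U+FEFF leaves prev at 0xFEC0 rather than at the initial 0x0040.

```python
ASCII_PREV = 0x40  # initial state, also the state restored by RESET (FF)


def next_prev(prev, c):
    """Return the state after encoding code point c (assumed BOCU-1 rule)."""
    if c <= 0x20:
        # SPACE keeps the current state; the C0 controls reset it.
        return prev if c == 0x20 else ASCII_PREV
    if 0x3040 <= c <= 0x309F:   # Hiragana: middle of the block
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:   # CJK Unihan: fixed value for the large block
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:   # Hangul syllables: middle of the block
        return 0xC1D1
    # Default: middle of the 128-code-point block that contains c.
    return (c & ~0x7F) + ASCII_PREV


after_bom = next_prev(ASCII_PREV, 0xFEFF)
print(hex(after_bom))                       # 0xfec0, not 0x40
print(hex(next_prev(after_bom, ord("<"))))  # 0x40: an ASCII character restores it
print(hex(next_prev(after_bom, 0x20)))      # 0xfec0: SPACE leaves it unchanged
```

If this recollection of the rule is right, a decoder that assumes prev=0x0040 after the signature and one that applies the general rule would disagree on the first non-ASCII character that follows, which is the ambiguity the explicit FF avoids.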


