Re: Problem with SSI and BOM

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Sep 22 2006 - 21:41:51 CDT


    From: "Jukka K. Korpela" <jkorpela@cs.tut.fi>
    > On Fri, 22 Sep 2006, Mark Cilia Vincenti wrote:
    >
    >> I'm using SSI to include UTF-8 encoded files within a UTF-encoded
    >> HTML page on IIS (Internet Information Services). The problem is that
    >> the byte order mark is not being stripped by the SSI parser,
    >> resulting in BOMs within the HTML body.
    >
    > Can't you just remove the BOM? It's not needed in UTF-8 encoded data.

    I tend to agree: blindly embedding the UTF-8 text as is, without applying a special encapsulation filter, may result in HTML (or XML...) violations of that format's own higher-level syntax.

    As soon as you realize this, you need a filter. When writing it, it's quite simple to test for the presence of a leading BOM in the text to encapsulate (unreading the character if it's not a BOM) before applying the rest of the encapsulation, where you'll need to detect occurrences of "<" and "&" in the UTF-8 text. If you choose instead to encapsulate it using "/*<![CDATA[*/ ... /*]]>*/", you'll basically just need to detect "]]>", which is rarer. But don't forget that the UTF-8 text may also contain unwanted controls that are forbidden in HTML/XML data, and that HTML/XML treats several distinct encodings of newlines as if they were a single LF control, so extra filtering may be needed if you want to preserve the exact sequence of code points.
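
A minimal sketch of such a filter, in Python, might look like the following. The function names (strip_leading_bom, escape_markup, wrap_cdata, encapsulate) are hypothetical and chosen only for illustration; this is not what IIS or any SSI processor does. It assumes the included file is well-formed UTF-8, and it does not handle forbidden control characters or newline normalization, which would need extra steps.

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF


def strip_leading_bom(text: str) -> str:
    """Drop a single leading U+FEFF if present; otherwise return text unchanged."""
    return text[1:] if text.startswith(BOM) else text


def escape_markup(text: str) -> str:
    """Escape '&' first, then '<', so the inserted entities are not re-escaped."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any ']]>' across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def encapsulate(raw: bytes) -> str:
    """Decode an included UTF-8 file, drop a leading BOM, and escape markup."""
    # The plain "utf-8" codec keeps a leading BOM as U+FEFF (unlike "utf-8-sig"),
    # so the BOM test stays explicit here. Malformed input raises UnicodeDecodeError.
    text = raw.decode("utf-8")
    return escape_markup(strip_leading_bom(text))
```

For example, encapsulate(b"\xef\xbb\xbfa < b & c") returns "a &lt; b &amp; c", and wrap_cdata("x]]>y") returns "<![CDATA[x]]]]><![CDATA[>y]]>". Only a BOM at the very start is removed; a U+FEFF elsewhere in the text is left alone.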

    This is not specified in the Unicode standard; refer to the higher-level protocol for how to encapsulate arbitrary text in an HTML/XML text element.


