Re: Problem with SSI and BOM

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 26 2006 - 15:09:22 CST


    From: "Mark Cilia Vincenti" <mark@gfi.com>
    > Conclusion: the BOM is important to have. Some text editors eg Notepad
    > don't even allow you to save the file without it. But the BOM inside
    > HTML code is also bad as it's putting in empty lines each time. I'm just
    > wondering if there's a way I can apply the includes with some other
    > means that recognises the BOM and doesn't include it as well.

    I don't see the presence of a BOM as a severe problem for HTML or XML:
    * If you are building the HTML file by combining several plain-text sources, you must already use a conversion filter for characters such as "<" and "&" (or for sequences such as "]]>" if you convert the plain text into a CDATA section of an anonymous text element). Once you accept that a conversion filter is necessary, and that it may need to be contextual (for CDATA), I really can't see what the problem is with also stripping a leading BOM from each plain-text file.
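
    As a rough illustration of such a filter, here is a minimal Python sketch. It assumes the included fragment is UTF-8 plain text destined for ordinary element content (not CDATA); the function name include_fragment is made up for this example and is not part of any SSI implementation.

```python
import codecs
import html


def include_fragment(raw: bytes) -> str:
    """Turn a UTF-8 plain-text fragment into text that is safe to embed in HTML.

    Hypothetical helper: drops one leading UTF-8 BOM if present, then
    escapes the characters that are special in element content.
    """
    # A leading BOM is an encoding signature, not part of the text.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")
    # quote=False: escape only "&", "<" and ">", as needed in element content.
    return html.escape(text, quote=False)
```

    For example, include_fragment(b"\xef\xbb\xbfa < b") yields the string "a &lt; b", with no trace of the BOM.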

    The only problem I see concerns HTML/XML conformance: the BOM, if interpreted as a character, would violate the document structure, because it would be text occurring before anything else (including the XML declaration and the document's root element). For this reason, XML parsers detect the presence of the BOM, use it as a hint for a UTF encoding (if the encoding is not yet known when the parser is instantiated), and discard the code point (it is not fed to the XML/HTML parser).
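
    The detection step described above can be sketched as follows. This is only an illustration of the idea, not how any particular XML parser is implemented; the name sniff_bom and the UTF-8 default are assumptions made for the example.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM begins with the
# same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes, default: str = "utf-8"):
    """Return (encoding_hint, remaining_bytes).

    Hypothetical helper: if the input starts with a known BOM, the BOM
    is used as the encoding hint and removed, so it never reaches the
    parser as a character. Otherwise the input is returned unchanged
    together with the assumed default encoding.
    """
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return encoding, raw[len(bom):]
    return default, raw
```

    A real parser would still have to reconcile this hint with the encoding named in the XML declaration, which is where the case discussed next comes in.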

    But if a document that is to be parsed with a non-UTF-* encoding starts with something that looks like a UTF-* BOM, it is no longer a BOM but the encoding of some character(s) at the beginning of the document. When you reach the XML declaration and it specifies a non-UTF encoding, the document must be parsed again from the beginning; this time there is no BOM to discard, and the document is non-conforming XML, because text occurs before the (optional) XML declaration, or even before the root element!

    In all the XML parsers that I have seen, a BOM at the beginning of a UTF-* encoded document is not parsed as a character of a text element and is accepted even before the XML declaration. This is convenient because it allows XML/HTML files to be edited with plain-text editors, which most often insert a BOM when saving files in UTF-8 or UTF-16.

    I personally consider the BOM very useful, notably now that ZWNBSP is no longer used as a character and another character (U+2060 WORD JOINER) has been defined for the same text semantics.

    I think it's high time to consider ZWNBSP a fully ignorable character, even in the middle of a text (think of it like the padding nulls in old serial communication protocols), whose role is clearly to serve as a byte order mark and to allow detection of the encoding actually used. A process should therefore be free to add or remove any occurrence of U+FEFF in a text stream without having to interpret it as a possible character.
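
    Under that proposal (which is my suggestion, not something the standard currently requires), a process concatenating already-decoded text could simply drop every U+FEFF it meets. A minimal Python sketch, with the hypothetical name drop_feff:

```python
from typing import Iterable, Iterator

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE / byte order mark


def drop_feff(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the decoded text chunks with every U+FEFF removed.

    Hypothetical helper: treats U+FEFF as fully ignorable wherever it
    occurs, at the start of a chunk or in the middle of it.
    """
    for chunk in chunks:
        cleaned = chunk.replace(BOM, "")
        if cleaned:  # skip chunks that contained nothing but U+FEFF
            yield cleaned
```

    Applied to the decoded contents of several included files, this would remove the stray BOMs left at each junction without touching any other character.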


