RE: Problem with SSI and BOM

From: Mark Cilia Vincenti (mark@gfi.com)
Date: Wed Sep 27 2006 - 01:21:54 CST

    It *is* a problem, because we are using SSI (server-side include) tags
    on IIS (Windows' web server), which doesn't allow for a conversion
    filter. There are no configuration settings for this, so unless someone
    has written a replacement DLL that removes the BOM, I have no way to
    strip it from the body when it is present in the template files.

    HTML conformance is only secondary. The main problem is that the page is
    not being displayed properly.

    Best Regards,

    Mark Cilia Vincenti - Internal Developer - Marketing
    GFI Software - www.gfi.com

    -----Original Message-----
    From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    Sent: 26 September 2006 11:09 PM
    To: Mark Cilia Vincenti; Addison Phillips; Jukka K. Korpela
    Cc: unicode@unicode.org
    Subject: Re: Problem with SSI and BOM

    From: "Mark Cilia Vincenti" <mark@gfi.com>
    > Conclusion: the BOM is important to have. Some text editors, e.g.
    > Notepad, don't even allow you to save the file without it. But the BOM
    > inside HTML code is also bad as it's putting in empty lines each time.
    > I'm just wondering if there's a way I can apply the includes with some
    > other means that recognises the BOM and doesn't include it as well.

    I don't see the presence of a BOM as a severe problem for HTML or XML:
    * if you are building the HTML file by combining several plain-text
    sources, you must already use a conversion filter for some characters
    like "<" and "&" (or like "]]>" if you convert your plain text into a
    CDATA section of an anonymous text element); once you realize that a
    conversion filter is necessary, and that the filter may need to be
    contextual (for CDATA), I really can't see what the problem is with
    stripping a leading BOM from a plain-text file.
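
The conversion filter described above could look something like the following. This is only a minimal sketch, assuming the included sources are UTF-8 plain text; the function name `include_plain_text` is hypothetical and is not something IIS or SSI provides.

```python
import codecs
import html


def include_plain_text(raw: bytes) -> str:
    """Turn the bytes of a UTF-8 plain-text file into an HTML-safe fragment."""
    # Drop a leading UTF-8 BOM (bytes EF BB BF) if one is present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    text = raw.decode("utf-8")
    # Escape "&", "<" and ">" so the text cannot be read as markup.
    return html.escape(text, quote=False)


fragment = include_plain_text(b"\xef\xbb\xbfTom & Jerry <3")
print(fragment)
```

Since the filter already has to touch every included file to escape markup characters, stripping the BOM adds one extra check at the same place.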

    The only problem I see is for HTML/XML conformance: the BOM, if
    interpreted as a character, would violate the document structure, as it
    would be a text element before anything else (including before the XML
    declaration, or the document's root element); for this reason, XML
    parsers detect the presence of the BOM and use it as a hint about the
    UTF encoding (if the encoding is not yet known when the parser is
    instantiated), and this code point is discarded (not fed to the
    XML/HTML parser).
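
The detect-then-discard step can be illustrated as follows. This is a rough sketch and not the code of any particular XML parser; `sniff_bom` and `_BOMS` are hypothetical names, and a real parser would also consult the XML declaration and any external encoding information.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding hint or None, bytes with the BOM removed)."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            # The BOM is used only as a hint and is not passed on.
            return encoding, raw[len(bom):]
    return None, raw


encoding, body = sniff_bom(b"\xef\xbb\xbf<?xml version='1.0'?><r/>")
print(encoding, body[:5])
```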

    But if you have what looks like a UTF-* BOM in a document to be parsed
    with a non-UTF-* encoding, it is no longer a BOM, but the encoding of
    some character(s) at the beginning of the document. When you detect
    the XML declaration, if it specifies a non-UTF encoding, the document
    must be parsed again from the beginning, and then there will be no BOM
    to discard; the document will then be non-conforming according to XML,
    because an anonymous text element occurs before the (optional) XML
    declaration or even before the root element!
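
A small illustration of that point, assuming ISO-8859-1 as the non-UTF encoding: the same three bytes that form the UTF-8 BOM decode to three ordinary characters (ï, », ¿), which then sit in front of the markup as text.

```python
raw = b"\xef\xbb\xbf<html>"

# Read as UTF-8 with BOM handling: the signature is recognised and dropped.
as_utf8 = raw.decode("utf-8-sig")

# Read as ISO-8859-1: the same bytes are three real characters.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_utf8))
print([hex(ord(c)) for c in as_latin1[:3]])
```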

    In all XML parsers that I have seen, the presence of the BOM at the
    beginning of a UTF-* encoded document is not parsed as a character of a
    text element, and is accepted even before the XML declaration. This is
    convenient because it allows editing XML/HTML files with plain-text
    editors that most often insert a BOM when saving files using a UTF-8 or
    UTF-16 encoding.

    I personally consider the BOM to be of great interest, notably since
    ZWNBSP is no longer used as a character and another character (U+2060
    WORD JOINER) has been defined for the same text semantic.

    I think it's high time to consider ZWNBSP as a fully ignorable
    character, even in the middle of the text (consider it like padding
    nulls in old serial communication protocols), whose role is clearly to
    serve as a byte order mark and to detect the encoding actually used. So
    a process should be free to add or remove any occurrence of U+FEFF in a
    text stream without having to interpret it as a possible character.
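
Under that proposal, a process that concatenates included files could simply drop every U+FEFF it sees. The sketch below assumes the stream has already been decoded to text; `drop_feff` is a hypothetical helper, and treating U+FEFF as ignorable in the middle of text is the suggestion made here rather than established behaviour.

```python
from typing import Iterable, Iterator

BOM = "\ufeff"  # U+FEFF, ZERO WIDTH NO-BREAK SPACE / byte order mark


def drop_feff(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each decoded text chunk with every U+FEFF removed."""
    for chunk in chunks:
        yield chunk.replace(BOM, "")


# Two files that each carry their own leading BOM, followed by a third chunk.
stream = ["\ufeff<html>", "\ufeff<p>included</p>", "</html>"]
print("".join(drop_feff(stream)))
```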

      



    This archive was generated by hypermail 2.1.5 : Wed Sep 27 2006 - 01:27:39 CST