RE: Problem with SSI and BOM

From: Mark Cilia Vincenti (mark@gfi.com)
Date: Wed Sep 27 2006 - 01:21:54 CST

Next message: Stephane Bortzmeyer: "Re: Unicode & space in programming & l10n"

Previous message: Jefsey_Morfin: "Re: Unicode & space in programming & l10n"
Maybe in reply to: Mark Cilia Vincenti: "Problem with SSI and BOM"
Next in thread: Philippe Verdy: "Re: Problem with SSI and BOM"
Reply: Philippe Verdy: "Re: Problem with SSI and BOM"

It *is* a problem, because we are using SSI (server-side include)
directives on IIS (Microsoft's web server), which doesn't allow for a
conversion filter. There are no configuration settings for this, so
unless someone has written a replacement DLL that removes the BOM, I
have no way to strip it from the body when it is present in the
template files.

HTML conformance is only secondary. The main problem is that the page is
not being displayed properly.

Best Regards,

Mark Cilia Vincenti - Internal Developer - Marketing
GFI Software - www.gfi.com

-----Original Message-----
From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
Sent: 26 September 2006 11:09 PM
To: Mark Cilia Vincenti; Addison Phillips; Jukka K. Korpela
Cc: unicode@unicode.org
Subject: Re: Problem with SSI and BOM

From: "Mark Cilia Vincenti" <mark@gfi.com>
> Conclusion: the BOM is important to have. Some text editors, e.g. Notepad,
> don't even allow you to save the file without it. But the BOM inside
> HTML code is also bad as it's putting in empty lines each time. I'm just
> wondering if there's a way I can apply the includes with some other
> means that recognises the BOM and doesn't include it as well.

I don't see the presence of a BOM as a severe problem for HTML or XML:
* if you are building the HTML file by combining several plain-text
sources, you must already use a conversion filter for some characters
like "<" and "&" (or for "]]>" if you convert your plain text into a
CDATA section of an anonymous text element); once you realize that a
conversion filter is necessary, and that the filter may need to be
contextual (for CDATA), I really can't see what the problem is with
stripping a leading BOM from a plain-text file.
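
The kind of conversion filter described above might look like the following. This is only a minimal sketch in Python, assuming the included text has already been decoded to a string; the function names include_plain_text and as_cdata are hypothetical and not part of any SSI implementation.

```python
import html

BOM = "\ufeff"


def _strip_leading_bom(text: str) -> str:
    """Remove a single U+FEFF at the very start of the text, if present."""
    return text[len(BOM):] if text.startswith(BOM) else text


def include_plain_text(text: str) -> str:
    """Strip a leading BOM, then escape markup characters for HTML content."""
    # html.escape rewrites "&", "<" and ">"; quote=False leaves quotes alone.
    return html.escape(_strip_leading_bom(text), quote=False)


def as_cdata(text: str) -> str:
    """Strip a leading BOM, then wrap the text in a CDATA section."""
    # "]]>" would end the section early, so it is split across two sections.
    body = _strip_leading_bom(text).replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + body + "]]>"
```

The two functions differ only in the escaping step, which is the contextual part of the filter; the BOM handling is the same in both.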

The only problem I see is for HTML/XML conformance: the BOM, if
interpreted as a character, would violate the document structure, as it
would be a text element before anything else (including before the XML
declaration or the document's root element); for this reason, XML
parsers detect the presence of the BOM and use it as a hint regarding
a UTF encoding (if the encoding is not yet known when the parser is
instantiated), and this code point is discarded (not fed into the
XML/HTML parser).
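
As a rough illustration of that detection step, here is a sketch that inspects the first bytes of a document, reports the encoding the BOM suggests, and returns the data with the BOM removed. The function name sniff_bom and the default of UTF-8 are assumptions made for this example; real parsers combine this hint with other information such as the XML declaration.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding hint, data without its BOM)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # The BOM is consumed here and never reaches the parser.
            return encoding, data[len(bom):]
    # No BOM: fall back to the assumed default and leave the data untouched.
    return default, data
```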

But if you have something that looks like a UTF-* BOM in a document to be
parsed with a non-UTF-* encoding, it is no longer a BOM, but the encoding
of some character(s) at the beginning of the document. When you detect
the XML declaration, if it specifies a non-UTF encoding, the document
must be parsed again from the beginning, and then there will be no BOM to
discard; the document will then be non-conforming according to XML,
because an anonymous text element occurs before the (optional) XML
declaration or even before the root element!
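
To make the point concrete, the short Python sketch below decodes the same bytes twice. The sample data is invented for illustration; "utf-8-sig" is Python's name for UTF-8 decoding that drops a leading BOM.

```python
# The three bytes EF BB BF followed by some markup.
data = b"\xef\xbb\xbf<html>"

# Read as UTF-8, the three bytes are a BOM and are dropped.
as_utf8 = data.decode("utf-8-sig")

# Read as ISO-8859-1, the same bytes are three ordinary characters
# (U+00EF, U+00BB, U+00BF) sitting in front of the markup.
as_latin1 = data.decode("iso-8859-1")

print(repr(as_utf8))
print(repr(as_latin1))
```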

In all XML parsers that I have seen, the presence of the BOM at the
beginning of a UTF-* encoded document is not parsed as a character of a
text element, and is accepted even before the XML declaration. This is
convenient because it allows editing XML/HTML files with plain-text
editors that most often insert a BOM when saving files using a UTF-8 or
UTF-16 encoding.

I personally consider that the BOM is of great value, notably since
ZWNBSP is no longer used as a character and another character (U+2060
WORD JOINER) has been defined for the same text semantics.

I think it's high time to consider ZWNBSP as a fully ignorable
character, even in the middle of the text (consider it like padding
nulls in old serial communication protocols), whose role is clearly to be
used as a byte order mark and for the detection of the encoding
actually used. So a process should be free to add or remove any
occurrence of U+FEFF in a text stream without having to interpret it
as a possible character.
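
Under that proposal, a process that removes every U+FEFF from a stream could be as small as the sketch below. It illustrates the suggestion made here, not a rule taken from the Unicode Standard, and the name drop_feff is hypothetical. It assumes the stream has already been decoded into chunks of text.

```python
from typing import Iterable, Iterator


def drop_feff(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each chunk of decoded text with every U+FEFF removed."""
    for chunk in chunks:
        # U+FEFF is a single code point, so it can never be split across
        # two chunks of already-decoded text.
        yield chunk.replace("\ufeff", "")
```

Joining the output of drop_feff over concatenated include files would then give one text with no stray U+FEFF at the seams.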



This archive was generated by hypermail 2.1.5 : Wed Sep 27 2006 - 01:27:39 CST