RE: Problem with SSI and BOM

From: Mark Cilia Vincenti (mark@gfi.com)
Date: Tue Sep 26 2006 - 02:07:32 CST

    Thanks all for your answers. The email from Addison Phillips below
    summarizes everything neatly. I have 3 SSI includes, and each of them
    is breaking the page by putting in an empty line (tested under the
    latest versions of IE and Firefox).

    If the BOM wasn't being rendered, then it wouldn't have been a problem,
    but it is being rendered.

    Now, some of these SSIs will be edited by a number of users. I haven't
    yet found a text editor which always saves in UTF-8 AND without BOM, no
    matter what settings you have.

    Besides the fact that I would be limiting which editors users can use
    (and increasing the chance of human error), the BOM has a very
    important use. Every text editor I tried treats a file that contains
    only English-language (ASCII) characters and was saved without a BOM
    as an ANSI file. In fact, when saved, the two are byte-for-byte
    identical.
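
    To illustrate the point, here is a small Python sketch. It assumes
    that "ANSI" means the Windows-1252 code page (cp1252), which is the
    usual default on Western European Windows systems; the sample text is
    made up.

```python
# Assumption: "ANSI" here means Windows-1252 (cp1252).
text = "Plain English text, no accents."

as_ansi = text.encode("cp1252")
as_utf8_no_bom = text.encode("utf-8")
as_utf8_with_bom = text.encode("utf-8-sig")  # UTF-8 with a leading BOM

# ASCII-only content yields exactly the same bytes in both encodings,
# so an editor has nothing to tell the two files apart.
print(as_ansi == as_utf8_no_bom)      # True

# Only the three BOM bytes mark the file as UTF-8.
print(as_utf8_with_bom[:3].hex())     # efbbbf
```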

    This poses a big problem. Here's a scenario: the SSI file is saved with
    English language characters and without BOM. The user opens up the file
    in his favourite text editor. The text editor assumes the file is ANSI.
    The user proceeds to add characters with accents in them (e.g. the name
    of a French person), and re-saves the file. Now, since the text editor
    opened the file as ANSI, most likely it will assume you want to save it
    as ANSI as well, so the default saving format is going to be ANSI. The
    accented characters will then typically not be valid UTF-8 when the
    file is included in the UTF-8 page.
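
    The scenario can be reproduced with a short Python sketch. Again
    this assumes "ANSI" means Windows-1252 (cp1252); the sample strings
    and the helper name is_valid_utf8 are only for illustration.

```python
# Assumption: "ANSI" here means Windows-1252 (cp1252).
original = "Contact: Rene"   # ASCII only, saved as UTF-8 without BOM
edited = "Contact: René"     # the user adds an accented character

# The editor guessed ANSI on open, so it re-saves as ANSI.
saved_by_editor = edited.encode("cp1252")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(original.encode("utf-8")))  # True
print(is_valid_utf8(saved_by_editor))           # False

# What a UTF-8 consumer would show for the re-saved file.
print(saved_by_editor.decode("utf-8", errors="replace"))
```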

    Conclusion: the BOM is important to have. Some text editors, e.g.
    Notepad, don't even allow you to save a UTF-8 file without it. But a
    BOM inside the HTML body is also bad, as it puts in an empty line each
    time. I'm wondering whether there is some other way to apply the
    includes that recognises the BOM and leaves it out.
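
    One possible approach, sketched in Python, is to assemble the page
    in a step that decodes each include and drops a leading BOM before
    concatenating. This is not an IIS or SSI feature; the function name
    read_include and the file names are hypothetical, and the step would
    have to run outside the SSI parser (for example at publish time).

```python
import codecs
import tempfile
from pathlib import Path


def read_include(path: Path) -> str:
    """Read a UTF-8 include file, skipping a leading BOM if present."""
    # The "utf-8-sig" codec decodes UTF-8 and drops one leading BOM.
    return path.read_text(encoding="utf-8-sig")


# Demo with two made-up include files: one with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = Path(tmp) / "header.inc"
    without_bom = Path(tmp) / "footer.inc"
    with_bom.write_bytes(codecs.BOM_UTF8 + "<p>Hélène</p>".encode("utf-8"))
    without_bom.write_bytes("<p>footer</p>".encode("utf-8"))

    page = read_include(with_bom) + read_include(without_bom)

# No U+FEFF ends up in the assembled page body.
print("\ufeff" in page)  # False
```

    Users could then keep saving with a BOM in whatever editor they
    like, since the BOM is only removed when the files are combined.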

    Best Regards,

    Mark Cilia Vincenti - Internal Developer - Marketing
    GFI Software - www.gfi.com

    -----Original Message-----
    From: Addison Phillips [mailto:addison@yahoo-inc.com]
    Sent: 22 September 2006 11:39 PM
    To: Jukka K. Korpela
    Cc: Mark Cilia Vincenti; unicode@unicode.org
    Subject: Re: Problem with SSI and BOM

    Sadly...

    See: http://www.w3.org/International/questions/qa-utf8-bom

    The BOM is often rendered in the page, throwing off other display
    elements. One common problem on Windows is the prevalence of editors
    (Notepad!!) that add the UTF-8 BOM to text files stored as "UTF-8".
    While one might expect this to act as a "no-op" character, in practice,
    it isn't.

    Addison

    Jukka K. Korpela wrote:
    > On Fri, 22 Sep 2006, Mark Cilia Vincenti wrote:
    >
    >> I'm using SSI to include UTF-8 encoded files within a UTF-encoded
    >> HTML page on IIS (Internet Information Services). The problem is that
    >> the byte order mark is not being stripped by the SSI parser,
    >> resulting in BOMs within the HTML body.
    >
    > Can't you just remove the BOM? It's not needed in UTF-8 encoded data.
    > It might be thought of as a "signature" from which it is possible to
    > deduce (guess) the encoding. But for HTML files, you can and should
    > explicitly specify the encoding in HTTP headers (when they are
    > transmitted via HTTP) or in <meta> tags or both.
    >
    > If you can't do that for some reason, and if you can't make the
    > inclusion mechanism remove the BOM, it shouldn't be an issue, since
    > within data, BOM (U+FEFF, ZERO WIDTH NO-BREAK SPACE) should be treated
    > as an invisible character that "glues" the characters around it
    > together for the purposes of rendering, and this should normally do no
    > harm. Is there some reason to suspect that some browsers don't treat
    > BOM either that way or simply ignore it (which is usually the same
    > thing, for contexts where BOM would normally appear as a result of
    > inclusion)?
    >
    > See also the Unicode BOM FAQ,
    > http://www.unicode.org/unicode/faq/utf_bom.html
    >

    -- 
    Addison Phillips
    Globalization Architect -- Yahoo! Inc.
    Internationalization is an architecture.
    It is not a feature.
      
    


    This archive was generated by hypermail 2.1.5 : Tue Sep 26 2006 - 08:29:39 CST