Re: BOM's at Beginning of Web Pages?

From: Martin Duerst (duerst@w3.org)
Date: Mon Feb 17 2003 - 17:36:53 EST


    Some comments:

    - If you can avoid it, don't use a BOM at the start of a UTF-8
       HTML file. A file without one will display correctly on more browsers.

    - The W3C Validator http://validator.w3.org/ accepts the BOM for
       HTML 4.01, and also XHTML. It probably should produce a warning.
       It did when I originally added code to handle it. I have requested
       that it be added again.

    - Adding a BOM/ZWNBSP to the whitespace definition is a bad idea,
       because it would allow a ZWNBSP in all kinds of places where
       not seeing a space would be confusing (e.g. between attributes).
       Also, HTML 4 is only being maintained, not being developed.

    - That HTML 4.0 allows ZWSP (&#x200B;) as whitespace in
       http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 is for
       line breaking/rendering reasons (Thai), within element content.
       This is in conflict with the whitespace definition for syntactic
       purposes, which is formally given at
       http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html and does
       not include ZWSP (&#x200B;). I have filed a request for
       clarification.

    - RFC 2279 does not approve or disapprove of the BOM. Both Unicode
       and ISO 10646 allow the BOM as a signature for UTF-8. RFC 2279
       is being updated. See
       http://lists.w3.org/Archives/Public/ietf-charsets/2003JanMar/0209.html.

    - For XML, a BOM at the start of UTF-8 is allowed by an erratum at
       http://www.w3.org/XML/xml-V10-2e-errata#E22. But as with HTML,
       it is better not to start your XML files with a BOM, because there
       are XML parsers out there that don't like it (and rejecting it was
       okay at least until 2001-07-25).

    - The BOM is both rather handy in a Windows/Notepad scenario and
       seriously disruptive in a Unix-like filter scenario (which may
       also be on Windows). I have found that Notepad doesn't need the
       BOM to detect that a file is UTF-8 if it has enough other information
       (this is on a Japanese Win2000, your mileage may vary). It would be
       nice if it had a setting to not produce a BOM.

    - I append a small Perl program that removes a UTF-8 BOM if there
       is one. Quite handy, I use it regularly. Feel free to use and change
       it at your own risk.
       (i.e. if it starts to eat up your files, don't blame me!)

    Regards, Martin.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Remove a leading UTF-8 BOM (bytes EF BB BF) from a file.
    # Works both STDIN -> STDOUT and in place (with a filename as argument).

    if (@ARGV > 1) {
        print STDERR "Too many arguments!\n";
        exit 1;
    }

    my @file;        # file content, one line per element
    my $lineno = 0;  # lines read so far; a BOM can only be on the first

    my $filename = $ARGV[0];
    if (defined $filename) {
        # Read the whole file first, because it is overwritten below.
        open my $in, '<', $filename or die "Cannot read $filename: $!\n";
        while (<$in>) {
            s/^\xEF\xBB\xBF// if !$lineno++;
            push @file, $_;
        }
        close $in;
        # Only reached if reading succeeded, so a failed read cannot
        # leave the file truncated.
        open my $out, '>', $filename or die "Cannot write $filename: $!\n";
        print $out @file;
        close $out or die "Cannot close $filename: $!\n";
    }
    else { # STDIN -> STDOUT
        while (<>) {
            s/^\xEF\xBB\xBF// if !$lineno++;
            push @file, $_;
        }
        print @file;
    }
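
    A check-only variant can be useful before rewriting anything in place.
    The following is a minimal sketch, not part of the program above: the
    helper names has_utf8_bom and strip_utf8_bom are made up for
    illustration, and it assumes the input is handled as raw bytes. It
    reports, for each file named on the command line, whether the file
    starts with the bytes EF BB BF, and leaves the file untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers for illustration; not part of the original script.

# Return 1 if the byte string starts with the UTF-8 BOM (EF BB BF), else 0.
sub has_utf8_bom {
    my ($bytes) = @_;
    return $bytes =~ /\A\xEF\xBB\xBF/ ? 1 : 0;
}

# Return a copy of the byte string with a leading UTF-8 BOM removed.
# A BOM anywhere other than the very start is left alone.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Report on each file named on the command line without modifying it.
for my $name (@ARGV) {
    my $fh;
    unless (open $fh, '<:raw', $name) {
        warn "Cannot read $name: $!\n";
        next;
    }
    my $head = '';
    read $fh, $head, 3;   # the first three bytes are enough to decide
    close $fh;
    printf "%s: %s\n", $name, has_utf8_bom($head) ? 'BOM' : 'no BOM';
}
```

    Saved under a name of your choice and called with one or more
    filenames, it prints one line per file. Opening with '<:raw' keeps Perl
    from applying any newline or encoding translation, so the comparison
    is done on the bytes as stored.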


