Re: BOM's at Beginning of Web Pages?

From: Martin Duerst (duerst@w3.org)
Date: Mon Feb 17 2003 - 17:36:53 EST

Next message: Jungshik Shin: "Re: DBCS and Unicode 3.1"

Previous message: Doug Ewell: "Re: DBCS and Unicode 3.1"
In reply to: Roozbeh Pournader: "Re: BOM's at Beginning of Web Pages?"
Next in thread: Jonathan Coxhead: "Re: BOM's at Beginning of Web Pages?"
Reply: Jonathan Coxhead: "Re: BOM's at Beginning of Web Pages?"

Some comments:

- If you can avoid it, don't use a BOM at the start of a UTF-8
   HTML file. Without one, the page will display nicely on more browsers.

- The W3C Validator http://validator.w3.org/ accepts the BOM for
   HTML 4.01, and also XHTML. It probably should produce a warning.
   It did when I originally added code to handle it. I have requested
   that it be added again.

- Adding a BOM/ZWNBSP to the whitespace definition is a bad idea,
   because it would allow a ZWNBSP in all kinds of places where
   not seeing a space would be confusing (e.g. between attributes).
   Also, HTML 4 is only being maintained, not being developed.

- That HTML 4.0 allows ZWSP (U+200B) as whitespace in
   http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 is for
   line breaking/rendering reasons (Thai), within element content.
   This is in conflict with the whitespace definition for syntactic
   purposes, which is formally given at
   http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html and does
   not include ZWSP (U+200B). I have filed a request for
   clarification.

- RFC 2279 does not approve or disapprove of the BOM. Both Unicode
   and ISO 10646 allow the BOM as a signature for UTF-8. RFC 2279
   is being updated. See
   http://lists.w3.org/Archives/Public/ietf-charsets/2003JanMar/0209.html.

- For XML, a BOM at the start of a UTF-8 document is allowed by an
   erratum at http://www.w3.org/XML/xml-V10-2e-errata#E22. But as with
   HTML, it is better not to start your XML files with a BOM, because
   there are XML parsers out there that don't like it (and rejecting
   it was okay at least until 2001-07-25).

- The BOM is both rather handy in a Windows/Notepad scenario and
   seriously disruptive in a Unix-like filter scenario (which may
   also be on Windows). I have found that Notepad doesn't need the
   BOM to detect that a file is UTF-8 if it has enough other information
   (this is on a Japanese Win2000, your mileage may vary). It would be
   nice if it had a setting to not produce a BOM.

- I append a small Perl program that removes a UTF-8 BOM if there
   is one. Quite handy, I use it regularly. Feel free to use and change
   it on your own responsibility.
   (i.e. if it starts to eat up your files, don't blame me!)
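
To make the filter problem and the "detect UTF-8 without a BOM" point
above a bit more concrete, here is a small Python sketch. It is only an
illustration under assumptions: the helper names has_utf8_bom and
looks_like_utf8 are made up for this example, and the strict-decode
check is just one possible heuristic, not a description of what Notepad
actually does.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic (an assumption of this sketch): the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A script file as a BOM-producing editor might save it.
sample = codecs.BOM_UTF8 + "#!/bin/sh\n".encode("utf-8")

# A plain UTF-8 decode keeps the BOM as U+FEFF, so the text no longer
# starts with "#!"; this is the kind of breakage a filter would see.
plain = sample.decode("utf-8")

# The "utf-8-sig" codec drops one leading BOM if it is present.
stripped = sample.decode("utf-8-sig")
```

Here plain begins with U+FEFF while stripped begins with "#!", which is
why a tool that looks only at the first bytes of its input gets confused
by the signature.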

Regards, Martin.

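#!/usr/bin/perl

# Program to remove a leading UTF-8 BOM (bytes EF BB BF) from a file.
# Works both STDIN -> STDOUT and on the spot (with filename as argument).

use strict;
use warnings;

if (@ARGV > 1) {
    print STDERR "Too many arguments!\n";
    exit 1;
}

my @file;        # file content, one line per element
my $lineno = 0;  # number of lines read so far

my $filename = $ARGV[0];
if (defined $filename) {
    # Read the whole file first, because it is overwritten below.
    open my $in, '<', $filename or die "Cannot read $filename: $!\n";
    binmode $in;
    while (<$in>) {
        s/^\xEF\xBB\xBF// if !$lineno++;  # only the first line can carry a BOM
        push @file, $_;
    }
    close $in;
    open my $out, '>', $filename or die "Cannot write $filename: $!\n";
    binmode $out;
    print {$out} @file;
    close $out or die "Cannot finish writing $filename: $!\n";
}
else { # STDIN -> STDOUT
    binmode STDIN;
    binmode STDOUT;
    while (<STDIN>) {
        s/^\xEF\xBB\xBF// if !$lineno++;  # only the first line can carry a BOM
        print;
    }
}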
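
For readers who would rather work on the raw bytes than line by line,
the same idea can be sketched in Python. This is an illustration only,
not a translation of the Perl program above: the names strip_utf8_bom
and strip_file are made up for the example, and the in-place helper
assumes the file is small enough to read into memory.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without one leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_file(path: str) -> bool:
    """Remove a leading BOM from the file at path; return True if it changed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # no BOM found, so leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

Unlike the Perl program, strip_file rewrites the file only when a BOM
was actually found, so files without one are left untouched. The same
warning applies: try it on a copy first.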


This archive was generated by hypermail 2.1.5 : Mon Feb 17 2003 - 20:27:32 EST