Re: BOM's at Beginning of Web Pages?

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Feb 17 2003 - 01:05:37 EST

    Roozbeh Pournader <roozbeh at sharif dot edu> wrote:

    > Found it! It's forbidden to start a HTML 4.0 page with a UTF-8 BOM.
    > Proof:
    > ...
    > That's all. So the only characters that are allowed in a HTML 4.0 web
    > page before the HTML header, are U+0009, U+000A, U+000C, U+000D,
    > U+0020, and U+200B. QED.

    I can't argue with the excellent gumshoe work Roozbeh did. But it does
    seem peculiar, as Michka observed, that ZWSP should be a legal white
    space character for this purpose but ZWNBSP should not; and as James
    noted, it may have been an oversight. (I would add to Michka's comment
    that it seems equally bizarre to allow U+000C FORM FEED at the start of
    an HTML file but not U+FEFF.)
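
    To make the quoted rule concrete, here is a minimal sketch in Python
    that checks the characters preceding the first "<" against the six
    code points Roozbeh listed. The function name and the choice to treat
    the first "<" as the start of the HTML header are assumptions made for
    illustration; they are not taken from the HTML 4.0 specification.

```python
# The six code points the quoted proof lists as permitted before the
# HTML header: TAB, LF, FORM FEED, CR, SPACE, and ZERO WIDTH SPACE.
ALLOWED_LEADING = {"\u0009", "\u000A", "\u000C", "\u000D", "\u0020", "\u200B"}


def first_disallowed_leading_char(text):
    """Return the first character before the first '<' that is not in
    ALLOWED_LEADING, or None if every leading character is allowed.

    Hypothetical helper: it assumes the header starts at the first '<'.
    """
    for ch in text:
        if ch == "<":
            return None
        if ch not in ALLOWED_LEADING:
            return ch
    return None
```

    Under this reading, a page that starts with U+000C or U+200B passes,
    while one that starts with U+FEFF is reported as offending, which is
    exactly the asymmetry noted above.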

    > PS: UTF-16 is an exception to that, since the BOM is not part of the
    > document and should be removed for processing.

    If this is true -- that U+FEFF is a kind of meta-character that doesn't
    really belong to the text per se -- then it should be equally true for
    UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
    and UTF-32 but not UTF-8) or as a signature (potentially useful in all
    Unicode CES's). Only in its evil-twin role as a zero-width no-break
    space is it truly part of the text, in which case the earlier comments
    about white-space characters apply.
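
    As a rough illustration of the "signature" role, the Python sketch
    below recognizes a leading U+FEFF in each encoding form and separates
    it from the text that follows. The table and function names are
    hypothetical; the BOM constants come from Python's standard codecs
    module. This is one possible way to sniff a signature, not a
    description of what any particular browser does.

```python
import codecs

# Longer signatures are tested first, because the UTF-32-LE BOM
# (FF FE 00 00) begins with the same two bytes as the UTF-16-LE BOM.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def split_signature(data):
    """Return (encoding_name, payload) if data starts with an encoded
    U+FEFF, otherwise (None, data). Hypothetical helper."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data
```

    In UTF-8 the signature is the three-byte sequence EF BB BF, so
    split_signature(b"\xef\xbb\xbf<html>") yields "utf-8" and the bytes
    of "<html>" with the signature removed.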

    Michael (michka) Kaplan <michka at trigeminal dot com> wrote:

    > Rather than treating HTML like the SQL standard (lofty goals that no
    > one company completely supports because it would be insane to do it!)
    > they can bend to the actual usage out there and just move on, right?

    Michka is probably right that Notepad is one of the more popular HTML
    editors out there, but even though I'm sure he didn't mean it this way,
    I would prefer not to say anything that can be twisted into "the HTML
    specification should be changed to match the way Microsoft does things."
    That is bound to bring all the Microsoft haters out of the woodwork.
    Rather, I would stress the inconsistency of allowing U+FEFF at the
    beginning of an HTML file encoded in UTF-16 but not in one encoded in
    the much more common UTF-8.

    > Of course if I had a penny for every byte that has been used
    > discussing these three bytes sometimes found at the beginning of a
    > UTF-8 document, I would not be working this weekend; I'd be somewhere
    > really warm and sunny.

    There is so much disagreement, confusion, and misunderstanding
    surrounding these three little bytes that I feel the discussion is
    completely warranted. (At least nobody can ever claim it's off topic!)

    Roozbeh responded:

    > Well, that needs researching into what UTF-8 is in W3C and HTML 4.0
    > terms:
    > ...
    > RFC 2279. A copy can be found at
    > <http://www.ietf.org/rfc/rfc2279.txt>, or any other place you like and
    > search for FEFF, BOM, ZERO WIDTH NO-BREAK SPACE, or the sequence "EF
    > BB BF" there. Nothing can be found.

    RFC 2279 defines and describes the technical structure of UTF-8. Usage
    issues surrounding U+FEFF as either a signature or a ZWNBSP would have
    been out of scope. Most Unicode and WG2 documents do not discuss the
    BOM either.

    Michka wrote back:

    > If the problem was indeed due to a BOM then the answer *is* to fix the
    > browser. Windows 2000 and XP have shipped onto a gazillion machines
    > and a lot of people make quick spot changes to HTML pages in notepad.
    > The BOM is here and any browser that cannot handle not showing either
    > a BOM or a ZWNBSP can be classed as a dumb one.

    Certainly, Microsoft is in a position to fix their own browser to make
    it tolerant of the BOM. If they ship a quick and handy editor that
    prepends a BOM to UTF-8 text files (which I think is a good idea, for
    the reasons James cited), and if people are using that editor for HTML
    files encoded in UTF-8, then their browser should behave sensibly when
    handed an HTML file with a leading BOM. Messing up the layout at the
    top of a page is not sensible, and displaying a Euro sign is just plain
    weird.
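
    For what "behaving sensibly" might look like, here is a small Python
    sketch that decodes UTF-8 input and silently drops one leading
    signature if it is present. The function name is hypothetical, and
    this says nothing about how IE is actually implemented; it relies on
    Python's standard "utf-8-sig" codec.

```python
def decode_html_utf8(raw):
    """Decode UTF-8 bytes, discarding a single leading EF BB BF if present.

    Hypothetical helper. The "utf-8-sig" codec removes only a signature at
    the very start; a U+FEFF anywhere else stays in the text, where it
    acts as a zero-width no-break space.
    """
    return raw.decode("utf-8-sig")
```

    A file saved with or without the signature then decodes to the same
    text, so nothing stray is left to disturb the layout at the top of
    the page.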

    But note that so far, all of the weirdness seems to be with IE 5.2 for
    Macintosh. I've never seen any of this with IE 5.5 or 6.0 for Windows.
    (Indeed, my Web pages all used to begin with BOMs and I never noticed a
    problem, but I removed the BOMs when Michael Everson told me they
    displayed badly on his Mac.) So it seems only the Mac version of IE
    needs "fixing."

    I don't see anything wrong with IE allowing a BOM at the start of
    UTF-8-encoded HTML files, even if it is not expressly allowed by the
    HTML specification. Browser vendors have certainly gone farther than
    that to "extend" the standard in the past; remember Netscape's notorious
    <blink> element? But I also think the HTML Working Group should
    consider explicitly allowing the BOM at the start of HTML files encoded
    in UTF-8. (Note that it is explicitly allowed in XML.)

    -Doug Ewell
     Fullerton, California


