Re: BOM's at Beginning of Web Pages?

From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Sat Feb 15 2003 - 21:25:18 EST


    Found it! It's forbidden to start an HTML 4.0 page with a UTF-8 BOM. Proof:

    1. Open the main page of Unicode. You can see that the HTML header says:

       <!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html>

    So, we are talking about HTML 4.0 here. The reference for HTML 4.0 is:

       http://www.w3.org/TR/1998/REC-html40-19980424/

    The section about the HTML header is Section 7.1, Introduction to the
    structure of an HTML document:

       http://www.w3.org/TR/1998/REC-html40-19980424/struct/global.html#h-7.1

    which mentions:

      "An HTML 4.0 document is composed of three parts:

          1. a line containing HTML version information,
          2. a declarative header section (delimited by the HEAD element),
          3. a body, which contains the document's actual content. The body
             may be implemented by the BODY element or the FRAMESET element.

       White space (spaces, newlines, tabs, and comments) may appear before or
       after each section. Sections 2 and 3 should be delimited by the HTML
       element."

    So "White space" is allowed before the line containing HTML version
    information. But what is white space? It is defined in Section 9.1, White
    space:

      "The document character set includes a wide variety of white space
       characters. Many of these are typographic elements used in some
       applications to produce particular visual spacing effects. In HTML,
       only the following characters are defined as white space characters:

          * ASCII space (&#x0020;)
          * ASCII tab (&#x0009;)
          * ASCII form feed (&#x000C;)
          * Zero-width space (&#x200B;)
       
       Line breaks are also white space characters."

    So we need to know what a line break is. Section 9.3.2 defines
    it:

      "A line break is defined to be a carriage return (&#x000D;), a line feed
       (&#x000A;), or a carriage return/line feed pair."

    That's all. So the only characters allowed in an HTML 4.0 web page
    before the HTML header are U+0009, U+000A, U+000C, U+000D, U+0020, and
    U+200B. U+FEFF, the BOM, is not among them. QED.
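
    The argument above can be sketched as a small check. This is only an
    illustration in Python, not anything taken from the HTML 4.0
    specification: the names HTML40_WHITE_SPACE, leading_non_white_space,
    and starts_with_utf8_bom are made up for this example, and the sketch
    assumes that everything before the first "<" counts as the text that
    precedes the version information line.

```python
# Illustrative sketch only; all names here are hypothetical helpers.
# The six characters that HTML 4.0 sections 9.1 and 9.3.2 define as
# white space: tab, line feed, form feed, carriage return, space,
# and zero-width space.
HTML40_WHITE_SPACE = frozenset("\u0009\u000A\u000C\u000D\u0020\u200B")

UTF8_BOM_BYTES = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def leading_non_white_space(text):
    """Return the characters before the first '<' that are not HTML 4.0 white space."""
    end = text.find("<")
    prefix = text if end == -1 else text[:end]
    return [ch for ch in prefix if ch not in HTML40_WHITE_SPACE]


def starts_with_utf8_bom(raw):
    """Return True if the raw bytes of a page begin with the UTF-8 BOM."""
    return raw.startswith(UTF8_BOM_BYTES)


page = '\ufeff<!doctype HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><html>'
offending = leading_non_white_space(page)  # the BOM is the only offender
```

    A page whose prefix consists only of the six listed characters yields
    an empty list; a page that begins with U+FEFF does not.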

    roozbeh

    PS: UTF-16 is an exception to that, since the BOM is not part of the
    document and should be removed for processing.
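
    As a rough illustration of the difference, here is how one decoder
    happens to behave. This shows Python's standard codecs only, as an
    assumption-laden example; it says nothing about what any browser or
    HTML parser does. Python's "utf-16" codec consumes the BOM while
    decoding, whereas its plain "utf-8" codec leaves U+FEFF in the text
    as the first character.

```python
import codecs

# The same tiny page, encoded with a BOM in two different encodings.
utf16_page = codecs.BOM_UTF16_LE + "<html>".encode("utf-16-le")
utf8_page = codecs.BOM_UTF8 + "<html>".encode("utf-8")

# The "utf-16" codec uses the BOM to pick the byte order and drops it.
decoded_utf16 = utf16_page.decode("utf-16")

# The plain "utf-8" codec keeps U+FEFF as the first decoded character.
decoded_utf8 = utf8_page.decode("utf-8")
```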


