Re: BOM's at Beginning of Web Pages?

From: Tex Texin (tex@i18nguy.com)
Date: Mon Feb 17 2003 - 05:30:42 EST

    Dudes and Dudettes,

    Not sure I read all of the thread, but:

    1) A BOM is not only allowed but recommended in HTML UTF-16 documents.
    See section 5.1:
    http://www.w3.org/TR/REC-html40/charset.html

    I am not sure what the comment about removing BOM is referring to. Is that
    someone's explanation or is it in the standard somewhere?

    2) Much of this discussion seems to take place without looking at the
    timelines of the various docs.
    The UTF-8 BOM is a relatively recent addition to Unicode. Further, it is not
    necessary (i.e., it provides no information of value to the browser), so
    modifying the specs to include it hardly seems worthwhile.

    3) Good idea about not bringing out the Microsoft haters.
    The argument itself is weak enough to be laughable. Driving specifications
    based on Notepad behavior indeed.

    4) I don't see any real problems caused by the inconsistency of supporting a
    UTF-16 BOM and not supporting a UTF-8 BOM.
    Note that in HTML the BOM is only used to identify byte ordering. It is not
    used to indicate the encoding (unlike XML).
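    To make the byte-ordering point concrete, here is a minimal Python sketch.
    It assumes the document is already known to be UTF-16 from some other
    declaration, and it ignores UTF-32 (whose little-endian BOM also starts
    with FF FE). The function names are hypothetical, chosen for illustration.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None from a leading UTF-16 BOM.

    Assumes the caller already knows the data is UTF-16; the BOM is
    only consulted for byte order, not to pick the encoding.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None


def has_utf8_signature(data: bytes) -> bool:
    """True if the data starts with U+FEFF encoded as UTF-8 (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# U+FEFF serializes differently per byte order in UTF-16 ...
page_be = "\ufeff<html>".encode("utf-16-be")
page_le = "\ufeff<html>".encode("utf-16-le")
# ... but has exactly one byte form in UTF-8, so it says nothing about order.
page_u8 = "\ufeff<html>".encode("utf-8")
```

    Here utf16_byte_order(page_be) and utf16_byte_order(page_le) give different
    answers, while the UTF-8 form of U+FEFF is always the same three bytes.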

    There are already 2 legal ways to declare an encoding: the HTTP Content-Type
    header and the META content-type statement (ignoring the generally
    unsupported ANCHOR charset for links). We do not need a UTF-8 BOM, which
    neither declares an encoding nor identifies a serialization.
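
    As a rough illustration of those two declarations, the sketch below pulls
    the charset parameter out of a Content-Type value using Python's standard
    email.message module. The same value syntax appears in the HTTP header and
    in the META element's content attribute. The helper names and sample values
    are hypothetical, and letting the HTTP value win over META is my reading of
    the priority order in the HTML 4 charset chapter.

```python
from email.message import Message


def charset_from_content_type(value: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def effective_charset(http_value, meta_value):
    """Prefer the HTTP declaration, then fall back to the META declaration."""
    for value in (http_value, meta_value):
        if value:
            charset = charset_from_content_type(value)
            if charset:
                return charset
    return None


# Hypothetical sample declarations.
http_header = "text/html; charset=UTF-8"
meta_content = "text/html; charset=ISO-8859-1"
```

    With these samples, effective_charset(http_header, meta_content) picks the
    HTTP value, and neither declaration depends on a BOM being present.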

    5) References to RFC 2279 are depressing. It is overdue for an update, as it
    still references 6-byte sequences.
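
    For reference, RFC 2279 defines UTF-8 over the full 31-bit UCS-4 range,
    which is where the 6-byte forms come from. The sketch below computes the
    sequence length from the range table in that RFC; the function name is
    hypothetical. Unicode code points stop at U+10FFFF, which never needs more
    than 4 bytes.

```python
def rfc2279_length(code_point: int) -> int:
    """Number of octets RFC 2279 UTF-8 uses for a UCS-4 value (0..0x7FFFFFFF)."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    # Upper bound of each range in the RFC 2279 table, for 1 to 6 octets.
    limits = (0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF)
    for length, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return length
    raise AssertionError("unreachable: range was checked above")


bom_length = rfc2279_length(0xFEFF)          # U+FEFF itself
max_unicode_length = rfc2279_length(0x10FFFF)  # highest Unicode code point
max_ucs4_length = rfc2279_length(0x7FFFFFFF)   # top of the 31-bit range
```

    Only values above U+10FFFF, which Unicode does not assign, reach the 5- and
    6-byte forms.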

    6) Doug, you surprised me! I thought you were a supporter of standards... How
    can we have standards while recommending that people modify their products to
    accommodate whatever characters or innovations suit them? The mistakes of
    browser vendors in the past are not a good justification for ad hoc changes
    today.
    Just as with early Unicode, there were some difficulties doing everything you
    needed with the web standards. Those days are gone. Let's insist vendors
    comply with both W3C and Unicode standards, AS WRITTEN, or the world gets to
    be an ugly place to develop software in. I like having one set of web pages
    that work on multiple browsers and not having to do separate pages for
    different browsers. Please tell me it was just a case of your not having had
    your morning coffee yet... ;-)

    tex

    Doug Ewell wrote:
    > (I would add to Michka's comment
    > that it seems equally bizarre to allow U+000C FORM FEED at the start of
    > an HTML file but not U+FEFF.)
    >
    > > PS: UTF-16 is an exception to that, since the BOM is not part of the
    > > document and should be removed for processing.
    >
    > If this is true -- that U+FEFF is a kind of meta-character that doesn't
    > really belong to the text per se -- then it should be equally true for
    > UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
    > and UTF-32 but not UTF-8) or as a signature (potentially useful in all
    > Unicode CES's). Only in its evil-twin role as a zero-width no-break
    > space is it truly part of the text, in which case the previous
    > discussion comments about white-space characters applies.
    >
    > Michka is probably right that Notepad is one of the more popular HTML
    > editors out there, but even though I'm sure he didn't mean it this way,
    > I would prefer not to say anything that can be twisted into "the HTML
    > specification should be changed to match the way Microsoft does things."
    > That is bound to bring all the Microsoft haters out of the woodwork.
    > Rather, I would stress the inconsistency of allowing U+FEFF at the
    > beginning of an HTML file encoded in UTF-16 but not in one encoded in
    > the much more common UTF-8.

    > Roozbeh responded:

    > RFC 2279 defines and describes the technical structure of UTF-8. Usage
    > issues surrounding U+FEFF as either a signature or a ZWNBSP would have
    > been out of scope. Most Unicode and WG2 documents do not discuss the
    > BOM either.
    >
    Doug startles me with:

    > I don't see anything wrong with IE allowing a BOM at the start of
    > UTF-8-encoded HTML files, even if it is not expressly allowed by the
    > HTML specification. Browser vendors have certainly gone farther than
    > that to "extend" the standard in the past; remember Netscape's notorious
    > <blink> element? But I also think the HTML Working Group should
    > consider explicitly allowing the BOM at the start of HTML files encoded
    > in UTF-8. (Note that it is explicitly allowed in XML.)
    >
    > -Doug Ewell
    > Fullerton, California

    -- 
    -------------------------------------------------------------
    Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
    Xen Master                          http://www.i18nGuy.com
                             
    XenCraft		            http://www.XenCraft.com
    Making e-Business Work Around the World
    -------------------------------------------------------------
    

