Re: BOM's at Beginning of Web Pages?

From: Tex Texin (
Date: Tue Feb 18 2003 - 00:41:02 EST


    1) Please note I said the "UTF-8 BOM" was relatively recent, not the BOM.

    2) Yes, the more forms of encoding declaration that exist, the more likely
    conflicts become, and the more complex the precedence rules have to be.

    And yes, I am sure others have said it, but I also harp on the problem that
    HTTP charset declarations are out of the control of web authors.

    3) Yes, RFC 2279 being outdated was beside the point. I knew there was a
    replacement in the works but couldn't find it by searching the RFCs or Google,
    so I was fishing for someone else to bring it up. I am glad Martin did.

    4) I thought your comment about vendors going beyond standards was referring to
    more than UTF-8, but mostly I was pulling your chain a little. ;-)

    5) "I do think something may need to be done at Microsoft" LOL, thanks.

    6) "UTF-8 signatures are not evil" OK. In and of themselves, they are not.
    Mandating their use everywhere is evil. Notepad is broken in always outputting
    one, since Notepad is also used for files that are not plain text. The rest of
    the world should not change because Notepad is broken. There are plenty of
    files that have their encoding indicated by other means. Adding a UTF-8 BOM
    where it is not needed breaks existing software (filters, as Martin
    mentioned) and adds ambiguity in many situations where there was none.
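
To make point 6 concrete, here is a minimal Python sketch of what a consumer has to do once signatures show up on files whose encoding is already declared elsewhere. The function names are hypothetical, invented for illustration; the only facts assumed are that the UTF-8 signature is the three bytes EF BB BF and that Python's codecs module normalizes charset labels.

```python
import codecs
from typing import Tuple

# U+FEFF encoded as UTF-8: the three bytes EF BB BF.
UTF8_SIGNATURE = codecs.BOM_UTF8


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (had_signature, payload), dropping a leading UTF-8 signature."""
    if data.startswith(UTF8_SIGNATURE):
        return True, data[len(UTF8_SIGNATURE):]
    return False, data


def signature_conflicts(data: bytes, declared_charset: str) -> bool:
    """True if the data starts with a UTF-8 signature while the declared
    charset (e.g. from HTTP or a meta element) names something else.
    An unknown charset label raises LookupError."""
    had_signature, _ = split_utf8_signature(data)
    if not had_signature:
        return False
    # codecs.lookup() normalizes aliases such as "UTF8" to "utf-8".
    return codecs.lookup(declared_charset).name != "utf-8"


page = UTF8_SIGNATURE + b"<html></html>"
print(split_utf8_signature(page))
print(signature_conflicts(page, "iso-8859-1"))
```

A filter that does not know about the signature would instead pass the three extra bytes through as content, which is the breakage described above.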


    Doug Ewell wrote:
    > Tex Texin <tex at i18nguy dot com> wrote:
    > > 2) Much of this discussion seems to take place without looking at the
    > > timelines of the various docs.
    > > The UTF-8 BOM is a relatively recent addition to Unicode. Further, it is
    > > not necessary (i.e., it provides no information of value to the browser), so
    > > modifying the specs to include it hardly seems worthwhile.
    > The BOM-as-encoding-signature dates back to the publication of Unicode
    > 1.0, Volume 2 (p. 7) in 1992:

    > > 4) I don't see any real problems caused by the inconsistency of
    > > supporting a UTF-16 BOM and not supporting a UTF-8 BOM.
    > > Note that in HTML the BOM is only used to identify byte ordering. It
    > > is not used to indicate the encoding (unlike XML).
    > The HTML spec does say that, and that is a very good point. It is
    > frequently pointed out that UTF-8 does not need a byte order mark, which
    > is true but usually not relevant to discussions about using it as a
    > signature. But in the case of HTML, byte order really is the issue.
    > > 5) References to RFC 2279 are depressing. It is overdue for an update
    > > as it references 6 byte transformations.
    > This is beside the point of why Roozbeh and I mentioned it. (BTW, I
    > still prefer the RFC 2279 explanation of UTF-8 to anything I have seen
    > in the Unicode book or Web site.)
    > > 6) Doug you surprised me! I thought you were a supporter of
    > > standards... How can we have standards while recommending people
    > > modify their products to accommodate whatever characters or
    > > innovations suits them. The mistakes of browser vendors in the past
    > > is not a good justification for ad hoc changes today.
    > Well, I am a supporter of standards, and I thought I was suggesting only
    > a slight and relatively harmless bending of the HTML letter-of-the-law.
    > (The old maxim, "Be conservative in what you send and liberal in what
    > you accept.") I thought allowing an initial U+FEFF was far less
    > cavalier than some other things browsers do, and Deborah confirms that
    > browsers sometimes have to be liberal. But I concede that there is a
    > potential problem if the file starts with a UTF-8 signature and the
    > meta-charset declaration specifies something other than UTF-8.
    > I do think something may need to be done at Microsoft (I don't know
    > what) about the problem of Notepad writing UTF-8 files that contain a
    > signature and IE displaying them in an unexpected way. I don't think
    > Notepad is anybody's favorite editor in the world, but it's definitely
    > "good enough" for many purposes (not to mention free and ubiquitous).
    > The Notepad practice of automatically prepending a signature to UTF-8
    > files does have the major advantage that naïve users don't have to worry
    > about the file type when they load it back into Notepad later. The
    > "NESTLÉ®" problem shows that UTF-8 autodetection can't be guaranteed to
    > work 100% of the time. If a user saves a file and then reloads it and
    > it ends up corrupted because the autodetection failed, she may blame
    > Unicode rather than the editor. I'd say that by ensuring the success of
    > UTF-8 loads and saves, without requiring any intervention on the part of
    > the user, Notepad's UTF-8 signature convention may actually help spread
    > the use of Unicode among Windows users.
    > In summary,
    > 1. OK, you're right: HTML files in UTF-8 should not begin with a
    > signature.
    > 2. But that's only because the HTML spec says so, not because UTF-8
    > signatures are evil.
    > 3. Notepad writes UTF-8 files with signatures for a good reason.
    > 4. So for now at least, don't use Notepad for HTML. (Try SC UniPad
    > instead. :-)
    > 5. None of this applies to Unix or Linux systems, which can't handle
    > any type of file signature.
    > > Please tell me it was just a case of your not having had
    > > your morning coffee yet... ;-)
    > In my case it would be tea; but yes, maybe that was the problem.
    > -Doug Ewell
    > Fullerton, California
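
For readers unfamiliar with the "NESTLÉ®" problem mentioned in the quoted message, the following Python sketch reproduces it. It assumes the legacy text was saved as Windows-1252 (the bytes for É and ® are the same in ISO 8859-1), and the helper name is hypothetical. The point is that the byte pair C9 AE happens to be a well-formed two-byte UTF-8 sequence, so a "does it decode as UTF-8?" test gives a false positive.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Naive autodetection: accept the data as UTF-8 if it decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# "NESTLÉ®" as stored by a legacy Windows-1252 editor: ... C9 AE
legacy_bytes = "NESTLÉ®".encode("cp1252")

# The detector is fooled, because C9 AE is valid UTF-8 for U+026E.
print(looks_like_utf8(legacy_bytes))

# Decoding as UTF-8 silently turns the last two characters into one.
misread = legacy_bytes.decode("utf-8")
print(misread == "NESTL\u026e")
```

A leading signature removes the guesswork for the editor that wrote the file, which is the advantage claimed for Notepad above; it does nothing for consumers that do not expect it.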

    Tex Texin   cell: +1 781 789 1898
    Xen Master                
    Making e-Business Work Around the World

    This archive was generated by hypermail 2.1.5 : Tue Feb 18 2003 - 01:25:37 EST