Re: BOM's at Beginning of Web Pages?

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Feb 18 2003 - 01:30:04 EST

    Tex Texin <tex at i18nguy dot com> wrote:

    > 6) "UTF-8 signatures are not evil" ok. In and of themselves, they are
    > not. Mandating their use everywhere is evil. Notepad is broken in
    > always outputting it, since Notepad is used for files that are also
    > not plain text. The rest of the world should not change because
    > Notepad is broken. There are plenty of files that have their encoding
    > indicated by other means. Adding a UTF-8 BOM where it is not needed
    > breaks existing software, filters as Martin mentioned, and adds
    > ambiguity in many situations where there is no ambiguity.

    This is the tricky part, IMHO.

    Notepad is *intended* for plain text. The fact that many people use
    Notepad for HTML, something for which it wasn't really intended, isn't
    necessarily a defect in Notepad. In fact, MS has always presented
    Notepad as a really, really stripped-down editor, almost a toy, and
    pointed users toward WordPad (and before that, Write) if they wanted to
    get serious work done. (I'm sure MS would prefer we use FrontPage or
    Word to create Web pages.) I was frankly surprised that they upgraded
    Notepad in Windows 2000 to support UTF-8 and UTF-16.

    MS could easily "fix" Notepad to make the writing of UTF-8 signatures a
    user-controllable option, as it is in SC UniPad. But as I wrote
    earlier, removing the signature would mean that Notepad would have to
    rely on either (a) autodetection or (b) user intervention and knowledge
    in order to work with UTF-8. This is OK for me, you, and everyone else
    on the Unicode mailing list; we understand that there are different
    encoding schemes and that we may need to intervene to resolve
    differences or errors. But I'm not sure the average Notepad user
    understands all this and can deal with it without blaming Unicode.
    Don't forget, there are many users who think "Unicode" means "files get
    twice as big."
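
    To make the "user-controllable option" concrete, here is a minimal
    sketch of what such a switch amounts to. The function names and the
    write_signature flag are hypothetical, chosen for illustration; this
    is not how Notepad or UniPad is actually implemented. The signature
    itself is just the three bytes EF BB BF, which Python exposes as
    codecs.BOM_UTF8.

```python
import codecs


def encode_text(text: str, write_signature: bool) -> bytes:
    """Encode text as UTF-8, optionally prefixed with the UTF-8 signature.

    write_signature is a hypothetical user preference, standing in for
    the kind of option an editor could expose.
    """
    data = text.encode("utf-8")
    if write_signature:
        # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
        return codecs.BOM_UTF8 + data
    return data


def has_signature(data: bytes) -> bool:
    """Report whether the bytes begin with the UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)
```

    Without the signature, a reader has nothing like has_signature() to
    lean on, which is why the choice falls back to autodetection or to
    the user.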

    There will probably be a day when Windows works almost exclusively with
    UTF-8 files, rather than files encoded in local 8-bit code pages, and
    the heuristic will be reduced to:

    1. If any invalid UTF-8 sequences exist, assume local code page.
    2. Else, assume UTF-8.

    When that day comes, it will be safe for Windows to jettison the UTF-8
    signature.
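
    The two-step heuristic above can be sketched in a few lines. This is
    only an illustration under stated assumptions: the function name is
    made up, and cp1252 is an arbitrary stand-in for "the local code
    page", which in practice depends on the user's system.

```python
from typing import Tuple


def decode_with_fallback(data: bytes,
                         local_code_page: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used) following the two-step heuristic.

    local_code_page defaults to cp1252 purely as an assumption for this
    sketch; a real program would ask the system for it.
    """
    try:
        # Step 2: no invalid sequences were found, so assume UTF-8.
        # Strict decoding raises on any invalid UTF-8 sequence.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Step 1: an invalid UTF-8 sequence exists, so assume the local
        # code page. Bytes undefined in that code page are replaced
        # rather than raising a second error.
        return data.decode(local_code_page, errors="replace"), local_code_page
```

    For example, the bytes 63 61 66 C3 A9 are valid UTF-8 and decode as
    such, while 63 61 66 E9 contain a truncated sequence and fall back
    to the code page. Note that pure ASCII input always lands in the
    UTF-8 branch, which is harmless because the two readings agree.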

    -Doug Ewell
     Fullerton, California


