Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Hans Aberg (
Date: Wed Jan 19 2005 - 17:51:30 CST

    At 21:15 +0100 2005/01/19, Lars Kristan wrote:
    >Hans Aberg wrote:
    >> On 2005/01/19 01:56, Peter Kirk wrote:
    >> > On 19/01/2005 00:09, Hans Aberg wrote:
    >> >> UTF-8 BOM's seem pointless.
    >> > Maybe. Nevertheless, they exist, not only as a result of unintelligent
    >> > conversion from UTF-16 or UTF-32 to UTF-8, but also because at least one
    >> > UTF-8 editor, Notepad on Windows 2000 (and XP?), always emits a BOM at
    >> > the start of a UTF-8 file.
    >> Well, it seems easier to change that single editor, then. Or write a program
    >> that removes it at need.
    >At first, one would think that the UTF-8 'BOM' emitted by Notepad is an
    >oversight, a bug. But that is not the case.
    >A long time ago, Notepad worked on 8-bit legacy encoded files. Always in your
    >current Windows codepage.
    >Then Notepad was rewritten in Unicode and got the ability to save files in
    >'Unicode' (UCS-2). When opening a file, it used the BOM to distinguish the two
    >flavors of text files.
    >Now Notepad got the ability to save UTF-8 files. And the UTF-8 'BOM' is emitted
    >for the same purpose - to be able to distinguish the UTF-8 files from legacy
    >encoded files. So, you always get the text you saved back, displayed properly.
    >But yes, you cannot use Notepad to edit UNIX files, or UTF-8 html files.
    >It's a question of what Notepad is - is it a plain text editor or is it an
    >editor for "Text documents"? From Microsoft perspective it's probably the
    >latter, since Windows practically doesn't have any text files at all. Except
    >those generated as "Text documents". For everything else (like html), you have

    It is clear that the program produces files in an in-house file format for
    handling text, and not a plain text format. As the format is platform
    specific, when a file is transferred off the platform onto, say, the
    Internet, the BOM should be removed so that it becomes a plain text file.
    Unicode should have pointed this out to MS. One can compare this, for
    example, with Mac OS, which also uses additional resources to display file
    information such as file format, which program is used to handle it, etc.
    When such a file is transferred onto the Internet as plain text, all that
    extra data has to be removed. Unicode does not provide support for such
    extra file information for Mac OS, nor for any other platform. So the MS OS
    should not be treated specially in this respect.
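
    As a rough illustration of the "program that removes it at need", here is
    a minimal sketch in Python. The function names strip_utf8_bom and sniff_bom
    are made up for this example, and the sniffing is only a guess at the kind
    of check described above (BOM present or not); it is not Notepad's actual
    logic, and it deliberately ignores UTF-32.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 signature (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def sniff_bom(data: bytes) -> str:
    """Classify a byte string by its leading BOM (hypothetical helper).

    Returns "utf-8", "utf-16-le", "utf-16-be", or "legacy" when no BOM is
    found. UTF-32 is not considered, so a UTF-32 LE file would be reported
    as "utf-16-le" here.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "legacy"
```

    One would run strip_utf8_bom over the bytes of a file before sending it
    off the platform; a file without the signature passes through unchanged.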

      Hans Aberg
