UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Jan 19 2005 - 14:15:18 CST


    Hans Aberg wrote:
    > On 2005/01/19 01:56, Peter Kirk at peterkirk@qaya.org wrote:
    > > On 19/01/2005 00:09, Hans Aberg wrote:
    > >> UTF-8 BOM's seem pointless.
    > > Maybe. Nevertheless, they exist, not only as a result of unintelligent
    > > conversion from UTF-16 or UTF-32 to UTF-8, but also because at least one
    > > UTF-8 editor, Notepad on Windows 2000 (and XP?), always emits a BOM at
    > > the start of a UTF-8 file.
    > Well, it seems easier to change that single editor, then. Or write a
    > program that removes it at need.

    At first, one would think that the UTF-8 'BOM' emitted by Notepad is an
    oversight, a bug. But that is not the case.

    A long time ago, Notepad worked only on 8-bit legacy encoded files, always
    in your current Windows codepage.

    Then Notepad was rewritten in Unicode and got the ability to save files in
    'Unicode' (UCS-2). When opening a file, it used the BOM to distinguish the
    two flavors of text files.

    Later, Notepad gained the ability to save UTF-8 files. The UTF-8 'BOM' is
    emitted for the same purpose: to distinguish UTF-8 files from legacy
    encoded files. So you always get back the text you saved, displayed
    properly. But yes, you cannot use Notepad to edit UNIX files or UTF-8 HTML
    files without it adding a BOM that other tools may not expect.
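To make the mechanism concrete, here is a minimal sketch in Python of BOM-based detection as described above. It is not Notepad's actual code. The function names are made up for illustration, the legacy codepage is assumed to be Windows-1252 (in practice it would be whatever the current Windows codepage is), and only the BOM is inspected, with no other heuristics. It also includes the "program that removes it at need" suggested in the quoted message.

```python
import codecs

# Assumption for illustration: the legacy codepage is Windows-1252.
LEGACY_CODEPAGE = "cp1252"


def sniff_encoding(data: bytes) -> str:
    """Pick a codec name by looking only at the leading bytes (the BOM)."""
    if data.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "utf-8-sig"  # this codec drops the BOM while decoding
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    # No BOM: nothing marks the file as Unicode, so treat it as legacy text.
    return LEGACY_CODEPAGE


def decode_text(data: bytes) -> str:
    """Decode file contents using the encoding guessed from the BOM."""
    return data.decode(sniff_encoding(data))


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 'BOM', if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Under these assumptions, UTF-8 text saved with the 'BOM' round-trips correctly, while the same text without it is read as legacy text: `decode_text("é".encode("utf-8"))` gives `"Ã©"`, not `"é"`. That is the ambiguity the 'BOM' is there to avoid.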

    It's a question of what Notepad is: a plain text editor, or an editor for
    "Text documents"? From Microsoft's perspective it's probably the latter,
    since Windows has practically no text files at all, except those
    generated as "Text documents". For everything else (like HTML), you have
    tools.

    Not that I agree with that approach or like the consequences, but that is
    what they probably had in mind.

