Re: UTF-8 'BOM'

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu Jan 20 2005 - 12:38:05 CST


    On Thu, 20 Jan 2005 19:16:30 +0100, Hans Aberg wrote:
    >
    > On 2005/01/20 14:14, Christopher Fynn at cfynn@gmx.net wrote:
    >
    > > Hans Aberg wrote:
    > >
    > >
    > >> It is much better if the BOM is illegal in UTF-8. It does not prevent MS from
    > >> using it, instead labelling it as a file format marker for MS text files. A
    > >> program that then deals with MS text files must then know about the BOM and
    > >> remove it when and if appropriate. At the same time, it does not cause any
    > >> problems for programs that normally do not handle MS text files but only
    > >> plain text: They are fine as they are. Everyone should be able to be happy.
    > >
    > > Since BOM is a valid Unicode & ISO 10646 character and UTF-8 is a
    > > transformation format of Unicode & 10646, if BOM were illegal in UTF-8
    > > it couldn't be used for *all* Unicode characters.
    >
    > The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
    > appearing as though in UTF-16. 0xFEFF is a Unicode number, and could still be
    > translated into UTF-8. So the BOM in UTF-8 is a really strange animal.
    >

    The BOM generated by Notepad and other Windows applications at the start of
    UTF-8 files is 0xEF 0xBB 0xBF, which is the UTF-8 transformation of the
    valid Unicode character U+FEFF, so no process that claims to handle UTF-8
    files should have any problem with it. If you do get the raw bytes 0xFE 0xFF
    at the start of (or anywhere in) a UTF-8 file, then that IS very wrong ...
    but I've never seen such an animal.
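
    For anyone who wants to check the byte values for themselves, here is a
    minimal sketch in Python 3. The helper name is_valid_utf8 is made up for
    this illustration; everything else is the standard library. It shows that
    U+FEFF encodes to EF BB BF, and that the raw UTF-16 bytes FE FF are
    rejected by a strict UTF-8 decoder.

```python
import codecs

BOM = "\ufeff"  # U+FEFF, the character used as the byte order mark

# Encoding U+FEFF as UTF-8 yields the three bytes EF BB BF.
utf8_bom = BOM.encode("utf-8")
print(utf8_bom.hex())  # efbbbf

# The standard library exposes the same byte sequence as a constant.
assert utf8_bom == codecs.BOM_UTF8


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A file that starts with EF BB BF is well-formed UTF-8.
with_utf8_bom = utf8_bom + b"hello"
print(is_valid_utf8(with_utf8_bom))  # True

# A file that starts with the raw UTF-16BE bytes FE FF is not:
# the bytes 0xFE and 0xFF never occur in well-formed UTF-8.
with_raw_bom = b"\xfe\xff" + b"hello"
print(is_valid_utf8(with_raw_bom))  # False

# A plain UTF-8 decode keeps U+FEFF as the first character, whereas
# Python's "utf-8-sig" codec strips one leading BOM if it is present.
kept = with_utf8_bom.decode("utf-8")
stripped = with_utf8_bom.decode("utf-8-sig")
print(repr(kept))      # '\ufeffhello'
print(repr(stripped))  # 'hello'
```

    Whether a program should strip the leading U+FEFF is a separate question
    from whether the bytes are valid UTF-8; the last two lines only show that
    both behaviours are available to a reader of the file.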

    Andrew


