RE: UTF-8 'BOM'

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Thu Jan 20 2005 - 13:10:29 CST

    > The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
    > appearing as though in UTF-16. 0xFEFF is Unicode number, and could be
    > still translated into UTF-8. So the BOM in UTF-8 is a really strange
    > animal.

    I hesitate to feed the thread, but what the heck.

    The statement above is confusingly written, but as I read it, it is wrong: the UTF-8 BOM is simply U+FEFF encoded in UTF-8, not the UTF-16 bytes.

    The Unicode scalar value (for the BOM character) is U+FEFF. In UTF-8 this is encoded as the byte sequence:

    0xEF 0xBB 0xBF

    This is the byte sequence that Notepad writes at the start of UTF-8 files saved from that editor.
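
    As an illustration only (not part of the original exchange), the encoding above can be checked with a few lines of Python. The names BOM_CHAR and strip_utf8_bom are made up for this sketch; codecs.BOM_UTF8 is the standard-library constant for the same three bytes.

```python
import codecs

# U+FEFF is the code point used as the byte order mark (BOM).
BOM_CHAR = "\ufeff"

# Encoding that code point as UTF-8 gives the three bytes EF BB BF.
utf8_bom = BOM_CHAR.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in utf8_bom))

# For comparison, the same code point in the two UTF-16 byte orders.
utf16_be = BOM_CHAR.encode("utf-16-be")  # bytes FE FF
utf16_le = BOM_CHAR.encode("utf-16-le")  # bytes FF FE


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for this sketch; it only looks at the first
    three bytes and leaves everything else untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

    A reader that wants the BOM dropped while decoding could also use Python's "utf-8-sig" codec, which skips a leading EF BB BF if it is there and otherwise behaves like plain UTF-8.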

    Given all the misinformation on this thread, I direct your attention to the FAQ:

    http://www.unicode.org/faq/utf_bom.html#BOM

    Addison P. Phillips
    Director, Globalization Architecture
    http://www.webMethods.com

    Chair, W3C Internationalization Working Group
    http://www.w3.org/International

    Internationalization is an architecture.
    It is not a feature.

    > -----Original Message-----
    > From: unicode-bounce@unicode.org
    > [mailto:unicode-bounce@unicode.org]On Behalf Of Hans Aberg
    > Sent: January 20, 2005 10:17
    > To: cfynn@gmx.net; Unicode List
    > Subject: Re: UTF-8 'BOM'
    >
    >
    > On 2005/01/20 14:14, Christopher Fynn at cfynn@gmx.net wrote:
    >
    > > Hans Aberg wrote:
    > >
    > >
    > >> It is much better if the BOM is illegal in UTF-8. It does not prevent
    > >> MS to use it, instead labelling it as a file format marker for MS text
    > >> files. A program that then deals with MS text files must then know about
    > >> the BOM and remove it when and if appropriate. At the same time, it does
    > >> not cause any problems for programs that normally do not handle MS text
    > >> files but only plain text: They are fine as they are. Everyone should be
    > >> able to be happy.
    > >
    > > Since BOM is a valid Unicode & ISO 10646 character and UTF-8 is a
    > > transformation format of Unicode & 10646, if BOM were illegal in UTF-8
    > > it couldn't be used for *all* Unicode characters.
    >
    > The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
    > appearing as though in UTF-16. 0xFEFF is Unicode number, and could be
    > still translated into UTF-8. So the BOM in UTF-8 is a really strange
    > animal.
    >


