RE: UTF-8 'BOM'

From: Addison Phillips [wM] ([email protected])
Date: Thu Jan 20 2005 - 13:10:29 CST

Next message: Richard T. Gillam: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Antoine Leca: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Hans Aberg: "Re: UTF-8 'BOM'"
Next in thread: Hans Aberg: "Re: UTF-8 'BOM'"
Reply: Hans Aberg: "Re: UTF-8 'BOM'"

> The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
> appearing as though in UTF-16. 0xFEFF is Unicode number, and could be still
> translated into UTF-8. So the BOM in UTF-8 is a really strange animal.

I hesitate to feed the thread, but what the heck.

The quoted paragraph is confusingly written, but as far as I can tell it is wrong.

The Unicode scalar value (for the BOM character) is U+FEFF. In UTF-8 this is encoded as the byte sequence:

0xEF 0xBB 0xBF

This is the byte sequence that Notepad writes at the start of UTF-8 files saved from that editor.
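
To make the arithmetic concrete, here is a minimal Python sketch of the encoding described above. It is an illustration only: the helper names utf8_three_byte and strip_utf8_bom are made up for this example, and the only library facilities assumed are str.encode and the codecs.BOM_UTF8 constant from the standard library.

```python
import codecs

BOM_CHAR = "\ufeff"  # the character U+FEFF


def utf8_three_byte(cp: int) -> bytes:
    """Hand-encode a code point in U+0800..U+FFFF as three UTF-8 bytes.

    Hypothetical helper for illustration; surrogates are rejected because
    they are not Unicode scalar values.
    """
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a three-byte UTF-8 scalar value")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),         # 10xxxxxx: low 6 bits
    ])


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading EF BB BF signature if present (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


encoded = BOM_CHAR.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in encoded))  # prints: 0xEF 0xBB 0xBF

# The hand-rolled encoder, the built-in codec, and the stdlib constant agree.
assert utf8_three_byte(0xFEFF) == encoded == codecs.BOM_UTF8

# The bytes FE FF are what U+FEFF looks like in big-endian UTF-16, not UTF-8.
assert BOM_CHAR.encode("utf-16-be") == b"\xfe\xff"
```

So the signature at the start of such a file is simply U+FEFF run through the ordinary UTF-8 encoding, not the UTF-16 bytes dropped in verbatim.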

Given all the misinformation on this thread, I direct your attention to the FAQ:

http://www.unicode.org/faq/utf_bom.html#BOM

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Hans Aberg
> Sent: January 20, 2005 10:17
> To: [email protected]; Unicode List
> Subject: Re: UTF-8 'BOM'
>
>
> On 2005/01/20 14:14, Christopher Fynn at [email protected] wrote:
>
> > Hans Aberg wrote:
> >
> >
> >> It is much better if the BOM is illegal in UTF-8. It does not prevent MS to
> >> use it, instead labelling it as a file format marker for MS text files. A
> >> program that then deals with MS text files must then know about the BOM and
> >> remove it when and if appropriate. At the same time, it does not cause any
> >> problems for programs that normally do not handle MS text files but only
> >> plain text: They are fine as they are. Everyone should be able to be happy.
> >
> > Since BOM is a valid Unicode & ISO 10646 character and UTF-8 is a
> > transformation format of Unicode & 10646, if BOM were illegal in UTF-8
> > it couldn't be used for *all* Unicode characters.
>
> The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
> appearing as though in UTF-16. 0xFEFF is Unicode number, and could be still
> translated into UTF-8. So the BOM in UTF-8 is a really strange animal.
>

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 13:15:33 CST