Re: UTF-8 'BOM'

From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Thu Jan 20 2005 - 12:38:05 CST

Next message: Antoine Leca: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Rick McGowan: "Re: Subject: Re: 32'nd bit & UTF-8"
Maybe in reply to: Marcin 'Qrczak' Kowalczyk: "Re: UTF-8 'BOM'"
Next in thread: Hans Aberg: "Re: UTF-8 'BOM'"
Reply: Hans Aberg: "Re: UTF-8 'BOM'"

On Thu, 20 Jan 2005 19:16:30 +0100, Hans Aberg wrote:
>
> On 2005/01/20 14:14, Christopher Fynn at cfynn@gmx.net wrote:
>
> > Hans Aberg wrote:
> >
> >
> >> It is much better if the BOM is illegal in UTF-8. It does not prevent MS from
> >> using it, instead labelling it as a file format marker for MS text files. A
> >> program that then deals with MS text files must then know about the BOM and
> >> remove it when and if appropriate. At the same time, it does not cause any
> >> problems for programs that normally do not handle MS text files but only
> >> plain text: They are fine as they are. Everyone should be able to be happy.
> >
> > Since BOM is a valid Unicode & ISO 10646 character and UTF-8 is a
> > transformation format of Unicode & 10646, if BOM were illegal in UTF-8
> > it couldn't be used for *all* Unicode characters.
>
> The BOM in UTF-8 is not the 0xFEFF UTF-8 encoded number, but 0xFEFF
> appearing as though in UTF-16. 0xFEFF is a Unicode number, and could still be
> translated into UTF-8. So the BOM in UTF-8 is a really strange animal.
>

The BOM that Notepad and other Windows applications write at the start of
UTF-8 files is the byte sequence 0xEF 0xBB 0xBF, which is the UTF-8 encoding
of the valid Unicode character U+FEFF, so no process that claims to handle
UTF-8 files should have any problem with it. If you do get the two bytes
0xFE 0xFF at the start of (or anywhere in) a UTF-8 file, then that IS very
wrong ... but I've never seen such an animal.
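
For anyone who wants to check the byte-level claim for themselves, here is a
minimal Python sketch. It is only an illustration: the helper names
strip_utf8_bom and is_valid_utf8 are made up for this example and are not from
any tool mentioned in this thread. It shows that U+FEFF encodes to EF BB BF in
UTF-8, and that a strict UTF-8 decoder rejects the raw bytes FE FF.

```python
# U+FEFF is an ordinary Unicode code point, so UTF-8 can encode it.
BOM_CHAR = "\ufeff"
UTF8_BOM = BOM_CHAR.encode("utf-8")  # the three bytes EF BB BF


def strip_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop a leading EF BB BF if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


with_bom = UTF8_BOM + b"hello"       # what a BOM-writing editor would produce
raw_utf16_bom = b"\xfe\xffhello"     # FE FF pasted in front of UTF-8 text

print(UTF8_BOM.hex())                # hex digits of the encoded U+FEFF
print(is_valid_utf8(with_bom))       # well-formed: U+FEFF followed by text
print(is_valid_utf8(raw_utf16_bom))  # 0xFE and 0xFF never occur in UTF-8
print(strip_utf8_bom(with_bom))      # the text without the leading BOM
```

Python's standard "utf-8" codec keeps the U+FEFF as a character in the decoded
string, while the "utf-8-sig" codec removes a leading one, so a program that
wants to ignore the marker can decode with the latter instead of stripping
bytes by hand.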

Andrew

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 12:38:45 CST