Re: UTF-8 'BOM'

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 12:16:30 CST

    On 2005/01/20 14:14, Christopher Fynn at cfynn@gmx.net wrote:

    > Hans Aberg wrote:
    >
    >
    >> It is much better if the BOM is illegal in UTF-8. That does not prevent MS
    >> from using it; MS can instead label it as a file format marker for MS text
    >> files. A program that deals with MS text files must then know about the BOM
    >> and remove it when and if appropriate. At the same time, it does not cause any
    >> problems for programs that normally do not handle MS text files but only
    >> plain text: They are fine as they are. Everyone should be able to be happy.
    >
    > Since BOM is a valid Unicode & ISO 10646 character and UTF-8 is a
    > transformation format of Unicode & 10646, if BOM were illegal in UTF-8
    > it couldn't be used for *all* Unicode characters.

    The BOM in UTF-8 is not the number 0xFEFF encoded in UTF-8, but 0xFEFF
    appearing as though in UTF-16. 0xFEFF is a Unicode number, and could still be
    translated into UTF-8. So the BOM in UTF-8 is a really strange animal.
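
    To make the byte sequences under discussion concrete, here is a minimal
    Python sketch. It shows what a standard encoder produces for the code
    point U+FEFF in UTF-8 and in UTF-16, and how a program that reads MS-style
    text files might drop a leading signature. The helper name
    strip_utf8_signature is hypothetical, chosen only for illustration.

```python
import codecs

# The code point under discussion (U+FEFF).
BOM_CHAR = "\ufeff"

# Byte sequences a standard encoder produces for U+FEFF.
utf8_bytes = BOM_CHAR.encode("utf-8")          # bytes EF BB BF
utf16_be_bytes = BOM_CHAR.encode("utf-16-be")  # bytes FE FF
utf16_le_bytes = BOM_CHAR.encode("utf-16-le")  # bytes FF FE


def strip_utf8_signature(data: bytes) -> bytes:
    """Hypothetical helper: drop one leading EF BB BF, if present.

    Assumption: the caller has already decided that the input is a
    text file that may carry the signature; other bytes are untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    print(utf8_bytes.hex(), utf16_be_bytes.hex(), utf16_le_bytes.hex())
    print(strip_utf8_signature(codecs.BOM_UTF8 + b"plain text"))
```

    A program that only handles plain text would simply never call such a
    helper, so any leading EF BB BF would stay in the data as an ordinary
    U+FEFF character.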


