RE: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (
Date: Fri Jan 21 2005 - 16:41:27 CST

    Andy Heninger wrote:


    In this case, I am the user, since I use the C language to write software.
    Sorry for the ambiguity in my response.

    > Text files should be opened in text mode;
    > binary files should be opened in binary mode.
    > So says the applicable standards.

    I don't know much about the standards, but I suspect the standards are not
    prescribing how to open files. They simply define _standard_ ways of doing it
    and _standard_ ways of specifying what to do.

    It is a pity one needs to decide on the type of the file before opening it.
    Apart from the extension (which is very unreliable) and the application's
    expectations, there is no way to tell what the file really contains. Only
    when you open it can you start determining what it is. Sometimes there is a
    solution for that, but not always. And even when there is one, it is
    typically costly.
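
    To illustrate the point about only being able to tell after opening: a
    rough sketch in C that opens the file in binary mode, samples the first
    bytes, and guesses from those. The names guess_kind and sniff_file, the
    512-byte sample size, and the "a NUL byte means binary" rule are all
    assumptions made for this illustration, not anything the standards
    define. The rule is also wrong for UTF-16 text, which is part of why
    there is not always a solution.

```c
#include <stddef.h>
#include <stdio.h>

enum guess { GUESS_EMPTY, GUESS_TEXT, GUESS_BINARY };

/* Heuristic only (an assumption of this sketch): any NUL byte in the
 * sample is taken as a sign of binary data. UTF-16 text will be
 * misclassified as binary by this rule. */
enum guess guess_kind(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len == 0)
        return GUESS_EMPTY;
    for (i = 0; i < len; i++) {
        if (buf[i] == 0x00)
            return GUESS_BINARY;
    }
    return GUESS_TEXT;
}

/* Hypothetical helper: opens the file in binary mode so the bytes arrive
 * untranslated, samples up to 512 bytes, and returns the guess.
 * Returns -1 if the file cannot be opened. */
int sniff_file(const char *path)
{
    unsigned char sample[512];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(sample, 1, sizeof sample, fp);
    fclose(fp);
    return (int)guess_kind(sample, n);
}
```

    Note that the file still has to be opened once in binary mode just to
    decide how it should have been opened, which is the cost mentioned above.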

    Then there are other problems. You could argue perhaps that it is the
    application's expectation that counts. Well, I've wasted a lot of paper and
    time whenever I forgot to specify the /b in the copy command directed to the
    LPT. And it is just an example. There are many other similar problems. So
    many that I've started to like UNIX, even though I grew up with Microsoft.

    So, your philosophy is to distinguish text and binary data. Someone else's
    philosophy is to not do so. And they both work. And the two of you should
    agree that you disagree, but should both be given an equal chance to learn
    whether you're right or not.

    And this is where the Unicode standard is right. It allows the BOM in UTF-8
    but does not prescribe it. Where the UTC is not right is ... oh well, I've
    said it too many times already.
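
    Since the BOM is allowed in UTF-8 but not required, a reader has to cope
    with both cases. A minimal sketch in C, assuming the file is opened in
    binary mode and that the only signature of interest is the UTF-8 one
    (bytes EF BB BF). The function names has_utf8_bom and open_skipping_bom
    are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the buffer starts with the UTF-8 signature EF BB BF,
 * 0 otherwise (including when fewer than 3 bytes are available). */
int has_utf8_bom(const unsigned char *buf, size_t len)
{
    return len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF;
}

/* Hypothetical helper: opens the file in binary mode and leaves the
 * stream positioned just after the BOM if one is present, or at the
 * very start if not. Returns NULL on failure. */
FILE *open_skipping_bom(const char *path)
{
    unsigned char head[3];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return NULL;
    n = fread(head, 1, sizeof head, fp);
    if (!has_utf8_bom(head, n)) {
        /* No signature: go back so the caller sees every byte. */
        if (fseek(fp, 0L, SEEK_SET) != 0) {
            fclose(fp);
            return NULL;
        }
    }
    return fp;
}
```

    Either way the caller gets the same content, which is what "allowed but
    not prescribed" amounts to in practice.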

