RE: Subject: Re: 32'nd bit & UTF-8

From: Martin Duerst (
Date: Mon Jan 24 2005 - 02:24:31 CST

    At 03:18 05/01/20, Oliver Christ wrote:
    >> UTF-8 BOM's seem pointless.
    >On the very contrary. It's most helpful to determine a text file's
    >encoding. Without the UTF8 BOM it's hard to tell whether a file is
    >encoded in some ISO or whatever encoding/codepage or is already UTF8.
    >I'm grateful every day that .Net by default prefixes UTF8-encoded text
    >files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
    >standard or at least be generally applied best practice. It simplifies
    >at least part of the problem if you have to deal with thousands of files
    >(or char strings [such as file names ;-) ], for that matter) of which
    >you don't know the encoding.
    >I agree that "byte order" is misleading in the case of UTF8 but in
    >practice it's a blessing.

    Two issues here:

    1) The BOM only allows one to distinguish between UTF-8 and ONE legacy encoding.
        When transitioning from a local legacy encoding to UTF-8 in a limited
        context, this will work. In a wider context (e.g. as soon as you have
        to deal with more than one legacy encoding, e.g. iso-8859-1 and
        iso-8859-2), it won't work anymore.

    2) It's not true that the BOM is necessary to determine that a file is
        encoded as UTF-8. UTF-8 byte patterns are extremely specific, and extremely
        rare in any other encoding. For details, see
        This may occasionally not work well for very small pieces of text,
        but then I'm not aware of tools using the UTF-8 BOM for such small
        pieces either.
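
    As a rough illustration of point 2, here is a minimal Python sketch of
    detection by byte pattern alone, without any BOM. The function name
    classify and its three return labels are my own assumptions for the
    example, not taken from any particular tool. It relies only on the
    strict UTF-8 decoder in the standard library.

```python
def classify(data: bytes) -> str:
    """Guess the encoding family of data from its byte patterns alone.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    # Pure ASCII is valid UTF-8, but it carries no evidence either way.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decoding rejects any malformed UTF-8 sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


print(classify("Dürst".encode("utf-8")))       # valid multi-byte sequence
print(classify("Dürst".encode("iso-8859-1")))  # 0xFC cannot occur in UTF-8
print(classify(b"plain text"))                 # no bytes above 0x7F
# Very short legacy text can be misjudged: these two ISO-8859-1
# characters happen to form one valid UTF-8 sequence.
print(classify("Ã©".encode("iso-8859-1")))
```

    The last call shows the small-text caveat: two legacy bytes that happen
    to look like UTF-8 are accepted. The longer the text, the less likely
    such an accident becomes.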

    I think that notepad actually is able to use the UTF-8 detection
    algorithm, at least in some cases, but I might be wrong. If it doesn't
    yet, it would be great if a future version did. It would also be great
    if a future version provided a setting to be able to not produce a BOM.
    Other programs do this already; the example I use is (not very
    imaginatively) called notepad2 (see
    Besides being able to switch off the BOM for UTF-8, it also remembers
    the encoding used to save files. That's of course again not a complete
    solution for knowing the file encoding, but it also helps a lot.
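
    For programs that have to cope with files both with and without the
    BOM, a sketch along the following lines may help. The helper name
    strip_utf8_bom is an assumption made up for this example;
    codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python
    standard library.

```python
import codecs
from typing import Tuple


def strip_utf8_bom(data: bytes) -> Tuple[bytes, bool]:
    """Return data without a leading UTF-8 BOM, plus whether one was found.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return data[len(codecs.BOM_UTF8):], True
    return data, False


payload, had_bom = strip_utf8_bom(b"\xef\xbb\xbfabc")
print(payload, had_bom)

# Writing side: "utf-8" emits no BOM, "utf-8-sig" prepends one.
print("é".encode("utf-8"))
print("é".encode("utf-8-sig"))
# Reading side: "utf-8-sig" accepts input with or without the BOM.
print(b"\xef\xbb\xbf\xc3\xa9".decode("utf-8-sig"))
```

    Remembering the boolean alongside the file lets an editor write the
    file back the way it found it, rather than adding or dropping the BOM
    silently.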

    Regards, Martin.
