From: Martin Duerst (firstname.lastname@example.org)
Date: Mon Jan 24 2005 - 02:24:31 CST
At 03:18 05/01/20, Oliver Christ wrote:
>> UTF-8 BOM's seem pointless.
>On the very contrary. It's most helpful to determine a text file's
>encoding. Without the UTF8 BOM it's hard to tell whether a file is
>encoded in some ISO or whatever encoding/codepage or is already UTF8.
>I'm grateful every day that .Net by default prefixes UTF8-encoded text
>files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
>standard or at least be generally applied best practice. It simplifies
>at least part of the problem if you have to deal with thousands of files
>(or char strings [such as file names ;-) ], for that matter) of which
>you don't know the encoding.
>I agree that "byte order" is misleading in the case of UTF8 but in
>practice it's a blessing.
Two issues here:
1) The BOM only lets you distinguish between UTF-8 and ONE legacy encoding.
When transitioning from a local legacy encoding to UTF-8 in a limited
context, this will work. In a wider context (i.e. as soon as you have
to deal with more than one legacy encoding, such as iso-8859-1 and
iso-8859-2), it won't work anymore.
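
To illustrate point 1, here is a minimal Python sketch (my own example, not
taken from any particular tool; the helper name has_utf8_bom is made up).
It shows that a missing BOM only says "not marked as UTF-8"; it says nothing
about which legacy encoding the bytes are in, since the same byte is valid
in both iso-8859-1 and iso-8859-2 but means something different.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


# A file written with a BOM is recognisable as UTF-8.
marked = codecs.BOM_UTF8 + "text".encode("utf-8")

# A single legacy byte without a BOM: the BOM check alone cannot say
# whether it is iso-8859-1 or iso-8859-2.
legacy = b"\xb1"

if __name__ == "__main__":
    print(has_utf8_bom(marked))            # BOM present
    print(has_utf8_bom(legacy))            # no BOM, encoding still unknown
    print(legacy.decode("iso-8859-1"))     # PLUS-MINUS SIGN
    print(legacy.decode("iso-8859-2"))     # LATIN SMALL LETTER A WITH OGONEK
```

So the BOM answers a yes/no question; with two or more legacy encodings in
play, the "no" case still needs some other source of information.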
2) It's not true that the BOM is necessary to determine that a file is
encoded as UTF-8. UTF-8 byte patterns are extremely specific, and extremely
rare in any other encoding. For details, see
This may occasionally not work well for very small pieces of text,
but then I'm not aware of tools using the UTF-8 BOM for such small
pieces of text.
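
As a rough sketch of what such detection can look like (an assumption on my
part, not the algorithm of any specific program; the function name
classify_bytes is hypothetical), one can simply try a strict UTF-8 decode in
Python. Pure ASCII is reported separately, because it is valid UTF-8 but
also valid in most legacy encodings.

```python
def classify_bytes(data: bytes) -> str:
    """Classify data as 'ascii', 'utf-8' or 'other' by strict decoding."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # A strict decode rejects stray lead bytes, missing continuation
        # bytes, overlong forms and surrogates.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


if __name__ == "__main__":
    name = "D\u00fcrst"
    print(classify_bytes(name.encode("utf-8")))       # well-formed UTF-8
    print(classify_bytes(name.encode("iso-8859-1")))  # lone 0xFC byte
    print(classify_bytes(b"plain text"))              # ASCII only
```

The caveat about very small texts shows up here too: the two bytes C3 A9
decode as valid UTF-8, but they are also a legal (if unlikely) two-character
sequence in iso-8859-1, so a very short sample proves less than a long one.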
I think that notepad actually is able to use the UTF-8 detection
algorithm, at least in some cases, but I might be wrong. If it doesn't
yet, it would be great if a future version did. It would also be great
if a future version provided a setting to be able to not produce a BOM.
Other programs do this already; the example I use is (not very
imaginatively) called notepad2 (see http://www.flos-freeware.ch/notepad2.html).
Besides being able to switch off the BOM for UTF-8, it also remembers
the encoding used to save files. That's of course again not a complete
solution for knowing the file encoding, but it also helps a lot.