From: Martin Duerst (duerst@w3.org)
Date: Mon Jan 24 2005 - 02:24:31 CST
At 03:18 05/01/20, Oliver Christ wrote:
>
>> UTF-8 BOM's seem pointless.
>
>On the very contrary. It's most helpful to determine a text file's
>encoding. Without the UTF8 BOM it's hard to tell whether a file is
>encoded in some ISO or whatever encoding/codepage or is already UTF8.
>I'm grateful every day that .Net by default prefixes UTF8-encoded text
>files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
>standard or at least be generally applied best practice. It simplifies
>at least part of the problem if you have to deal with thousands of files
>(or char strings [such as file names ;-) ], for that matter) of which
>you don't know the encoding.
>
>I agree that "byte order" is misleading in the case of UTF8 but in
>practice it's a blessing.
Two issues here:
1) The BOM only lets you distinguish between UTF-8 and ONE legacy encoding.
When transitioning from a local legacy encoding to UTF-8 in a limited
context, this will work. In a wider context (i.e. as soon as you have
to deal with more than one legacy encoding, such as iso-8859-1 and
iso-8859-2), it won't work anymore.
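
As a minimal illustration of point 1 (a Python sketch; the sample byte is made up for the purpose), the same BOM-less data decodes without error under both iso-8859-1 and iso-8859-2, so the absence of a BOM says nothing about which legacy encoding was meant:

```python
# Hypothetical sample: a single byte that is valid in both legacy encodings.
sample = b"\xb1"

as_latin1 = sample.decode("iso-8859-1")  # U+00B1 PLUS-MINUS SIGN
as_latin2 = sample.decode("iso-8859-2")  # U+0105 LATIN SMALL LETTER A WITH OGONEK

# Both decodes succeed, yet they disagree about what the text is.
print(repr(as_latin1), repr(as_latin2))
```
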
2) It's not true that the BOM is necessary to determine that a file is
encoded as UTF-8. UTF-8 byte patterns are extremely specific, and extremely
rare in any other encoding. For details, see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
This may occasionally not work well for very small pieces of text,
but then I'm not aware of tools using the UTF-8 BOM for such small
pieces either.
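
The idea in point 2 can be sketched roughly as follows. This is only an assumption-laden illustration in Python, not the algorithm from the paper above: it relies on Python's strict UTF-8 decoder to reject byte sequences that are not well-formed UTF-8, and the function name guess_encoding is invented for the example.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def guess_encoding(data: bytes) -> str:
    """Classify data by byte pattern alone (hypothetical helper)."""
    if data.startswith(UTF8_BOM):
        return "utf-8 with BOM"
    if all(b < 0x80 for b in data):
        # Pure ASCII reads the same in UTF-8 and in the iso-8859-x family.
        return "ascii"
    try:
        # Strict decoding fails on any malformed multi-byte sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"

print(guess_encoding("café".encode("utf-8")))       # non-ASCII, well-formed UTF-8
print(guess_encoding("café".encode("iso-8859-1")))  # lone 0xE9 is malformed UTF-8
```

A very short legacy-encoded text can still happen to form valid UTF-8 (for example the two iso-8859-1 bytes 0xC3 0xA9), which is the small-text caveat mentioned above.
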
I think that Notepad actually is able to use the UTF-8 detection
algorithm, at least in some cases, but I might be wrong. If it doesn't
yet, it would be great if a future version did. It would also be great
if a future version provided a setting to not produce a BOM.
Other programs do this already; the example I use is (not very
imaginatively) called notepad2 (see http://www.flos-freeware.ch/notepad2.html).
Besides being able to switch off the BOM for UTF-8, it also remembers
the encoding used to save files. That's of course again not a complete
solution for knowing the file encoding, but it also helps a lot.
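
For comparison, here is how the "BOM or no BOM" choice looks in Python, offered only as an example of such a setting and not as a description of how notepad2 or Notepad work internally. The standard codec "utf-8" writes no BOM, while "utf-8-sig" writes one; the sample string is made up.

```python
text = "na\u00efve"  # hypothetical sample text containing one non-ASCII letter

without_bom = text.encode("utf-8")      # plain UTF-8, no signature
with_bom = text.encode("utf-8-sig")     # same bytes, prefixed with EF BB BF

# Decoding with "utf-8-sig" strips a leading BOM if present and
# also accepts input that has none, so it can read both variants.
print(without_bom, with_bom)
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))
```
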
Regards, Martin.
This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 19:27:35 CST