RE: Subject: Re: 32'nd bit & UTF-8

From: Martin Duerst (duerst@w3.org)
Date: Mon Jan 24 2005 - 02:24:31 CST

    At 03:18 05/01/20, Oliver Christ wrote:
    >
    >> UTF-8 BOM's seem pointless.
    >
    >On the very contrary. It's most helpful to determine a text file's
    >encoding. Without the UTF8 BOM it's hard to tell whether a file is
    >encoded in some ISO or whatever encoding/codepage or is already UTF8.
    >I'm grateful every day that .Net by default prefixes UTF8-encoded text
    >files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
    >standard or at least be generally applied best practice. It simplifies
    >at least part of the problem if you have to deal with thousands of files
    >(or char strings [such as file names ;-) ], for that matter) of which
    >you don't know the encoding.
    >
    >I agree that "byte order" is misleading in the case of UTF8 but in
    >practice it's a blessing.

    Two issues here:

    1) The BOM only lets you distinguish between UTF-8 and ONE legacy encoding.
        When transitioning from a local legacy encoding to UTF-8 in a limited
        context, this will work. In a wider context (i.e. as soon as you have
        to deal with more than one legacy encoding, such as iso-8859-1 and
        iso-8859-2), it won't work anymore.

    2) It's not true that the BOM is necessary to determine that a file is
        encoded as UTF-8. UTF-8 byte patterns are extremely specific, and extremely
        rare in any other encoding. For details, see
        http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
        This may occasionally not work well for very small pieces of text,
        but then I'm not aware of tools using the UTF-8 BOM for such small
        pieces either.
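
    Both points can be illustrated with a minimal sketch in Python. The
    function name looks_like_utf8 and the sample strings are made up for
    illustration; this is a simple well-formedness check, not necessarily
    the detection method described in the paper linked above.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data is well-formed UTF-8 and not plain ASCII."""
    try:
        data.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but it is equally valid in most legacy
    # encodings, so it is no evidence either way.
    return any(byte >= 0x80 for byte in data)


sample = "Dürst naïve café"
utf8_bytes = sample.encode("utf-8")         # no BOM
latin1_bytes = sample.encode("iso-8859-1")  # legacy encoding

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))
print(looks_like_utf8(b"plain ascii"))

# Point 1: a missing BOM says nothing about WHICH legacy encoding is in use.
# The same byte is a valid character in both iso-8859-1 and iso-8859-2.
ambiguous = b"\xe6"
print(ambiguous.decode("iso-8859-1"), ambiguous.decode("iso-8859-2"))
```

    The iso-8859-1 bytes fail the check because a lone 0xFC is never a
    valid UTF-8 lead byte, while the last line shows that one byte reads
    as two different letters depending on the legacy encoding assumed.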

    I think that notepad actually is able to use the UTF-8 detection
    algorithm, at least in some cases, but I might be wrong. If it doesn't
    yet, it would be great if a future version did. It would also be great
    if a future version provided a setting for not producing a BOM.
    Other programs do this already; the example I use is (not very
    imaginatively) called notepad2 (see http://www.flos-freeware.ch/notepad2.html).
    Besides being able to switch off the BOM for UTF-8, it also remembers
    the encoding used to save files. That's of course again not a complete
    solution for knowing the file encoding, but it also helps a lot.
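
    For readers who handle this in their own code rather than in an
    editor, here is a minimal Python sketch of accepting a BOM on input
    while not producing one on output. The helper name strip_utf8_bom is
    made up for illustration; codecs.BOM_UTF8 and the "utf-8-sig" codec
    are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "Grüße".encode("utf-8")
without_bom = "Grüße".encode("utf-8")

# The "utf-8-sig" codec accepts input with or without a BOM when decoding.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8-sig"))

# Encoding with plain "utf-8" never writes a BOM; "utf-8-sig" always does.
print("Grüße".encode("utf-8").startswith(codecs.BOM_UTF8))
print("Grüße".encode("utf-8-sig").startswith(codecs.BOM_UTF8))
```

    Reading with "utf-8-sig" and writing with "utf-8" gives the behaviour
    asked for above: a BOM is tolerated but never added.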

    Regards, Martin.


