RE: Subject: Re: 32'nd bit & UTF-8

From: Martin Duerst ([email protected])
Date: Mon Jan 24 2005 - 02:24:31 CST

Next message: Martin Duerst: "<<NONCHAR>> for flex (was: Re: 32'nd bit & UTF-8)"

Previous message: Martin Duerst: "RE: Subject: Re: 32'nd bit & UTF-8"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"

At 03:18 05/01/20, Oliver Christ wrote:
>
>> UTF-8 BOM's seem pointless.
>
>On the very contrary. It's most helpful to determine a text file's
>encoding. Without the UTF8 BOM it's hard to tell whether a file is
>encoded in some ISO or whatever encoding/codepage or is already UTF8.
>I'm grateful every day that .Net by default prefixes UTF8-encoded text
>files with a UTF8 BOM, and IMO the UTF8 BOM should be part of the
>standard or at least be generally applied best practice. It simplifies
>at least part of the problem if you have to deal with thousands of files
>(or char strings [such as file names ;-) ], for that matter) of which
>you don't know the encoding.
>
>I agree that "byte order" is misleading in the case of UTF8 but in
>practice it's a blessing.

Two issues here:

1) The BOM only lets you distinguish between UTF-8 and ONE legacy encoding.
    When transitioning from a local legacy encoding to UTF-8 in a limited
    context, this will work. In a wider context (e.g. as soon as you have
    to deal with more than one legacy encoding, e.g. iso-8859-1 and
    iso-8859-2), it won't work anymore.

2) It's not true that the BOM is necessary to determine that a file is
    encoded as UTF-8. UTF-8 byte patterns are extremely specific, and extremely
    rare in any other encoding. For details, see
    http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf.
    This may occasionally not work well for very small pieces of text,
    but then I'm not aware of tools using the UTF-8 BOM for such small
    pieces either.
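
As a rough illustration of point 2, the check can be sketched in Python
as below. This is only a minimal sketch, not the algorithm from the paper
cited above: the function name guess_encoding and its three return values
are made up for illustration, and it simply relies on Python's strict
UTF-8 decoder rejecting malformed byte sequences.

```python
def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: classify bytes as 'ascii', 'utf-8', or 'unknown'."""
    try:
        # Python's UTF-8 decoder is strict by default: stray continuation
        # bytes, truncated sequences, and overlong forms all raise an error.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not well-formed UTF-8, so presumably some legacy encoding.
        return "unknown"
    # Pure ASCII decodes fine as UTF-8 but carries no evidence either way.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Non-ASCII bytes that form only valid UTF-8 sequences.
    return "utf-8"


print(guess_encoding("caf\u00e9".encode("utf-8")))       # bytes 63 61 66 C3 A9
print(guess_encoding("caf\u00e9".encode("iso-8859-1")))  # bytes 63 61 66 E9
```

The caveat about very small pieces of text shows up here too: the two
iso-8859-1 bytes C3 A9 (A-tilde followed by the copyright sign) are also
valid UTF-8, so a short string can be misjudged. The more non-ASCII text
there is, the less likely such a coincidence becomes.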

I think that notepad actually is able to use the UTF-8 detection
algorithm, at least in some cases, but I might be wrong. If it doesn't
yet, it would be great if a future version did. It would also be great
if a future version provided a setting to not produce a BOM.
Other programs do this already; the example I use is (not very
imaginatively) called notepad2 (see http://www.flos-freeware.ch/notepad2.html).
Besides being able to switch off the BOM for UTF-8, it also remembers
the encoding used to save files. That's of course again not a complete
solution for knowing the file encoding, but it also helps a lot.
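
For comparison, here is a small sketch of what such a "BOM on/off"
setting amounts to, using Python's standard "utf-8" and "utf-8-sig"
codecs. The helper names encode_utf8 and decode_utf8 are hypothetical,
and this says nothing about how notepad or notepad2 are implemented.

```python
import codecs


def encode_utf8(text: str, bom: bool = False) -> bytes:
    """Hypothetical helper: encode as UTF-8, optionally prefixed with a BOM."""
    # "utf-8-sig" writes the three-byte signature EF BB BF before the data;
    # plain "utf-8" writes no signature.
    return text.encode("utf-8-sig" if bom else "utf-8")


def decode_utf8(data: bytes) -> str:
    """Hypothetical helper: decode UTF-8, dropping a leading BOM if present."""
    # On decoding, "utf-8-sig" skips the signature when it is there and
    # otherwise behaves like plain "utf-8".
    return data.decode("utf-8-sig")


text = "Z\u00fcrich"
plain = encode_utf8(text)
signed = encode_utf8(text, bom=True)
assert signed == codecs.BOM_UTF8 + plain
assert decode_utf8(plain) == decode_utf8(signed) == text
```

A reader that decodes this way accepts files from both kinds of writers,
so switching the BOM off on output does not have to break input.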

Regards, Martin.

This archive was generated by hypermail 2.1.5 : Mon Jan 24 2005 - 19:27:35 CST