UTF-8 text files

From: Lasse Kärkkäinen / Tronic (tronic2@sci.fi)
Date: Fri Jun 03 2005 - 08:26:50 CDT



    UTF-8 can be ASCII compatible, but using a BOM breaks this. I have found
    that some text editors write a BOM into every UTF-8 text file and some
    don't, but none of them let the user choose. Those that use it also tend
    to rely on it for identifying the encoding, instead of checking the data
    for malformed UTF-8 and then assuming some 8-bit encoding, using the
    system locale, or simply asking the user. In practice, autodetection by
    malformed UTF-8 seems to work quite reliably and very rarely misdetects
    legacy 8-bit text as UTF-8 (in fact, I have never seen this happen).
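    The autodetection described above can be sketched as follows (a minimal
    Python sketch; the helper name looks_like_utf8 is my own, not from any
    particular editor):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8.

    Hypothetical detection helper: treat the file as UTF-8 only if it
    contains no malformed sequences; otherwise fall back to a legacy
    8-bit encoding, the system locale, or asking the user.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Valid UTF-8 (including plain ASCII, which is a subset) passes:
assert looks_like_utf8("Kärkkäinen".encode("utf-8"))
assert looks_like_utf8(b"plain ASCII")

# The same non-ASCII text in Latin-1 contains byte sequences that are
# malformed as UTF-8, so it is (correctly) not detected as UTF-8:
assert not looks_like_utf8("Kärkkäinen".encode("latin-1"))
```

    Misdetection would require a legacy 8-bit file whose non-ASCII bytes
    happen to form only valid UTF-8 sequences, which is rare in practice.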

    While a BOM serves as a good way of identifying a file's encoding (or
    would, if everyone actually used it), it also causes significant trouble
    for applications that handle the files as ASCII. Using a BOM in a shell
    script, for example, is not possible: the file must begin with the
    characters #!/, not something else. Using UTF-8 elsewhere inside the
    script, on the other hand, would be perfectly valid.
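    The shell script problem can be demonstrated at the byte level (a small
    Python sketch; the kernel only runs a file as a script if its very first
    bytes are the "#!" magic):

```python
# The UTF-8 byte order mark is the three bytes EF BB BF.
BOM = b"\xef\xbb\xbf"

script = "#!/bin/sh\necho 'Kärkkäinen'\n".encode("utf-8")

# Without a BOM the file starts with the "#!" magic, so the kernel
# can find and run the interpreter named on the first line:
assert script.startswith(b"#!")

# UTF-8 characters later in the script (past the first line) are fine,
# since the shebang line itself here is pure ASCII.

# With a BOM prepended, the file no longer begins with "#!" and the
# shebang mechanism fails:
assert not (BOM + script).startswith(b"#!")
```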

    My question (or three of them) is: should a BOM generally be used in
    text files or not? Or should everything simply support text files both
    with and without a BOM (so that the user selects which format to write)?
    And which way should a program take when there is no user to make that
    selection (automatic conversion tools, etc.)?

    By text file I refer to a ... Well, text file. Something that you might edit
    with emacs or Notepad, that does not have any character encoding info
    attached to it.

    - Tronic -

    This archive was generated by hypermail 2.1.5 : Fri Jun 03 2005 - 10:44:40 CDT