UTF-8 text files

From: Lasse Kärkkäinen / Tronic (tronic2@sci.fi)
Date: Fri Jun 03 2005 - 08:26:50 CDT



    UTF-8 can be ASCII compatible, but using a BOM breaks this. I have found
    that some text editors write a BOM into every UTF-8 text file and some
    don't, but none of them let the user choose. Those that use it also tend
    to rely on it for identifying the encoding, instead of checking the data
    for malformed UTF-8 and then assuming some 8-bit encoding, using the
    system locale, or simply asking the user. In practice, autodetection by
    malformed UTF-8 seems to work quite reliably and very rarely misdetects
    legacy 8-bit text as UTF-8 (in fact, I have never seen this happen).
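    The autodetection described above can be sketched as follows (a minimal
    Python sketch; the helper name looks_like_utf8 is my own, not from any
    particular editor):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8.

    Hypothetical detection helper: treat the file as UTF-8 only if it
    contains no malformed sequences; otherwise fall back to a legacy
    8-bit encoding, the system locale, or asking the user.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Valid UTF-8 (including plain ASCII, which is a subset) passes:
assert looks_like_utf8("Kärkkäinen".encode("utf-8"))
assert looks_like_utf8(b"plain ASCII")

# The same non-ASCII text in Latin-1 contains byte sequences that are
# malformed as UTF-8, so it is (correctly) not detected as UTF-8:
assert not looks_like_utf8("Kärkkäinen".encode("latin-1"))
```

    Misdetection would require a legacy 8-bit file whose non-ASCII bytes
    happen to form only valid UTF-8 sequences, which is rare in practice.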

    While a BOM serves as a good way of identifying a file's encoding (or
    would, if everyone actually used it), it also causes significant trouble
    for applications that handle the files as ASCII. Using a BOM in a shell
    script, for example, is not possible: the file must begin with the
    characters #!/, not something else. Using UTF-8 elsewhere inside the
    script, on the other hand, would be perfectly valid.
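    The shell script problem can be demonstrated at the byte level (a small
    Python sketch; the kernel only runs a file as a script if its very first
    bytes are the "#!" magic):

```python
# The UTF-8 byte order mark is the three bytes EF BB BF.
BOM = b"\xef\xbb\xbf"

script = "#!/bin/sh\necho 'Kärkkäinen'\n".encode("utf-8")

# Without a BOM the file starts with the "#!" magic, so the kernel
# can find and run the interpreter named on the first line:
assert script.startswith(b"#!")

# UTF-8 characters later in the script (past the first line) are fine,
# since the shebang line itself here is pure ASCII.

# With a BOM prepended, the file no longer begins with "#!" and the
# shebang mechanism fails:
assert not (BOM + script).startswith(b"#!")
```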

    My question (or three of them) is: should a BOM generally be used in
    text files or not? Or should everything simply support text files both
    with and without a BOM (so that the user selects which format to write)?
    And which way should a program take when there is no user to make that
    selection (automatic conversion tools, etc.)?

    By text file I refer to a ... Well, text file. Something that you might edit
    with emacs or Notepad, that does not have any character encoding info
    attached to it.

    - Tronic -

    This archive was generated by hypermail 2.1.5 : Fri Jun 03 2005 - 10:44:40 CDT