Re: UTF-8 text files

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Jun 04 2005 - 16:29:37 CDT

    Lasse Kärkkäinen / Tronic <tronic2 at sci dot fi> wrote:

    > UTF-8 can be ASCII compatible, but using a BOM breaks this. I have
    > found out that some text editors use a BOM in every UTF-8 text file
    > they write and that some don't, but none of them allow the user to
    > choose. Those that use it also tend to use it for identifying the
    > encoding, instead of checking the data for malformed UTF-8 and then
    > assuming some 8-bit encoding, or using system locale, or simply asking
    > the user. In practice the autodetection by malformed UTF-8 seems to
    > work quite reliably and it very rarely misdetects legacy 8-bit as
    > UTF-8 (in fact, I have never seen this happen).

    It's a contrived example, but the string "NESTLÉ™" encoded in
    Windows-1252 (the ™ at 0x99 is not part of Latin-1 proper) consists of
    the bytes 4E 45 53 54 4C C9 99. This is a valid UTF-8 string, and SC
    UniPad detects it as such and renders it as "NESTLə".
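
    The misdetection is easy to reproduce. What follows is only a minimal
    sketch of the "try UTF-8, fall back to an 8-bit encoding" heuristic
    described in the quoted message; the function name guess_decode and
    the cp1252 fallback are assumptions made for illustration, not
    anything UniPad is known to do.

```python
def guess_decode(data, fallback="cp1252"):
    """Hypothetical helper: return (text, encoding_name).

    Accept the bytes as UTF-8 if they are well-formed; otherwise
    assume a legacy 8-bit encoding (cp1252 is an assumption here).
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Some cp1252 byte values are undefined, so replace rather than fail.
        return data.decode(fallback, errors="replace"), fallback


nestle = bytes.fromhex("4E 45 53 54 4C C9 99")

# C9 99 happens to be a well-formed two-byte UTF-8 sequence (U+0259),
# so the heuristic picks UTF-8 and the trademark text is misread.
print(guess_decode(nestle))       # ('NESTLə', 'utf-8')
print(nestle.decode("cp1252"))    # NESTLÉ™

# A lone E9 is not well-formed UTF-8, so the fallback is used.
print(guess_decode(b"caf\xe9"))   # ('café', 'cp1252')
```

    The heuristic only goes wrong when every high byte in the legacy text
    happens to form well-formed UTF-8 sequences, which is why such cases
    are rare.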

    In addition to the Current Options menu item within SC UniPad, as Chris
    Jacobs mentioned, it's also possible to click on a certain section of
    the status bar to toggle between BOM and no-BOM. This is easier once
    you figure out where the section is.

    I hope Sharmahd Computing decides to release an update to UniPad
    someday. The Web site gets periodic minor updates, but the software
    hasn't been updated for over two and a half years.

    > While BOM serves as a good way for identifying file encoding (or
    > would, if everyone actually used it), it also causes significant
    > trouble to applications handling the files as ASCII. Using a BOM in a
    > shell script, for example, is not possible (the file must begin with
    > characters #!/, not something else). Using UTF-8 somewhere inside the
    > script, on the other hand, would be perfectly valid.

    The BOM causes "significant trouble" only in the case of files that
    have a specific signature, like the shell scripts you mentioned, and
    only with applications like shells that are unaware of the BOM and don't
    auto-strip it. (Hint, hint for fellow developers.)
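
    To make the hint concrete, here is a rough sketch of a BOM-aware
    signature check. The names strip_utf8_bom and has_shebang are invented
    for this example; no real shell is claimed to work this way.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def has_shebang(data):
    """Hypothetical signature check that tolerates a leading BOM."""
    return strip_utf8_bom(data).startswith(b"#!")


print(has_shebang(b"\xef\xbb\xbf#!/bin/sh\necho hi\n"))  # True
print(has_shebang(b"#!/bin/sh\necho hi\n"))              # True
print(has_shebang(b"echo hi\n"))                         # False
```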

    > My question (or three of them) is: should a BOM generally be used in
    > text files or not? Or should everything just support text files with
    > and without BOM (so that the user selects which format to write)?
    > Which way to take if there is no user to make that selection
    > (automatic conversion tools, etc)?

    With the continued use of Windows Notepad, probably the most popular
    text editor that does what you describe, and the introduction of U+2060
    WORD JOINER to take over the non-BOM uses of U+FEFF, applications should
    (IMHO) be more tolerant of the possibility that a text will start with
    EF BB BF and deal with it appropriately.
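
    One way to be tolerant on input while leaving the output format to the
    user is sketched below. It relies on Python's standard "utf-8-sig"
    codec, which drops a single leading BOM when decoding; the function
    names decode_utf8 and encode_utf8 and the with_bom flag are
    assumptions for the sake of the example.

```python
import codecs


def decode_utf8(data):
    """Decode UTF-8 bytes, accepting input with or without a leading BOM."""
    return data.decode("utf-8-sig")


def encode_utf8(text, with_bom=False):
    """Encode to UTF-8; the caller (or the user) decides whether to add a BOM."""
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if with_bom else body


print(decode_utf8(b"\xef\xbb\xbfhello"))    # hello
print(decode_utf8(b"hello"))                # hello
print(encode_utf8("hello", with_bom=True))  # b'\xef\xbb\xbfhello'

# Only the leading signature is removed; a U+FEFF in the middle of the
# text is left alone.
print(decode_utf8("a\ufeffb".encode("utf-8")) == "a\ufeffb")  # True
```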

    > By text file I refer to a ... Well, text file. Something that you
    > might edit with emacs or Notepad, that does not have any character
    > encoding info attached to it.

    All text files have a character encoding associated with them, by
    definition.

    --
    Doug Ewell
    Fullerton, California
    http://users.adelphia.net/~dewell/
    

