Re: Subject: Re: 32'nd bit & UTF-8

From: Mark E. Shoulson (mark@kli.org)
Date: Thu Jan 20 2005 - 08:53:06 CST


    Marcin 'Qrczak' Kowalczyk wrote:

    >"Oliver Christ" <oli@trados.com> writes:
    >
    >>On the very contrary. It's most helpful to determine a text file's
    >>encoding. Without the UTF8 BOM it's hard to tell whether a file is
    >>encoded in some ISO or whatever encoding/codepage or is already UTF8.
    >
    >The problem with BOM in UTF8 is that it must be specially handled by
    >all applications. It effectively turns UTF-8 into a stateful encoding
    >where the beginning of a "text stream" must be treated specially.
    >World would be simpler if UTF-8 BOM was banned.
    >
    >Fortunately I have never met a Unix program which used a UTF-8 BOM,
    >so I can mostly ignore the issue, except that text files coming from
    >Windows may have that annoying thing at the beginning which must be
    >stripped.
    >
    >
    That seems to be it; just a quick fix when needed.
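
    As a rough illustration of that quick fix, here is a minimal Python
    sketch that drops a leading UTF-8 BOM (the bytes EF BB BF) from raw
    file contents. The function name strip_utf8_bom is made up for this
    example; codecs.BOM_UTF8 is the standard-library constant for those
    three bytes.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; only the very start of the
    stream is examined, so later U+FEFF characters are left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

    Input without a BOM comes back unchanged, so the helper can be
    applied to every incoming file regardless of where it came from.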

    From what I can see, the real problem of BOMs is that they break the
    model of UTF-8 as a superset of ASCII (well, sorta). That is, if I take
    an ASCII-only file, load it into a Unicode-aware text editor, and then
    save it back as UTF-8, I would *expect* to have an ASCII-only file,
    since UTF-8 subsumes ASCII and I didn't change anything. But no,
    there's this little snippet of meta-data that got tacked on to the front
    of my actual data that suddenly takes me out of the ASCII realm. OK, I
    can see that as annoying, but probably not a show-stopper. The trouble
    really is that UNIX doesn't store encodings with its files, so a file
    might be expected to be ASCII or Latin-1 or binary or who-knows-what,
    and the applications that deal with it somehow have to figure it out or
    guess (possibly wrongly), while Microsoft's files are expected to be
    UTF-8 through and through (or so I am inferring, also probably wrongly).
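
    The ASCII round trip described at the start of that paragraph can be
    sketched in Python. This assumes the editor behaves like Python's
    "utf-8-sig" codec, which writes a BOM when encoding; the variable
    names and the is_ascii helper are invented for the example.

```python
import codecs

# An ASCII-only "file", loaded into a Unicode-aware program.
original = b"plain ASCII text\n"
text = original.decode("ascii")

# Saved as UTF-8 without a BOM: byte-for-byte identical to the input.
without_bom = text.encode("utf-8")

# Saved as UTF-8 with a BOM: three non-ASCII bytes now lead the file.
with_bom = text.encode("utf-8-sig")


def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)
```

    Here without_bom equals original, while with_bom is the same data
    preceded by codecs.BOM_UTF8 and no longer passes is_ascii.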

    I'm not as anti-Microsoft as the next person: I'm actually quite a bit
    *more* anti-Microsoft than the next person. And yet at least in the
    case of the #! convention, I don't see why UNIX can't bend a little.
    Just check for '#!' *or* 'BOM#!' when you open a file for execution.
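
    A sketch of that check, in Python rather than kernel code, might
    look like the following. It is only an illustration of the
    suggestion, not a description of what any actual exec
    implementation does, and has_shebang is a hypothetical name.

```python
import codecs

SHEBANG = b"#!"


def has_shebang(head: bytes) -> bool:
    """Accept '#!' at the start of a file, with or without a UTF-8 BOM.

    head is assumed to hold the first few bytes of the file being
    considered for execution.
    """
    return head.startswith(SHEBANG) or head.startswith(codecs.BOM_UTF8 + SHEBANG)
```

    For example, has_shebang(b"\xef\xbb\xbf#!/bin/sh\n") is true under
    this rule, whereas a plain startswith(b"#!") test would reject it.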

    ~mark


