Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 18:48:53 CST

    On 2005/01/19 22:33, Oliver Christ at oli@trados.com wrote:

    > I don't see a big difference between the UTF16 BOMs and the UTF8 one.
    > All signal that the file's encoding is Unicode, and specify which
    > "variant" is actually used.

    The problem is that UNIX computers do not use file contents to indicate
    file encoding. A BOM at the start of a file breaks scripts and other
    essential OS data. See <http://www.cl.cam.ac.uk/~mgk25/unicode.html>.
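
    To illustrate the script problem, here is a minimal Python sketch. It
    assumes, as a simplification, that the system recognizes an interpreter
    line by checking that the first two bytes of the file are "#!". The
    helper name starts_with_shebang is hypothetical and only models that
    check; it is not any system's actual code.

```python
import codecs

# The three bytes EF BB BF that some editors write at the start of UTF-8 files.
UTF8_BOM = codecs.BOM_UTF8


def starts_with_shebang(data: bytes) -> bool:
    """Simplified model of the interpreter-line check: first two bytes are '#!'."""
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"
script_with_bom = UTF8_BOM + script  # same script, saved with a leading BOM

print(starts_with_shebang(script))           # True
print(starts_with_shebang(script_with_bom))  # False
```

    With the BOM in front, the "#!" is no longer at offset 0, so under this
    model the file is not recognized as a script with an interpreter line.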

    > It should also be relatively simple to pipe any input through e.g. GNU's
    > recode for encoding normalization to UTF16 or whatever so that only one
    > module (the recoder) needs to be aware of BOMs (and/or "sniffing"
    > heuristics). The stream models in Java and .Net implement exactly that.

    The problem is deeper than that, as it affects system software. You are
    effectively asking for UNIX platforms to be adapted to handle MS OS
    problems. That is not fair.

    > Hans Aberg added:
    >
    >> It is clear that the use of a BOM in UTF-8 should properly be
    >> viewed as a file format, and not a character encoding format.
    >
    > That's not clear to me. I find UTF8 BOMs at the beginning of e.g. an
    > .html or .csv file pretty useful, equally useful to { 0xFE 0xFF } or {
    > 0xFF 0xFE } at the beginning of a file. I don't think it helps when
    > 'file' would report such files as "UTF8 encoded text written by Notepad
    > or .Net". But maybe I misunderstood your comment.

    It is a file format, in part because if one singles out a subsegment, one
    cannot tell which encoding it is in. Different file formats use different
    leading markers. If Unicode had supplied an escape character for file
    formats, then that could be used for special file formats, such as "UTF-8
    text" or "MS text". A plain text UTF-8 file would then not have any such
    marker. Thus, the UNIX operating systems would not have to be entirely
    rewritten in order to accommodate UTF-8. If one discovers a file with
    the marker "UTF-8 text", then one could supply a program to treat that, as
    one does in WWW browsers. So the BOM is useful to some as a file contents
    marker, but a major hurdle to others. But Unicode should not hurt anyone.
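
    The subsegment point can be sketched in Python as follows. The two byte
    strings stand in for two hypothetical files that were each saved with a
    leading BOM; joining them models what plain concatenation (for example
    with cat) would produce. The helper strip_bom is an illustrative name,
    not part of any library.

```python
import codecs

BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf'


def strip_bom(data: bytes) -> bytes:
    """Drop one leading UTF-8 BOM, if present; leave everything else alone."""
    return data[len(BOM):] if data.startswith(BOM) else data


# Two hypothetical files, each saved with a leading BOM.
part_a = BOM + "f\u00f6rsta\n".encode("utf-8")
part_b = BOM + "andra\n".encode("utf-8")

# Plain byte concatenation leaves the second BOM in the middle of the data.
joined = part_a + part_b
text = joined.decode("utf-8-sig")  # this codec removes only a leading BOM

print("\ufeff" in text)            # True: U+FEFF survives mid-stream

# A subsegment taken from the middle of a file carries no marker at all.
segment = part_a[len(BOM):]
print(segment.startswith(BOM))     # False
```

    The marker therefore describes the file as a whole, not the text in it:
    tools that cut or join byte streams must each learn to strip or re-add it.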

      Hans Aberg


