Subject: Re: 32'nd bit & UTF-8

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Wed Jan 19 2005 - 16:57:56 CST


    "Oliver Christ" <oli@trados.com> writes:

    > Which is just the same for any other BOM or an encoding specification in
    > HTML's META element (which is much worse as you need to read quite some
    > content before you know the encoding in which to actually read).

    As Hans Aberg said, a BOM is usable in a file format (even if
    inconvenient), but it makes little sense at the level of an encoding,
    because "beginning of text stream" is an ambiguous concept.

    Consider a program which reads a list of filenames to process from a
    file, such as tar with its -X / --exclude-from option. Should it
    support the case where the file starts with a BOM? Note that it
    currently doesn't recode the filenames at all, because a filename is
    technically an almost arbitrary sequence of bytes. If a user edits
    the list of files, the text editor inserts a BOM, and tar fails to
    exclude a file because its real filename doesn't begin with a BOM,
    whose fault is it?
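    For illustration, here is a minimal sketch (not tar's actual code;
    is_excluded and the filenames are hypothetical) of why a byte-exact
    comparison fails once an editor has prepended a UTF-8 BOM to the
    list entry:

        /* Exclusion entries are matched against filenames as raw bytes:
           no recoding, no BOM handling. */
        #include <stdio.h>
        #include <string.h>

        static int is_excluded(const char *filename, const char *entry)
        {
            return strcmp(filename, entry) == 0;
        }

        int main(void)
        {
            const char *real_name = "secret.txt";
            /* The same entry as saved by an editor that prepends
               a UTF-8 BOM (the bytes EF BB BF). */
            const char *list_entry = "\xEF\xBB\xBFsecret.txt";

            printf("%d\n", is_excluded(real_name, list_entry)); /* 0 */
            return 0;
        }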

    The same question applies to fgrep with a list of patterns to search
    for in a text file (one pattern per line). If the pattern list starts
    with a BOM, does the user want to search for a BOM, or is it a marker
    to be stripped?

    The diff program compares two text files and produces a text file
    which describes the differences in a precise format, suitable for
    applying the differences to one of the files to obtain the other
    (it's suitable only for text files). The format includes lines of
    the original files prefixed by characters like a space, a plus sign
    or a minus sign. What should it do with BOMs? If it treats them like
    any other character, they will be put in the middle of lines, after
    the prefix character. But files with differences are meant to be
    human-readable, not only machine-readable. What should a text editor
    do with a BOM in the middle of a line? And if diff stripped the BOMs,
    it would lose information: how should it describe the differences
    between two files which differ only in the presence of a BOM?
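    For illustration (assuming diff treats the BOM as an ordinary
    character, with <BOM> standing for the bytes EF BB BF), comparing a
    file that starts with a BOM against the same file without one might
    produce something like:

        1c1
        < <BOM>first line
        ---
        > first line

    The BOM now sits after the "<" prefix, i.e. in the middle of a line
    of the diff output.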

    Unix programs tend to treat a BOM the same way as a CR before an LF.
    If the programmer took care to make the program recognize the Windows
    convention, then it will understand the file (though it will not
    necessarily recreate the CR on output); by default, without special
    support, a CR is treated as a strange whitespace character. Internet
    protocols which specify CR before LF are of course supported, but
    file formats based on text files generally use LF only. Similarly
    for the BOM: in most programs, where it doesn't just become harmless
    naturally, it will be treated as a strange character at the beginning.
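    As a minimal sketch of what such special support amounts to (the
    function read_text_line is hypothetical, not taken from any real
    program), a tolerant line reader strips a trailing CR and, on the
    first line only, a leading UTF-8 BOM:

        #include <stdio.h>
        #include <string.h>

        /* Reads one line, dropping "\n" or "\r\n"; if first_line is
           nonzero, also drops a leading UTF-8 BOM (EF BB BF). */
        static char *read_text_line(char *buf, int size, FILE *f,
                                    int first_line)
        {
            if (!fgets(buf, size, f))
                return NULL;
            size_t len = strlen(buf);
            if (len > 0 && buf[len - 1] == '\n')
                buf[--len] = '\0';
            if (len > 0 && buf[len - 1] == '\r')
                buf[--len] = '\0';
            if (first_line && strncmp(buf, "\xEF\xBB\xBF", 3) == 0)
                memmove(buf, buf + 3, len - 3 + 1);
            return buf;
        }

        int main(void)
        {
            char line[4096];
            for (int first = 1;
                 read_text_line(line, sizeof line, stdin, first);
                 first = 0)
                puts(line);
            return 0;
        }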

    > I don't see a big difference between the UTF16 BOMs and the UTF8 one.
    > All signal that the file's encoding is Unicode, and specify which
    > "variant" is actually used.

    UTF-16 is not used as a format for text files on Unix because it's
    incompatible with ASCII. A C compiler cannot support a UTF-8 BOM in
    the same way it might support a UTF-16 BOM (I mean when reading C
    source), because a C compiler doesn't accept UTF-16 source at all.

    UTF-16 is used inside Java, inside some databases, and inside some
    library APIs (e.g. Qt). I have *never* met a UTF-16-encoded
    standalone file, while UTF-8 is common and is becoming more and more
    common today.

    C APIs generally either assume the locale's default encoding (e.g.
    localized error messages returned by strerror), or use UTF-8 (e.g.
    Gtk+), or use wchar_t, which is UTF-32 on Unix (e.g. the wide
    character variant of the curses library). UTF-32 lives only in
    memory; it never appears in files.
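    As a minimal sketch of the wchar_t case (assuming a UTF-8 locale
    such as en_US.UTF-8 is available on the system), one multibyte
    UTF-8 character decodes to a single UTF-32 code point in memory:

        #include <locale.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
            setlocale(LC_CTYPE, "en_US.UTF-8");
            const char *utf8 = "z\xC5\x82oty";  /* "zloty" with l-stroke */
            wchar_t wide[16];
            size_t n = mbstowcs(wide, utf8, 16);
            /* The two UTF-8 bytes C5 82 become the single code point
               U+0142 in wide[1]. */
            printf("%zu wide characters; wide[1] = U+%04X\n",
                   n, (unsigned)wide[1]);
            return 0;
        }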

    -- 
       __("<         Marcin Kowalczyk
       \__/       qrczak@knm.org.pl
        ^^     http://qrnik.knm.org.pl/~qrczak/
    

