RE: Several BOMs in the same file

From: Marco Cimarosti (
Date: Tue Mar 25 2003 - 04:59:21 EST

    Stefan Persson wrote:
    > Let's say that I have two files, namely file1 & file2, in any Unicode
    > encoding, both starting with a BOM, and I compile them into
    > one by using
    > cat file1 file2 > file3
    > in Unix or
    > copy file1 + file2 file3
    > in MS-DOS, file3 will have the following contents:
    > BOM
    > contents from file1
    > BOM
    > contents from file2
    > Is this in accordance with the Unicode standard, or do I have
    > to remove the second BOM?

    IMHO, Unicode should not specify such a behavior. What a shell command is
    supposed to do is up to the operating system, not to text encoding
    standards.

    BTW, consider that both Unix "cat" and DOS "copy" are not limited to Unicode
    text files. Actually, they are not even limited to text files at all: you
    could use them to concatenate a bitmap with a font with an HTML document
    with a spreadsheet... whether the result makes sense or not is up to you
    and/or to the applications that will process the resulting file.

    Probably, there should be two separate commands (or different options of the
    same command): one to do a raw byte-by-byte concatenation, and one to do an
    encoding-aware concatenation of text files.
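
    To make the "raw" case concrete, here is a minimal Python sketch of what a
    byte-by-byte concatenation does to two UTF-8 files that each start with a
    BOM. The helper name raw_concat and the sample contents are made up for
    illustration; they are not part of any real tool.

```python
import codecs

def raw_concat(*chunks: bytes) -> bytes:
    """Byte-by-byte concatenation, as plain cat/copy would do (hypothetical helper)."""
    return b"".join(chunks)

# Two sample "files", each starting with a UTF-8 BOM (EF BB BF).
file1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")
file3 = raw_concat(file1, file2)

# Decoding as plain UTF-8 keeps every U+FEFF, so both BOMs are visible.
text = file3.decode("utf-8")
bom_positions = [i for i, ch in enumerate(text) if ch == "\ufeff"]
print(bom_positions)

# A BOM-aware decoder strips only the leading one; the second survives mid-text.
print(repr(file3.decode("utf-8-sig")))
```

    The second BOM ends up in the middle of the text, where a consumer has to
    decide for itself what to do with a stray U+FEFF.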

    E.g., imagine a "cat" command with these extensions:

                    cat [ -... ] [ -R encoding ] { [ -F encoding ] file }
                    If neither -R nor any -F is specified, the concatenation
                    is done byte by byte.
                    -R specifies the encoding of the resulting *text* file;
                    -F specifies the encoding of the following *text* file.

    Your command above would now expand to something like this:

            cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3

    Provided with information about the input encodings and the expected output
    encoding, "cat" could now correctly handle BOMs, endianness, new-line
    conventions, and even perform character set conversions. Without this extra
    info, "cat" would retain its good ol' byte-by-byte functionality.
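
    As a rough illustration of the encoding-aware mode, here is a small Python
    sketch. The function cat_text is hypothetical: its (bytes, encoding) pairs
    stand in for the imagined -F options and out_encoding for -R. It assumes
    the caller names each input encoding correctly, and it normalizes new-lines
    to LF, which is only one of several possible policies.

```python
def cat_text(parts, out_encoding):
    """Encoding-aware concatenation (illustrative sketch, not a real cat option).

    parts: iterable of (raw_bytes, encoding) pairs, one per input file.
    out_encoding: encoding of the resulting text.
    """
    pieces = []
    for raw, encoding in parts:
        text = raw.decode(encoding)
        # Drop a leading BOM if the codec left it in place (e.g. utf-16-le does).
        if text.startswith("\ufeff"):
            text = text[1:]
        # Normalize CRLF and bare CR to LF (an assumed policy).
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        pieces.append(text)
    # Python's "utf-16" codec writes a single BOM at the start of its output.
    return "".join(pieces).encode(out_encoding)

# Rough equivalent of: cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3
file1 = "\ufeffabc\r\n".encode("utf-16-le")
file2 = "中文\n".encode("big5")
file3 = cat_text([(file1, "utf-16-le"), (file2, "big5")], "utf-16")
print(repr(file3.decode("utf-16")))
```

    Here the result carries exactly one BOM, at the start, and the text of both
    inputs has been converted to the requested output encoding.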

    Similar options could be added to any Unix command potentially dealing with
    text files ("cp", "head", "tail", etc.), as well as to their equivalents in
    DOS or other operating systems.

    _ Marco