RE: Several BOMs in the same file

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Mar 25 2003 - 04:59:21 EST

Next message: Kent Karlsson: "RE: Several BOMs in the same file"

Previous message: Eric Rasmussen: "Re: CJK question"
Maybe in reply to: Stefan Persson: "Several BOMs in the same file"
Next in thread: Kent Karlsson: "RE: Several BOMs in the same file"
Reply: Kent Karlsson: "RE: Several BOMs in the same file"
Reply: Pim Blokland: "Re: Several BOMs in the same file"

Stefan Persson wrote:
> Let's say that I have two files, namely file1 & file2, in any Unicode
> encoding, both starting with a BOM, and I compile them into
> one by using
>
> cat file1 file2 > file3
>
> in Unix or
>
> copy file1 + file2 file3
>
> in MS-DOS, file3 will have the following contents:
>
> BOM
> contents from file1
> BOM
> contents from file2
>
> Is this in accordance with the Unicode standard, or do I have
> to remove the second BOM?

IMHO, Unicode should not specify such a behavior. Deciding what a shell
command is supposed to do is a decision of the operating system, not of text
encoding standards.

BTW, consider that both Unix "cat" and DOS "copy" are not limited to Unicode
text files. Actually, they are not even limited to text files at all: you
could use them to concatenate a bitmap with a font with an HTML document
with a spreadsheet... whether the result makes sense or not is up to you
and/or to the applications that will process the resulting file.
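
To see what the raw concatenation actually produces, here is a small sketch in Python. The two "files" are made-up byte strings, assumed to be UTF-8 with a BOM; nothing here comes from a real file.

```python
import codecs

# Two hypothetical UTF-8 "files", each starting with a BOM.
file1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-by-byte concatenation, which is what "cat file1 file2 > file3" does.
file3 = file1 + file2

# "utf-8-sig" strips only a BOM at the very start of the data; the second
# BOM survives as the character U+FEFF in the middle of the decoded text.
text = file3.decode("utf-8-sig")
bom_count = text.count("\ufeff")
print(repr(text), bom_count)
```

Whether that leftover U+FEFF is harmless or not is then up to whatever application reads file3.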

Probably, there should be two separate commands (or different options of the
same command): one to do a raw byte-by-byte concatenation, and one to do an
encoding-aware concatenation of text files.

E.g., imagine a "cat" command with these extensions:

        Synopsis
                cat [ -... ] [ -R encoding ] { [ -F encoding ] file }
        Description:
                ...
                If neither -R nor -F is specified, the concatenation is
                done byte by byte.
        Options:
                ...
                -R specifies the encoding of the resulting *text* file;
                -F specifies the encoding of the following *text* file.

Your command above would now expand to something like this:

cat -R UTF-16 -F UTF-16LE file1 -F Big-5 file2 > file3

Provided with information about the input encodings and the expected output
encoding, "cat" could now correctly handle BOMs, endianness, new-line
conventions, and even perform character set conversions. Without this extra
info, "cat" would retain its good ol' byte-by-byte functionality.
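
A rough sketch of what such an encoding-aware concatenation might do, in Python. The function name text_cat and its parameters are hypothetical: "inputs" stands in for the "-F encoding file" pairs and "out_encoding" for "-R encoding". The sketch assumes that new-lines are normalized to LF, which is only one of several possible policies.

```python
def text_cat(inputs, out_encoding):
    """Concatenate (data, encoding) pairs into one encoded text.

    Hypothetical helper: `inputs` plays the role of the "-F encoding file"
    pairs and `out_encoding` the role of "-R encoding".
    """
    parts = []
    for data, encoding in inputs:
        text = data.decode(encoding)
        # Drop a leading BOM (U+FEFF) if the decoder left one in place.
        if text.startswith("\ufeff"):
            text = text[1:]
        # Normalize CRLF and lone CR to LF (an assumed policy).
        text = text.replace("\r\n", "\n").replace("\r", "\n")
        parts.append(text)
    # Python's "utf-16" codec writes a single BOM at the start of the output.
    return "".join(parts).encode(out_encoding)


# Made-up sample data: a UTF-16LE file with a BOM and CRLF, and a Big5 file.
file1 = "\ufeffabc\r\n".encode("utf-16-le")
file2 = "中文\n".encode("big5")
result = text_cat([(file1, "utf-16-le"), (file2, "big5")], "utf-16")
print(repr(result.decode("utf-16")))
```

The output carries exactly one BOM, at the start, whatever the inputs contained. A real tool would also need to decide what to do with characters that cannot be represented in the output encoding.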

Similar options could be added to any Unix command potentially dealing with
text files ("cp", "head", "tail", etc.), as well as to their equivalents in
DOS or other operating systems.

_ Marco

This archive was generated by hypermail 2.1.5 : Tue Mar 25 2003 - 06:03:02 EST