UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 07:15:37 CST


    Hans Aberg wrote:
    > The main point is that BOM will not be specially treated in
    > the UNIX world,
    > regardless what Unicode says.

    I won't say it won't, I won't say it will.

    The UTF-8 BOM breaks UNIX right at its foundation, because text there is
    often treated as opaque, almost binary data.

    A UTF-8 BOM is in a way no worse than a UTF-16 BOM, except that UTF-16
    plain text is rare. Here is why:

    * If you wanted to process UTF-16 text, you'd need a new set of functions
    and programs.
    * Often you can process UTF-8 text with the old functions and programs.
    * If you want to process UTF-8 with the old functions and programs, the
    BOM will introduce problems (see the sketch after this list).
    * If you want to process UTF-8 with BOM, you need a new set of functions and
    programs.
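
    To make the third point above concrete, here is a minimal sketch in C of
    the kind of old, byte-oriented code that a BOM defeats: a routine that
    matches the first line of a file against an expected string. The file
    name and the expected line are made up for the illustration:

        #include <stdio.h>
        #include <string.h>

        /* Return 1 if the first line of the file equals 'expected'.
         * This is the kind of byte-for-byte check that old programs
         * make -- and that a leading BOM silently defeats. */
        int first_line_is(const char *path, const char *expected)
        {
            char line[256];
            FILE *f = fopen(path, "rb");
            if (f == NULL || fgets(line, sizeof line, f) == NULL) {
                if (f != NULL)
                    fclose(f);
                return 0;
            }
            fclose(f);
            line[strcspn(line, "\r\n")] = '\0';  /* drop the line ending */
            return strcmp(line, expected) == 0;  /* a BOM makes this fail */
        }

        int main(void)
        {
            /* If config.txt was saved with a UTF-8 BOM, its first line
             * is "\xEF\xBB\xBF[settings]", and a check that used to
             * succeed now quietly fails. */
            printf("%d\n", first_line_is("config.txt", "[settings]"));
            return 0;
        }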

    Suppose you go for the latter. You can solve the CRLF and the UTF-8 BOM
    problems together by introducing a 'text' mode on UNIX. Windows has it: in
    fopen, in ftp, in many utilities (compare, even copy!).

    So what you need is for fopen to have binary and text modes. In text mode,
    it strips the CRs, automatically determines the file format, strips the
    BOM, and converts the data into the 'running' encoding (say, UTF-8). You
    need to specify a bit more when creating a file, but not necessarily, if
    you favor one format over the rest.
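
    As a rough illustration, here is a minimal sketch in C of such a
    text-mode fopen. The names fopen_text and fgetc_text are hypothetical,
    and only the UTF-8 BOM and the CRLF cases are handled; a real
    implementation would also have to detect UTF-16 and convert between
    encodings:

        #include <stdio.h>

        /* Open a file for reading and skip a leading UTF-8 BOM
         * (EF BB BF), if one is present. */
        FILE *fopen_text(const char *path)
        {
            FILE *f = fopen(path, "rb");
            if (f == NULL)
                return NULL;
            unsigned char bom[3];
            size_t n = fread(bom, 1, 3, f);
            if (n != 3 || bom[0] != 0xEF || bom[1] != 0xBB || bom[2] != 0xBF)
                rewind(f);  /* no BOM: start over at the first byte */
            return f;
        }

        /* Read one character in text mode, folding CRLF into LF. */
        int fgetc_text(FILE *f)
        {
            int c = fgetc(f);
            if (c == '\r') {
                int next = fgetc(f);
                if (next == '\n')
                    return '\n';      /* CRLF pair becomes a single LF */
                if (next != EOF)
                    ungetc(next, f);  /* lone CR: push the byte back */
            }
            return c;
        }

    A program reading through these two functions sees the same bytes whether
    or not the file carried a BOM or CRLF line endings.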

    Consequently, all (OK, many) programs need to get new command-line
    switches to decide which mode to use. It is interesting that text mode is
    the default on Windows, both for fopen and for many commands (even the
    copy command, when copying to a device!).

    This is a lot of work and a lot of confusion. It is interesting that such
    an approach would go along with a filesystem that stores data internally
    in Unicode and adjusts the filenames according to the user's locale.
    Actually, the two approaches go together to the point that one would not
    work without the other.

    Now, who will dare say which way things will go? It is interesting that
    the idea of separating text and binary data is something several Unicoders
    proudly use as an argument whenever I speak of the potential problems that
    arise when legacy-encoded data is mixed with UTF-8-encoded data. Some
    things are indeed easier if you separate text and binary data. But that is
    easy if you already have that separation. Like in a database. Or on
    Windows, since the separation goes way back. But not on UNIX.

    OK, so what do we have:
    * On Windows, files are CRLF delimited.
    * On Windows, fopen has a text mode, programs have a /b switch.
    * On Windows, text mode is typically the default.
    * On Windows, filenames and files are converted to the user's code page.
    * On Windows, filenames can be and by default are case insensitive.
    * On Windows, a BOM is used in UTF-8 streams and has proven to be useful.
    * On Windows, the command line is neglected. Windows has serious problems
    introducing UTF-8 support into the console, because the text mode of the
    standard runtime library still handles only the CRLF, but not the BOM.

    * On UNIX, files are LF delimited.
    * On UNIX, fopen is always binary; text mode is rare and even then very
    close to binary.
    * On UNIX, text mode often doesn't apply, so UNIX is by default in binary
    mode.
    * On UNIX, filenames are presented and treated as opaque strings.
    * On UNIX, filenames are therefore case sensitive.
    * On UNIX, if a BOM is used in UTF-8 streams, almost everything breaks
    (a demonstration follows this list).
    * On UNIX, the command line and scripting are very strong, so the
    vulnerability to a UTF-8 BOM is very high.
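
    The scripting point deserves a demonstration. A POSIX kernel recognizes a
    script by the two leading bytes '#!'; a BOM in front of them hides that
    magic number, so execve() fails with ENOEXEC. A small self-contained
    example in C (the /tmp path is made up):

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
            /* Write a shell script with a UTF-8 BOM in front of "#!". */
            const char *path = "/tmp/bom-demo.sh";
            FILE *f = fopen(path, "wb");
            if (f == NULL)
                return 1;
            fputs("\xEF\xBB\xBF#!/bin/sh\necho hello\n", f);
            fclose(f);
            chmod(path, 0755);

            /* The kernel sees EF BB BF instead of "#!" and refuses to
             * run the file, so execve() returns with an error. */
            char *const argv[] = { (char *)path, NULL };
            char *const envp[] = { NULL };
            execve(path, argv, envp);
            printf("execve failed: %s\n", strerror(errno));
            return 0;
        }

    On Linux this typically prints "execve failed: Exec format error", which
    is ENOEXEC.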

    Examine the above lists and see how things are strongly related. It is
    practically impossible to allow BOM on UNIX without introducing the text
    mode.

    And vice versa. If you introduce the text mode, then you rely heavily on
    distinguishing between various formats, as well as distinguishing between
    UTF-8 and legacy 8-bit text data. Aaaahhhhh, then you DO need the UTF-8 BOM!

    So, one needs to decide. Either favor the distinction between text and
    binary data AND allow UTF-8 BOM, or drop the distinction and ban the UTF-8
    BOM.

    Now, one bad thing about the UTF-8 BOM is that we wouldn't need it if
    there were no legacy data. And we won't need it once legacy data is
    practically gone (some say that will be soon, but ... I wouldn't bet on
    it). We might be stuck with the BOM for decades, long after it has become
    useless. Just like the CRLF pair, which was introduced on teletype
    machines because they could not physically complete the carriage return
    within 150 milliseconds. It is still around and causing nothing but
    trouble.

    If we think that UTF-8 will be THE encoding to be used for decades, then
    we shouldn't burden it with the BOM. If we think other formats will start
    gaining, then we will need a mechanism to distinguish among them, and text
    mode is inevitable. But introducing text mode on UNIX will be a pain. UNIX
    would much rather keep the existing binary approach and stick with UTF-8
    as the format to stay.

    Maybe some UNICES will decide to go the text mode way. Maybe none will. It
    depends on whether the BOM problem can be handled on its own. Maybe it
    would be enough to modify some programs. cat could get a /b switch and, by
    default, strip a UTF-8 BOM (sketched below). Programs that really are
    intended for text should strip it as well, and don't even need a switch.
    If UNIX can get away with that, a full-blown text mode implementation will
    not be needed.
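
    For illustration, here is a minimal sketch in C of that default behavior:
    a cat-like filter that strips a single leading UTF-8 BOM from its input
    and copies everything else through untouched. The tool is hypothetical,
    of course; no real cat behaves this way today:

        #include <stdio.h>

        int main(void)
        {
            unsigned char head[3];
            size_t n = fread(head, 1, 3, stdin);

            /* Pass the first bytes through only if they are not a BOM. */
            if (!(n == 3 && head[0] == 0xEF && head[1] == 0xBB
                         && head[2] == 0xBF))
                fwrite(head, 1, n, stdout);

            /* Copy the rest of the stream through untouched. */
            int c;
            while ((c = getchar()) != EOF)
                putchar(c);
            return 0;
        }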

    In the end, notice that not having a BOM and not having the text mode on
    UNIX also leads to the coexistence of UTF-8 and legacy-encoded data. Which
    brings us back to the invalid sequences in UTF-8.

    > So I guess MS does not want its
    > text files to
    > be read in the UNIX world.

    There are many possibilities. Maybe they are convinced their approach is
    better. Maybe they know it is not, but are convinced this approach will
    prevail. And want to be a part of it.
    Maybe it's a conspiracy. Maybe they really don't want their files to be
    useful on UNIX. Funny, it's even worse: as I pointed out, Notepad doesn't
    even display UNIX files properly unless the LFs are extended into CRLF
    pairs. One would expect at least one-way compatibility, to help users
    'move' to Windows. But perhaps they think a clean cut is even more
    'convincing'.

    I am not saying any of the above scenarios is true. Maybe it just happens
    naturally. I suppose it does. UTF-8 BOM is simply useful on Windows. And
    simply devastating on UNIX. It's the text mode that makes it that way.

    > Unicode has made the mistake of favoring a
    > special platform over all the others.

    I am not sure what Unicode says about the UTF-8 BOM. I assume it is
    loosely allowed. Which is actually the best Unicode can come up with.
    Deciding on one approach over another before the FULL implications are
    understood would be a mistake. And it would hit one platform or the other.
    Waiting for the problem to be fully understood is NOT a mistake. And
    loosely allowing it is waiting.

    Lars


