UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 07:15:37 CST


    Hans Aberg wrote:
    > The main point is that BOM will not be specially treated in
    > the UNIX world,
    > regardless what Unicode says.

    I won't say it won't, I won't say it will.

    The UTF-8 BOM breaks UNIX right at its foundation, because text there is
    often treated as opaque, almost binary data.

    A UTF-8 BOM is in a way no worse than a UTF-16 BOM, except that UTF-16
    plain text is rare. Here is why:

    * If you wanted to process UTF-16 text, you'd need a new set of functions
    and programs.
    * Often you can process UTF-8 text with the old functions and programs.
    * If you want to process UTF-8 with the old functions and programs, the
    BOM will introduce problems (see the sketch after this list).
    * If you want to process UTF-8 with BOM, you need a new set of functions and
    programs.
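
    To make the third point above concrete, here is a minimal sketch in C of
    the kind of old, byte-oriented code that a BOM defeats: a routine that
    matches the first line of a file against an expected string. The file
    name and the expected line are made up for the illustration:

        #include <stdio.h>
        #include <string.h>

        /* Return 1 if the first line of the file equals 'expected'.
         * This is the kind of byte-for-byte check that old programs
         * make -- and that a leading BOM silently defeats. */
        int first_line_is(const char *path, const char *expected)
        {
            char line[256];
            FILE *f = fopen(path, "rb");
            if (f == NULL || fgets(line, sizeof line, f) == NULL) {
                if (f != NULL)
                    fclose(f);
                return 0;
            }
            fclose(f);
            line[strcspn(line, "\r\n")] = '\0';  /* drop the line ending */
            return strcmp(line, expected) == 0;  /* a BOM makes this fail */
        }

        int main(void)
        {
            /* If config.txt was saved with a UTF-8 BOM, its first line
             * is "\xEF\xBB\xBF[settings]", and a check that used to
             * succeed now quietly fails. */
            printf("%d\n", first_line_is("config.txt", "[settings]"));
            return 0;
        }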

    Suppose you go for the latter. You can solve the CRLF and the UTF-8 BOM
    problems together by introducing a 'text' mode on UNIX. Windows has it: in
    fopen, in ftp, in many utilities (compare, even copy!).

    So what you need is for fopen to have binary and text modes. In text mode,
    it strips the CRs, automatically determines the file format, strips the
    BOM, and converts the data into the 'running' encoding (say, UTF-8). You
    need to specify a bit more when creating a file, but not necessarily, if
    you favor one format over the rest.
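
    As a rough illustration, here is a minimal sketch in C of such a
    text-mode fopen. The names fopen_text and fgetc_text are hypothetical,
    and only the UTF-8 BOM and the CRLF cases are handled; a real
    implementation would also have to detect UTF-16 and convert between
    encodings:

        #include <stdio.h>

        /* Open a file for reading and skip a leading UTF-8 BOM
         * (EF BB BF), if one is present. */
        FILE *fopen_text(const char *path)
        {
            FILE *f = fopen(path, "rb");
            if (f == NULL)
                return NULL;
            unsigned char bom[3];
            size_t n = fread(bom, 1, 3, f);
            if (n != 3 || bom[0] != 0xEF || bom[1] != 0xBB || bom[2] != 0xBF)
                rewind(f);  /* no BOM: start over at the first byte */
            return f;
        }

        /* Read one character in text mode, folding CRLF into LF. */
        int fgetc_text(FILE *f)
        {
            int c = fgetc(f);
            if (c == '\r') {
                int next = fgetc(f);
                if (next == '\n')
                    return '\n';      /* CRLF pair becomes a single LF */
                if (next != EOF)
                    ungetc(next, f);  /* lone CR: push the byte back */
            }
            return c;
        }

    A program reading through these two functions sees the same bytes whether
    or not the file carried a BOM or CRLF line endings.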

    Consequently, all (OK, many) programs need to get new command-line
    switches to decide which mode to use. It is interesting that text mode is
    the default on Windows, both for fopen and for many commands (even the
    copy command, when copying to a device!).

    This is a lot of work and a lot of confusion. It is interesting that such
    an approach would go along with a filesystem that stores data internally
    in Unicode and adjusts the filenames according to the user's locale.
    Actually, the two approaches go together to the point that one would not
    work without the other.

    Now, who will dare say which way things will go? It is interesting that
    the idea of separating text and binary data is something several Unicoders
    proudly use as an argument whenever I speak of the potential problems that
    arise when legacy-encoded data is mixed with UTF-8-encoded data. Some
    things are indeed easier if you separate text and binary data. But that is
    easy if you already have that separation. Like in a database. Or on
    Windows, since the separation goes way back. But not on UNIX.

    OK, so what do we have:
    * On Windows, files are CRLF delimited.
    * On Windows, fopen has a text mode, programs have a /b switch.
    * On Windows, text mode is typically the default.
    * On Windows, filenames and files are converted to the user's code page.
    * On Windows, filenames can be and by default are case insensitive.
    * On Windows, a BOM is used in UTF-8 streams and has proven to be useful.
    * On Windows, the command line is neglected. Windows has serious problems
    introducing UTF-8 support into the console, because the text mode of the
    standard runtime library still handles only the CRLF, but not the BOM.

    * On UNIX, files are LF delimited.
    * On UNIX, fopen is always binary; text mode is rare and even then very
    close to binary.
    * On UNIX, text mode often doesn't apply, so UNIX is by default in binary
    mode.
    * On UNIX, filenames are presented and treated as opaque strings.
    * On UNIX, filenames are therefore case sensitive.
    * On UNIX, if a BOM is used in UTF-8 streams, almost everything breaks
    (a demonstration follows this list).
    * On UNIX, the command line and scripting are very strong, so the
    vulnerability to a UTF-8 BOM is very high.
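
    The scripting point deserves a demonstration. A POSIX kernel recognizes a
    script by the two leading bytes '#!'; a BOM in front of them hides that
    magic number, so execve() fails with ENOEXEC. A small self-contained
    example in C (the /tmp path is made up):

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(void)
        {
            /* Write a shell script with a UTF-8 BOM in front of "#!". */
            const char *path = "/tmp/bom-demo.sh";
            FILE *f = fopen(path, "wb");
            if (f == NULL)
                return 1;
            fputs("\xEF\xBB\xBF#!/bin/sh\necho hello\n", f);
            fclose(f);
            chmod(path, 0755);

            /* The kernel sees EF BB BF instead of "#!" and refuses to
             * run the file, so execve() returns with an error. */
            char *const argv[] = { (char *)path, NULL };
            char *const envp[] = { NULL };
            execve(path, argv, envp);
            printf("execve failed: %s\n", strerror(errno));
            return 0;
        }

    On Linux this typically prints "execve failed: Exec format error", which
    is ENOEXEC.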

    Examine the above lists and see how things are strongly related. It is
    practically impossible to allow BOM on UNIX without introducing the text
    mode.

    And vice versa. If you introduce the text mode, then you rely heavily on
    distinguishing between various formats, as well as distinguishing between
    UTF-8 and legacy 8-bit text data. Aaaahhhhh, then you DO need the UTF-8 BOM!

    So, one needs to decide. Either favor the distinction between text and
    binary data AND allow UTF-8 BOM, or drop the distinction and ban the UTF-8
    BOM.

    Now, one bad thing about the UTF-8 BOM is that we wouldn't need it if
    there were no legacy data. And we won't need it once legacy data is
    practically gone (some say that will be soon, but ... I wouldn't bet on
    it). We might be stuck with the BOM for decades, long after it has become
    useless. Just like the CRLF pair, which was introduced on teletype
    machines because they could not physically complete the carriage return
    within 150 milliseconds. It is still around and causing nothing but
    trouble.

    If we think that UTF-8 will be THE encoding to be used for decades, then
    we shouldn't burden it with the BOM. If we think other formats will start
    gaining, then we will need a mechanism to distinguish among them, and text
    mode is inevitable. But introducing text mode on UNIX will be a pain. UNIX
    would much rather keep the existing binary approach and stick with UTF-8
    as the format to stay.

    Maybe some UNICES will decide to go the text mode way. Maybe none will. It
    depends on whether the BOM problem can be handled on its own. Maybe it
    would be enough to modify some programs. cat could get a /b switch and, by
    default, strip a UTF-8 BOM (sketched below). Programs that really are
    intended for text should strip it as well, and don't even need a switch.
    If UNIX can get away with that, a full-blown text mode implementation will
    not be needed.
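
    For illustration, here is a minimal sketch in C of that default behavior:
    a cat-like filter that strips a single leading UTF-8 BOM from its input
    and copies everything else through untouched. The tool is hypothetical,
    of course; no real cat behaves this way today:

        #include <stdio.h>

        int main(void)
        {
            unsigned char head[3];
            size_t n = fread(head, 1, 3, stdin);

            /* Pass the first bytes through only if they are not a BOM. */
            if (!(n == 3 && head[0] == 0xEF && head[1] == 0xBB
                         && head[2] == 0xBF))
                fwrite(head, 1, n, stdout);

            /* Copy the rest of the stream through untouched. */
            int c;
            while ((c = getchar()) != EOF)
                putchar(c);
            return 0;
        }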

    In the end, notice that not having a BOM and not having the text mode on
    UNIX also leads to the coexistence of UTF-8 and legacy-encoded data. Which
    brings us back to the invalid sequences in UTF-8.

    > So I guess MS does not want its
    > text files to
    > be read in the UNIX world.

    There are many possibilities. Maybe they are convinced their approach is
    better. Maybe they know it is not, but are convinced this approach will
    prevail. And want to be a part of it.
    Maybe it's a conspiracy. Maybe they really don't want their files to be
    useful on UNIX. Funny, it's even worse: as I pointed out, Notepad doesn't
    even display UNIX files properly unless the LFs are extended into CRLF
    pairs. One would expect at least one-way compatibility, to help users
    'move' to Windows. But perhaps they think a clean cut is even more
    'convincing'.

    I am not saying any of the above scenarios is true. Maybe it just happens
    naturally. I suppose it does. UTF-8 BOM is simply useful on Windows. And
    simply devastating on UNIX. It's the text mode that makes it that way.

    > Unicode has made the mistake of favoring a
    > special platform over all the others.

    I am not sure what Unicode says about the UTF-8 BOM. I assume it is
    loosely allowed. Which is actually the best Unicode can come up with.
    Deciding on one approach over another before the FULL implications are
    understood would be a mistake. And it would hit one platform or the other.
    Waiting for the problem to be fully understood is NOT a mistake. And
    loosely allowing it is waiting.

    Lars


