Re: UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)

From: Hans Aberg ([email protected])
Date: Thu Jan 20 2005 - 12:16:29 CST

Next message: Rick McGowan: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Hans Aberg: "Re: Subject: Re: 32'nd bit & UTF-8"
In reply to: Lars Kristan: "UTF-8 'BOM' (was RE: Subject: Re: 32'nd bit & UTF-8)"
Next in thread: Marcin 'Qrczak' Kowalczyk: "Re: UTF-8 'BOM'"

On 2005/01/20 14:15, Lars Kristan at [email protected] wrote:

> Hans Aberg wrote:
>> The main point is that BOM will not be specially treated in
>> the UNIX world,
>> regardless what Unicode says.
>
> I won't say it won't, I won't say it will.
>
> UTF-8 BOM breaks down the UNIX right in the foundation. Which is because text
> is often treated as opaque, almost binary data.

Right.
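
A minimal sketch of what that means in practice (Python; the sample strings
and the `cat` helper are made up for illustration, not taken from any real
tool): when files are joined as opaque bytes, a BOM at the start of the
second file ends up in the middle of the result.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def cat(*chunks: bytes) -> bytes:
    """Join byte strings with no interpretation, as a byte-oriented tool would."""
    return b"".join(chunks)


# Two hypothetical "files", each saved with a leading UTF-8 BOM.
a = BOM + "first\n".encode("utf-8")
b = BOM + "second\n".encode("utf-8")

joined = cat(a, b)

# 'utf-8-sig' removes only a BOM at the very start of the data,
# so the BOM from the second file survives as U+FEFF inside the text.
text = joined.decode("utf-8-sig")
stray = text.count("\ufeff")
```

Here `stray` counts the BOMs left inside the decoded text; a tool that
treats its input as plain bytes has no reason to know they are there.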

> UTF-8 BOM is in a way no worse than UTF-16 BOM. Except that UTF-16 plain text
> is rare.

The problem with the UTF-8 BOM is that, for 8-bit data, one wants to reuse
the 8-bit extended ASCII handling already available. Since one does not
expect to handle 16-bit data that way, the UTF-16 BOM causes much less
trouble from the point of view of implementing the OS.

> This is a lot of work and a lot of confusion.

So it seems.

>...
> Examine the above lists and see how things are strongly related. It is
> practically impossible to allow BOM on UNIX without introducing the text mode.

It is clear that requiring a BOM for UTF-8 on UNIX will cause a lot of
problems. And it is unclear how to find effective solutions.

> If we think that UTF-8 will be THE encoding to be used for decades, then we
> shouldn't burden it with the BOM.

So I think too. The idea of having file markers tied to OS file handling
seems to be an archaic one. Unicode, in effect, tries to turn the clock
back.

> If we think other formats will start
> gaining, then we will need the mechanism to distinguish among them and text
> mode is inevitable. But, introducing text mode on UNIX will be a pain. UNIX
> would much rather go with the existing binary approach and stick with UTF-8 as the
> format to stay.

There are already UNIX versions, such as Mac OS X, that make that distinction
by introducing extra files, or "resource" files. It is easier at the basic OS
level to bundle several binary files together as one unit than to have a
single file carry all the information. Unicode breaks the possibility of
developing the most efficient OS.

>> So I guess MS does not want its
>> text files to
>> be read in the UNIX world.
>
> There are many possibilities. Maybe they are convinced their approach is
> better. Maybe they know it is not, but are convinced this approach will
> prevail. And want to be a part of it.
>
> Maybe it's a conspiracy. Maybe they really don't want their files to be useful
> on UNIX. Funny, it's even worse: as I pointed out, Notepad doesn't even
> display UNIX files properly unless LFs are extended into CRLF pairs. One would
> expect at least a one way compatibility, to help users to 'move' to Windows.
> But perhaps they think a clean cut is even more 'convincing'.
>
> I am not saying any of the above scenarios is true. Maybe it just happens
> naturally. I suppose it does. UTF-8 BOM is simply useful on Windows. And
> simply devastating on UNIX. It's the text mode that makes it that way.

Conspiracy or not, it just seems to happen that big companies develop their
own formats, incompatible with others' formats. Standards, which should
properly help with the problem, get corrupted.

>> Unicode has made the mistake of favoring a
>> special platform over all the others.
>
> I am not sure what Unicode says about the UTF-8 BOM. I assume it is loosely
> allowed.

Folks say here that the BOM is actually required at the beginning of
files, in order for them to be allowed to be called UTF-8. Then it is
obvious it is a file format, not a character encoding.

> Which is actually the best Unicode can come up with. Deciding on one
> approach over another before FULL implications are understood would be a
> mistake.

Unicode has evidently already made that mistake.

> And it would hit one platform or the other. Waiting for the problem
> to be fully understood is NOT a mistake. And loosely allowing it is waiting.

Just dropping the BOM as a requirement in UTF-8 would remove the problem.
BOMs need not even be recognized by UTF-8, because one can still use them to
define a file format. Then it only means that the UTF-8 proper code is what
one gets when the BOM has been removed.
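
As a rough sketch of that layering (Python; `strip_utf8_bom` is a
hypothetical helper name, and treating the BOM purely as a file-format
marker is the assumption being illustrated): the file-format layer removes
the marker if it is present, and what is left is handed on as plain UTF-8.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A hypothetical file body written with a BOM in front.
with_marker = codecs.BOM_UTF8 + "hello\n".encode("utf-8")

# After the marker is removed, the remainder decodes as ordinary UTF-8.
payload = strip_utf8_bom(with_marker)
decoded = payload.decode("utf-8")
```

Data that never had the marker passes through unchanged, so the same
routine works for files from either kind of platform.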

Hans Aberg

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 12:18:42 CST