Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 12:16:11 CST

Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"

On 2005/01/20 14:36, Philippe Verdy at vpi92@yahoo.fr wrote:

>> I think the correct solution is that MS looks over its software, to
>> ensure that files transferred off that platform get a distributed format.
>
> I don't understand your argument:
> - First, it is not an MS bug or an MS-specific problem.
> - In fact, MS could rapidly deprecate the storage of a BOM in UTF-8
> encoded text files if it adopted a standard identification system for
> meta-data. MS already has the necessary support in NTFS, and is
> preparing Longhorn with a database-driven filesystem. On such a
> filesystem, storing a BOM will no longer be necessary, because the
> effective file format will be stored out of band, in a separate stream or
> in a meta-data repository.
> - I think that Mac filesystems can already store such meta-data
> information about file formats within the resource fork.

MacOS X now has UNIX BSD at the bottom, and the so-called resource fork is
split across several files, but bundled together so as to look like a single
file.

> - VMS filesystems already had support for multiple streams for a given
> file.

So, in short, you say that nobody has any need for BOMs. But Unicode still
does require them.

> If there's a problem, it's not from Microsoft, but from the initial
> design of Unix filesystems:
> - UFS promoted the file extension (which was initially meta-data in
> mainframe filesystems, in VMs or even in CP/M) to part of the filename.
> - UFS promoted the filename from a simple unique access key to one with a
> descriptive role. This was an error.
> - Unix filesystem APIs were simplified to allow access to files only
> in terms of streams of binary bytes, without any indication of what
> they mean. The only remaining information can come from the filename,
> but Unix has NO standard for file extensions.

Look at the post by Marcin 'Qrczak' Kowalczyk, who describes a couple more
issues. The problem runs deeper than merely finding a file-specific
encoding.

> For these reasons, almost all Unix applications that have to deal with
> various file formats have to find a way to determine the effective
> file format: see the kludgy definition of "magic" files (which does not
> work as effectively as it should). So applications need to develop
> their own storage for meta-data, and users must administer this
> information, whose binding to the actual files in the filesystem is not
> guaranteed or secured by the filesystem or OS. All these are in fact
> serious defects of Unix. Don't blame Microsoft for them.
>
> Using a BOM on Unix will greatly help to determine the file format. It is
> simple to parse the first few bytes (with the "magic" method), and then
> read and interpret all the rest of the file correctly. Storing a BOM at
> the beginning of a file is a way to palliate the absence of a separate
> stream to store meta-data.
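
The "parse the first few bytes" step described above could look roughly like
the following Python sketch. It is only an illustration, not anything proposed
in the thread: the names sniff_bom and decode_with_bom are made up here, and
falling back to UTF-8 when no BOM is found is an assumption.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so checking UTF-16 first would misidentify it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(head: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode data using its BOM if present, else the assumed default."""
    encoding, skip = sniff_bom(data[:4])
    return data[skip:].decode(encoding or default)
```

Note that a file without a BOM tells this check nothing, which is why a
default encoding still has to be assumed from somewhere else.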

Then one should develop file format standards, but still not bundle them
together with a character encoding. On UNIX platforms, the main thing is
clearly not to screw up the computing core.

> Thankfully, HTTP and MIME are not so defective and propose better
> strategies. My view on this problem is that filesystems should opt for
> offering at least the same native support as HTTP and MIME, as part of
> the internal services offered and secured by the OS itself and its
> filesystem API. Emulation could be provided to support storage on
> legacy filesystems, but the OS should then hide this detail from
> applications, by refusing to honor a file-open request on the internal
> meta-data storage, or by not listing the meta-data storage files when
> enumerating the contents of a directory.

Again, this is very laudable, but it should not be bundled together with a
character encoding standard.

> Instead, the proper way to access this information would be to open the
> supplementary stream using a pair (filename, stream name), where the
> stream name would match standard MIME header names, and where the
> stream name would be empty/null/void to access the main (content)
> stream. This requires a supplementary API for OS file services, and the
> OS should make every effort to ensure that when files are renamed
> or moved across a filesystem (or even across multiple volumes or
> remote locations), the meta-data stays associated with the file in the
> new location (HTTP already has this implemented natively as part of its
> design; it is even required with HTTP/1.1. So why not an HTTP filesystem,
> offering all the same services as HTTP?).
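
To make the (filename, stream name) idea concrete, here is a toy Python
sketch of how such an API might be emulated on a legacy filesystem. Everything
in it is hypothetical: the StreamFS class, its method names, and the ".meta"
sidecar-directory convention are invented for illustration and do not
correspond to any existing OS service.

```python
import os
import shutil

META_SUFFIX = ".meta"  # hypothetical naming convention for the hidden sidecar


class StreamFS:
    """Toy emulation of named streams over a flat directory (illustrative)."""

    def __init__(self, root):
        self.root = root

    def _path(self, filename, stream=""):
        if filename.endswith(META_SUFFIX):
            # Refuse direct access to the internal meta-data storage.
            raise PermissionError("meta-data storage is not directly accessible")
        base = os.path.join(self.root, filename)
        if stream == "":
            return base  # empty stream name selects the main content stream
        return os.path.join(base + META_SUFFIX, stream)

    def write(self, filename, data, stream=""):
        path = self._path(filename, stream)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def read(self, filename, stream=""):
        with open(self._path(filename, stream), "rb") as f:
            return f.read()

    def listdir(self):
        # Hide the sidecar directories when enumerating the directory.
        return sorted(n for n in os.listdir(self.root)
                      if not n.endswith(META_SUFFIX))

    def rename(self, old, new):
        # Move the content and its meta-data together.
        old_path, new_path = self._path(old), self._path(new)
        os.rename(old_path, new_path)
        if os.path.isdir(old_path + META_SUFFIX):
            shutil.move(old_path + META_SUFFIX, new_path + META_SUFFIX)
```

With this sketch, fs.write("a.txt", b"text/plain; charset=utf-8",
stream="Content-Type") would store the encoding out of band, and
fs.rename("a.txt", "b.txt") would carry it along. Any tool that bypasses the
wrapper and touches the directory directly can still separate the two, which
is exactly the binding problem described above.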

But those ideas will take time to develop, and fall outside the scope of a
character encoding. Otherwise, there is nothing wrong with those ideas as
such.

Hans Aberg
