Re: Subject: Re: 32'nd bit & UTF-8

From: Hans Aberg (
Date: Thu Jan 20 2005 - 12:16:11 CST


    On 2005/01/20 14:36, Philippe Verdy wrote:

    >> I think the correct solution is that MS looks over its software, to
    >> ensure that files transferred off that platform get a distributed
    >> format.
    > I don't understand your argument:
    > - first it is not a MS bug, or MS specific problem.
    > - In fact MS could rapidly deprecate the storage of a BOM in UTF-8
    > encoded text files, if it adopts a standard identification system for
    > meta-data. MS already has the necessary support in NTFS, and is
    > preparing Longhorn with a database-driven filesystem. On such a
    > filesystem, storing a BOM will no longer be necessary, because the
    > effective file format will be stored out of band in a separate stream or
    > in a meta-data repository.
    > - I think that Mac filesystems can already store such meta-data
    > information about file formats within the resource fork.

    Mac OS X now has BSD UNIX underneath, and the so-called resource fork is
    split across several files, which are bundled together so as to look like
    a single file.

    > - VMS filesystems already had support for multiple streams for a given
    > file.

    So, in short, you are saying that nobody needs BOMs. Yet Unicode still
    requires them.

    > If there's a problem, it's not from Microsoft, but from the initial
    > design of Unix filesystems:
    > - UFS promoted the file extension (which was initially meta-data in
    > mainframe filesystems, in VMs or even in CPM) as part of the filename
    > - UFS promoted the filename from a simple unique access key to one with a
    > descriptive role. This was an error
    > - Unix filesystem APIs were simplified to allow access to files only
    > in terms of streams of binary bytes, without any indication of what
    > they mean. The only remaining information can come from the filename,
    > but Unix has NO standard for file extensions.

    Look at the post by Marcin 'Qrczak' Kowalczyk, who describes a couple more
    issues. The problem runs deeper than merely finding a file-specific
    encoding.

    > For these reasons, almost all Unix applications that have to deal with
    > various file formats have to find a way to determine the effective
    > file format: see the kludgy definition of "magic" files (which does not
    > work as effectively as it should). So applications need to develop
    > their own storage for meta-data, and users must administer this
    > information, whose binding to the actual files in the filesystem is not
    > guaranteed and secured by the filesystem or OS. All these are in fact
    > serious defects of Unix. Don't blame Microsoft for them.
    > Using a BOM on Unix will greatly help to determine the file format. It is
    > simple to parse the first few bytes (with the "magic" method), and then
    > read and interpret all the rest of the file correctly. Storing a BOM at
    > the beginning of a file is a way to palliate the absence of a separate
    > stream to store meta-data.
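
The BOM check described in the quoted paragraph can be sketched roughly as
follows. This is only an illustration, not code from either poster: the
function names `sniff_bom` and `decode_with_bom` are hypothetical, the BOM
constants come from Python's standard `codecs` module, and the fallback to
UTF-8 when no BOM is present is an assumption. Longer BOMs are tested first
because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE
BOM, so a UTF-16 LE file whose first character is U+0000 would be misread.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes, default: str = "utf-8"):
    """Return (encoding, bom_length) guessed from the leading bytes.

    The default encoding is an assumption used when no BOM is found.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Strip a leading BOM, if any, and decode the rest accordingly."""
    encoding, skip = sniff_bom(data, default)
    return data[skip:].decode(encoding)
```

A caller would read the whole file in binary mode and pass the bytes to
`decode_with_bom`; a file without a BOM is simply decoded with the default.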

    Then one should develop file format standards, but still not bundle them
    together with a character format. On UNIX platforms, the main thing is
    clearly not to screw up the computing core.

    > Thankfully, HTTP and MIME are not so defective and propose better
    > strategies. My view on this problem is that filesystems should opt for
    > offering at least the same native support as HTTP and MIME, as part of
    > the internal services offered and secured by the OS itself and its
    > filesystem API. Emulation could be provided to support storage on
    > legacy filesystems, but the OS should then hide this detail from
    > applications, by refusing to honor a file-open request on an internal
    > meta-data storage, or by not listing the meta-data storage files when
    > enumerating the contents of a directory.

    Again, this is very laudable, but it should not be bundled together with a
    character encoding standard.

    > Instead, the proper way to access this information would be to open the
    > supplementary stream, using a pair (filename, stream name), where the
    > stream name would match standard MIME header names, and where the
    > stream name would be empty/null/void to access the main (content)
    > stream. This requires a supplementary API for OS file services, and the
    > OS should make every effort to ensure that files can be renamed
    > and moved across a filesystem (and even across multiple volumes or
    > remote locations) while keeping the meta-data associated with the file
    > in the new location (HTTP already has this implemented natively as part
    > of its design, it is even required with HTTP/1.1; so why not an HTTP
    > filesystem, offering all the same services as HTTP?).
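
The (filename, stream name) access proposed in the quoted paragraph, together
with the emulation on legacy filesystems mentioned earlier, might look
something like the sketch below. Everything here is an assumption made for
illustration: no such OS API is described in the thread, and the names
`write_stream`, `read_stream`, `list_files`, `move_file`, and the hidden
`.<name>.meta` sidecar directory are all hypothetical.

```python
import os
import shutil

META_SUFFIX = ".meta"  # hypothetical naming convention for the sidecar directory


def _stream_path(filename, stream_name):
    # An empty stream name addresses the main (content) stream.
    if not stream_name:
        return filename
    directory, base = os.path.split(filename)
    return os.path.join(directory, "." + base + META_SUFFIX, stream_name)


def write_stream(filename, stream_name, data: bytes):
    """Write bytes to the main stream ("") or to a named meta-data stream."""
    path = _stream_path(filename, stream_name)
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)


def read_stream(filename, stream_name=""):
    """Read the main stream ("") or a named meta-data stream."""
    with open(_stream_path(filename, stream_name), "rb") as f:
        return f.read()


def list_files(directory):
    """Enumerate a directory while hiding the meta-data sidecar directories."""
    return sorted(
        name for name in os.listdir(directory)
        if not (name.startswith(".") and name.endswith(META_SUFFIX))
    )


def move_file(src, dst):
    """Move a file and carry its meta-data streams along with it."""
    src_meta = os.path.dirname(_stream_path(src, "x"))
    dst_meta = os.path.dirname(_stream_path(dst, "x"))
    shutil.move(src, dst)
    if os.path.isdir(src_meta):
        shutil.move(src_meta, dst_meta)
```

For example, `write_stream("note.txt", "Content-Type", b"text/plain; charset=utf-8")`
would record the encoding out of band, and `read_stream("note.txt")` would
return only the content bytes. Nothing in this emulation stops other tools
from moving the file without its sidecar, which is the binding problem the
quoted text attributes to legacy filesystems.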

    But those ideas will take time to develop, and fall outside the scope of a
    character encoding. There is nothing wrong with those ideas as such.

      Hans Aberg
