RE: Subject: Re: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 08:47:19 CST

    Philippe Verdy wrote:
    > I don't understand your argument:
    > - first it is not a MS bug, or MS specific problem.
    > - In fact MS could deprecate rapidly the storage of BOM in UTF-8
    > encoded text files, if it adopts a standard identification system for
    > meta-data. MS already has the necessary support in NTFS, and is
    > preparing LongHorn with a database driven filesystem. On such
    > filesystem, storing BOM will no longer be necessary, because the
    > effective fileformat will be stored out of band in a separate
    > stream or in a meta-data repository.
    > - I think that Mac filesystems can already store such meta-data
    > information about fileformats within the resource fork.
    > - VMS filesystems already had support for multiple streams for a given
    > file.

    Many filesystems have support for separate streams. Mostly these are used
    for security information and such. Microsoft already attempted to store
    document properties in them (author, summary and such), but abandoned the
    concept. Why? Because many file transfer protocols exist that do not support
    substreams. And security is handled so differently on all platforms that it
    is actually wise to strip it. Anything else is doomed to be lost. Or can you
    persuade me that all protocols will support it, including mail attachments,
    zip, rar, floppy, CD-ROM...? Maybe, in some future, but not soon enough for
    us to make use of it. Hence, the BOM will remain for quite a while.

    BTW, with any of these approaches I'd also worry about what happens when the
    sources contradict each other. The file's meta stream would indicate UTF-16,
    the file would start with a UTF-8 BOM, and the HTTP metadata would say
    Latin 1. Probably a matter of priority, but it could prove confusing anyway.
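
One way to picture the priority question is a small resolver that takes whatever each source claims and reports both the winner and whether the sources disagree. This is only a sketch: the precedence order (BOM first, then HTTP, then the file's meta stream) and the names `PRECEDENCE` and `resolve_encoding` are assumptions made for illustration, not something the thread settles.

```python
import codecs

# Assumed precedence, highest first. The thread does not fix an order.
PRECEDENCE = ("bom", "http", "file_meta")


def resolve_encoding(claims):
    """Return (chosen, conflict) for a dict mapping source name -> encoding label.

    chosen is the normalized codec name from the highest-priority source
    that made a claim, or None if no source did. conflict is True when the
    sources that made a claim do not all agree.
    """
    normalized = [
        codecs.lookup(claims[source]).name
        for source in PRECEDENCE
        if claims.get(source)
    ]
    if not normalized:
        return None, False
    return normalized[0], len(set(normalized)) > 1


# The scenario from the message: three sources, three different answers.
chosen, conflict = resolve_encoding(
    {"file_meta": "UTF-16", "bom": "UTF-8", "http": "latin-1"}
)
```

A caller could use the conflict flag to warn the user instead of silently trusting one source.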

    >
    > If there's a problem, it's not from Microsoft, but from the initial
    > design of Unix filesystems:
    > - UFS promoted the file extension (which was initially meta-data in
    > mainframe filesystems, in VMs or even in CPM) as part of the filename
    > - UFS promoted the filename from a simple unique access key with a
    > descriptive role. This was an error
    > - Unix filesystems APIs were simplified to allow access to files only
    > in terms of streams of binary bytes, without any indication of what
    > they mean. The only remaining information can come from the filename,
    > but Unix has NO standard for file extensions.

    Must be the pipes. Once you introduce the metadata, you need to drag it
    everywhere. And it doesn't hold water. Strictly speaking, you would need two
    meta tags, one for the file contents and one for the file name. Then it
    gets worse. You concatenate two files or run grep. How do you tag the
    output? You can of course convert everything, but at the time there was no
    Unicode to convert to. And, yes, it requires a text mode. UNIX got away with
    not introducing one, with many useful consequences.
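
The concatenation problem is easy to reproduce with in-band tags such as the UTF-8 BOM. The sketch below joins two BOM-prefixed byte strings the way a byte-oriented tool would, then shows what a BOM-aware join would have to do instead. The helper names `naive_concat` and `bom_aware_concat` are hypothetical and exist only for this illustration.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def naive_concat(*parts):
    """Byte-level concatenation, as a tool with no text mode would do it."""
    return b"".join(parts)


def bom_aware_concat(*parts):
    """Keep the first part as is and drop a leading BOM from every later part."""
    if not parts:
        return b""
    rest = [p[len(BOM):] if p.startswith(BOM) else p for p in parts[1:]]
    return b"".join([parts[0], *rest])


first = BOM + "abc\n".encode("utf-8")
second = BOM + "def\n".encode("utf-8")

# "utf-8-sig" strips only a BOM at the very start of the data.
naive_text = naive_concat(first, second).decode("utf-8-sig")
aware_text = bom_aware_concat(first, second).decode("utf-8-sig")
```

With the naive join, the second file's BOM survives as a stray U+FEFF in the middle of the text; the aware join avoids that, but only because the tool now has to understand the data as text.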

    Plus, you need to keep tagging things, and it cannot always be done
    automatically. In the early days, so few files were produced that tagging
    was not such a big issue. On some systems you were supposed to define the
    initial file size, the extension increment when appending, maximum number of
    extensions, ... Useful in some cases, especially if you want to keep the
    disk fragmentation low and prevent files from using up the disk space, but
    very user unfriendly. UNIX takes what I think is a good approach: assume all
    data is in the same format or encoding and let the user fix the problem at
    retrieval time. Once everything is UTF-8, this would work. It might be a bit
    of a pain during the transition, but code page mixups were always there and
    we know how to deal with them. The only catch is that there can be only ONE
    format. UTF-8 is that format, and all other UTFs would be implicitly
    discouraged in this approach. Which is perhaps not bad at all.

    > Using BOM on Unix will greatly help to determine the file format. It is
    > simple to parse the first few bytes (with the "magic" method), and then
    > read and interpret all the rest of the file correctly. Storing a BOM at
    > the beginning of a file is a way to palliate the absence of a separate
    > stream to store meta-data.

    It requires a text mode API. Or you need to deal with it yourself in every
    program. In either case it is a lot of work, not only for developers but
    also for users, who will sometimes need to decide, and who will bang their
    heads when they forget to do so.
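
To make "deal with it yourself" concrete, here is a minimal sketch of the BOM check each program would need before it interprets the rest of the data. The function names `sniff_bom` and `decode_text` are hypothetical, and falling back to UTF-8 when no BOM is present is an assumption of this example.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_text(data, default="utf-8"):
    """Decode bytes using the BOM if present, else the assumed default."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)
```

Every tool that reads text would need this step or an equivalent, which is the amount of work the paragraph above is pointing at.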

    >
    > Thanks, HTTP and MIME, are not so much defective and propose better
    > strategies. My view on this problem is that filesystems should opt for
    > offering at least the same native support as HTTP and MIME, as part of
    > the internal services offered and secured by the OS itself and its
    > filesystems API. Emulation could be provided to support storage on
    > legacy filesystems, but the OS should then hide this detail to
    > applications, by refusing to honor a fileopen service on an internal
    > meta-data storage, or by not listing the meta-data storage files when
    > enumerating the contents of a directory.
    >
    > Instead, the proper way to access this information would be to open the
    > supplementary stream, using a couple (filename, stream name), where the
    > stream name would match standard MIME header names, and where the
    > stream name would be empty/null/void to access to the main (content)
    > stream. This requires a supplementary API for OS file services, and the

    I'd say implementing the separate stream on all levels is even more
    difficult than implementing the text mode. And it has to be done on all
    platforms, not just UNIX. But, yes, it does get rid of the BOM.
    Alternatively, one could adopt the filename as THE container to hold
    identification, description, type and also the format: "My File.utf8.txt".
    Microsoft already has a habit of hiding the extensions. Hide the format and
    there you have your meta data. Portable. Of course, much like I prefer to
    see the extensions, I would also opt to see the format.
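
The filename idea can be sketched as a small parser that pulls a format tag out of the name and returns the name a shell might display when the tag is hidden. Everything here is hypothetical: the tag set `KNOWN_FORMATS`, the function `split_tagged_name`, and the rule that the tag sits just before the final extension are assumptions for illustration only.

```python
# Assumed set of recognized format tags. No such registry exists.
KNOWN_FORMATS = {"utf8", "utf16", "latin1"}


def split_tagged_name(filename):
    """Return (display_name, format_tag) for a name like 'My File.utf8.txt'.

    The tag is recognized only when it is the second-to-last dot-separated
    component and appears in KNOWN_FORMATS. Otherwise the name is returned
    unchanged with a tag of None.
    """
    parts = filename.split(".")
    if len(parts) >= 3 and parts[-2].lower() in KNOWN_FORMATS:
        display_name = ".".join(parts[:-2] + parts[-1:])
        return display_name, parts[-2].lower()
    return filename, None


display_name, format_tag = split_tagged_name("My File.utf8.txt")
```

Restricting the tag to a known set keeps names such as "archive.tar.gz" from being misread, at the cost of needing an agreed list of tags.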

    Not that I'm saying this is how it should be done. Though perhaps Microsoft
    could use it to get the meta tag they need, and get rid of the BOM. They
    already have the text mode and already transform data, so they can mix data
    from several different streams into one. Perhaps they could have something
    like "stdout.UTF-16", "stdout.UTF-8". Anyway, UNIX can't mix different
    streams and needn't. At least not the UNIX as we know it, as long as UTF-8
    prevails as THE format for interchange.

    Lars


