RE: Subject: Re: 32'nd bit & UTF-8

From: Lars Kristan (lars.kristan@hermes.si)
Date: Thu Jan 20 2005 - 08:47:19 CST

Next message: Mark E. Shoulson: "Re: Subject: Re: 32'nd bit & UTF-8"

Previous message: Mark E. Shoulson: "Re: 32'nd bit & UTF-8"
Maybe in reply to: Arcane Jill: "Subject: Re: 32'nd bit & UTF-8"
Next in thread: Lars Kristan: "RE: Subject: Re: 32'nd bit & UTF-8"

Philippe Verdy wrote:
> I don't understand your argument:
> - first it is not a MS bug, or MS specific problem.
> - In fact MS could rapidly deprecate the storage of a BOM in UTF-8
> encoded text files, if it adopted a standard identification system for
> meta-data. MS already has the necessary support in NTFS, and is
> preparing Longhorn with a database-driven filesystem. On such a
> filesystem, storing a BOM will no longer be necessary, because the
> effective file format will be stored out of band in a separate stream
> or in a meta-data repository.
> - I think that Mac filesystems can already store such meta-data
> information about file formats within the resource fork.
> - VMS filesystems already had support for multiple streams for a given
> file.

Many filesystems support separate streams. Mostly these are used for
security information and the like. Microsoft already attempted to store
document properties in them (author, summary and such), but abandoned the
concept. Why? Because many file transfer protocols do not support
substreams. And security is handled so differently across platforms that it
is actually wise to strip it. Anything else is doomed to be lost. Or can you
persuade me that everything will support it, including mail attachments,
zip, rar, floppy, CD-ROM... Maybe in some future, but not soon enough for us
to make use of it. Hence, the BOM will remain for quite a while.

BTW, with any of these approaches I'd also worry about what happens when the
information is contradictory. A file's meta stream could indicate UTF-16,
the file could start with a UTF-8 BOM, and the HTTP meta data could say
Latin 1. Probably a matter of priority, but it could prove confusing anyway.
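
To make the "matter of priority" concrete, here is a minimal Python sketch.
The source names ("http", "bom", "stream"), their order, and the function
name resolve_encoding are assumptions made up for illustration; no standard
defines this precedence. The sketch picks the highest-priority declaration
and reports every lower-priority one that disagrees, so the confusion is at
least visible.

```python
def resolve_encoding(declared, priority=("http", "bom", "stream")):
    """Pick one encoding from several declarations.

    declared maps a source name to the encoding it claims.
    The priority order is an assumption, not a standard.
    Returns (chosen_encoding, conflicts).
    """
    # Keep only the sources that actually declared something, in priority order.
    present = [(src, declared[src].lower()) for src in priority if declared.get(src)]
    if not present:
        return None, []
    _, chosen = present[0]
    # Every lower-priority source that disagrees is reported as a conflict.
    conflicts = [(src, enc) for src, enc in present[1:] if enc != chosen]
    return chosen, conflicts


# The contradictory case from the paragraph above.
declared = {"stream": "UTF-16", "bom": "UTF-8", "http": "ISO-8859-1"}
chosen, conflicts = resolve_encoding(declared)
print(chosen, conflicts)
```

Whatever order is chosen, two of the three declarations in this example are
wrong, and the code has no way of telling which.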

>
> If there's a problem, it's not from Microsoft, but from the initial
> design of Unix filesystems:
> - UFS promoted the file extension (which was initially meta-data in
> mainframe filesystems, in VMs or even in CP/M) as part of the filename
> - UFS promoted the filename from a simple unique access key to one with
> a descriptive role. This was an error
> - Unix filesystem APIs were simplified to allow access to files only
> in terms of streams of binary bytes, without any indication of what
> they mean. The only remaining information can come from the filename,
> but Unix has NO standard for file extensions.

Must be the pipes. Once you introduce meta data, you need to drag it
everywhere. And it doesn't hold water. Strictly speaking, you would need two
meta tags, one for the file contents and one for the file name. Then it
gets worse. You concatenate two files or run grep. How do you tag the output?
You can of course convert everything, but at the time there was no Unicode
to convert to. And, yes, it requires a text mode. UNIX got away with not
introducing one, with many useful consequences.
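
The tagging problem with concatenation can be shown in a few lines of
Python. This is only an illustration with made-up sample strings: two byte
streams in different encodings are joined the way cat would join them, and
no single encoding tag describes the result.

```python
# Two "files" in different encodings (sample text is illustrative only).
latin1_part = "caf\u00e9\n".encode("latin-1")
utf8_part = "na\u00efve\n".encode("utf-8")

# Byte-level concatenation, which is what `cat a b` does.
combined = latin1_part + utf8_part


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Tagging the output as UTF-8 fails on the Latin-1 half.
print(try_decode(combined, "utf-8"))
# Tagging it as Latin-1 "works" but silently garbles the UTF-8 half.
print(repr(try_decode(combined, "latin-1")))
```

Converting both inputs to one encoding before joining them avoids this, which
is the "convert everything" option mentioned above.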

Plus, you need to keep tagging things; it cannot always be done
automatically. In the early days, so few files were produced that tagging
was not such a big issue. On some systems you were supposed to define the
initial file size, the extension increment when appending, the maximum number
of extensions, ... Useful in some cases, especially if you want to keep
disk fragmentation low and prevent files from using up the disk space, but
very user-unfriendly. UNIX takes what I think is a good approach: assume all
data is in the same format or encoding and let the user fix the problem at
retrieval time. Once everything is UTF-8, this will work. It might be a bit
of a pain during the transition, but code page mixups were always there and
we know how to deal with them. The only catch is that there can be only ONE
format. UTF-8 is that format, and all other UTFs would be implicitly
discouraged in this approach. Which is perhaps not bad at all.

> Using BOM on Unix will greatly help to determine the file format. It
> is simple to parse the first few bytes (with the "magic" method), and
> then read and interpret all the rest of the file correctly. Storing a
> BOM at the beginning of a file is a way to palliate the absence of a
> separate stream to store meta-data.

It requires a text mode API. Or you need to deal with it yourself in every
program. In either case it is a lot of work, not only for developers but
also for users, who will sometimes need to decide, and will bang their heads
when they forget to do so.
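
For reference, "dealing with it yourself" looks roughly like the Python
sketch below. It uses the BOM constants from the standard codecs module; the
function names sniff_bom and read_text and the UTF-8 default are assumptions
for illustration. It also shows why every program has to cooperate: joining
two BOM-prefixed files at the byte level leaves a stray U+FEFF in the middle
of the text.

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE). A UTF-16-LE file whose first
# character is U+0000 would be misdetected; this sketch ignores that case.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def read_text(data, default="utf-8"):
    """Decode bytes, honouring a leading BOM and stripping it from the result."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or default)


one_file = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(repr(read_text(one_file)))
# Byte-level concatenation keeps the second file's BOM inside the text.
print(repr(read_text(one_file + one_file)))
```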

>
> Thankfully, HTTP and MIME are not so defective and propose better
> strategies. My view on this problem is that filesystems should opt for
> offering at least the same native support as HTTP and MIME, as part of
> the internal services offered and secured by the OS itself and its
> filesystem API. Emulation could be provided to support storage on
> legacy filesystems, but the OS should then hide this detail from
> applications, by refusing to honor a file-open service on an internal
> meta-data storage, or by not listing the meta-data storage files when
> enumerating the contents of a directory.
>
> Instead, the proper way to access this information would be to open
> the supplementary stream, using a couple (filename, stream name),
> where the stream name would match standard MIME header names, and
> where the stream name would be empty/null/void to access to the main
> (content) stream. This requires a supplementary API for OS file
> services, and the

I'd say implementing the separate stream on all levels is even more
difficult than implementing a text mode. And it has to be done on all
platforms, not just UNIX. But, yes, it does get rid of the BOM.
Alternatively, one could adopt the filename as THE container to hold
identification, description, type and format. So, "My File.utf8.txt".
Microsoft already has a habit of hiding the extensions. Hide the format and
there you have your meta data. Portable. Of course, much as I prefer to
see the extensions, I would also opt to see the format.
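
A rough Python sketch of how such a name could be taken apart follows. The
convention itself, the tag table KNOWN_TAGS, and the function name
split_tagged_name are all hypothetical; nothing here is an existing standard
or API. The second-to-last dotted component is treated as the format tag
only when it is a recognised one.

```python
# Hypothetical tag table: filename tag -> encoding name.
KNOWN_TAGS = {"utf8": "utf-8", "utf16": "utf-16", "latin1": "latin-1"}


def split_tagged_name(filename):
    """Split "name.tag.ext" into (name, encoding, ext).

    encoding is None when the name carries no recognised format tag.
    """
    parts = filename.rsplit(".", 2)
    if len(parts) == 3 and parts[1].lower() in KNOWN_TAGS:
        stem, tag, ext = parts
        return stem, KNOWN_TAGS[tag.lower()], ext
    # No recognised tag: fall back to an ordinary name/extension split.
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        return filename, None, ""
    return stem, None, ext


print(split_tagged_name("My File.utf8.txt"))
print(split_tagged_name("report.txt"))
```

A shell that hides extensions would display only the first element and pass
the second to whatever opens the file.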

Not that I am saying this is how it should be done. Though perhaps Microsoft
could use it to get the meta tag they need and get rid of the BOM. They
already have a text mode and already transform data, so they can mix data
from several different streams into one. Perhaps they could have something
like "stdout.UTF-16", "stdout.UTF-8". Anyway, UNIX can't mix different
streams and needn't. At least not UNIX as we know it. As long as UTF-8
prevails as THE format for interchange.

Lars

This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 08:48:30 CST