Re: (Informational only: UTF-8 BOM and the real life)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sun, 29 Jul 2012 14:35:56 +0200

So the effective file length is metadata stored outside of the main
stream (just like the filename(s) used in the hierarchical
filesystem). I just wonder why Linux/Unix filesystems and APIs still
refuse to integrate more complete support for metadata in their
filesystems (including finer-grained security controls like ACLs,
which are now evolving to become less host-centric and more
network/Internet-oriented with domains).
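To make the first point concrete: the one piece of out-of-band metadata
POSIX filesystems do expose is the inode, which is where the effective
file length lives. A minimal sketch in Python (the file is a temporary
one created just for the demonstration):

```python
# The effective file length lives in the inode, outside the byte
# stream itself (unlike CP/M, which needed an in-band ^Z marker).
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

st = os.stat(path)
assert st.st_size == 5   # length is filesystem metadata, not stream content
os.remove(path)
```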
Of course, various other filesystems do support metadata, such as the
legacy Mac OS filesystem with its resource forks, or VMS filesystems,
and many of these features are now supported through additional
filesystem drivers on Linux/Unix. But why this isn't formalized as a
plain part of the OS, in all its APIs, remains a mystery to me. The
low-level single-stream API is now too low-level and lacks granularity
(we should even be able to organize filesystems in a less
hierarchical, more relational and more object-oriented way).
Things like data signatures could even be separated more formally, as
could the actual encoding of end-of-lines, paragraphs, or even word
boundaries. The I/O API would then treat files only as "views"
offering the wanted services for interacting with them, including
plain-text files, which are structured and should be augmentable with
any number of additional metadata at various levels.
At least the most interoperable "filesystem" that exists today, HTTP,
happily supports a basic layer of metadata, with additional features
like cache control, management of object lifetimes, and basic security
at the file level. But further development is still needed to extend
it, first to a full relational filesystem, then to a fully
object-oriented one.
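For comparison, here is the kind of per-resource metadata an HTTP
response already carries that a raw byte stream does not (the header
values below are illustrative, not from any real server):

```python
# Per-resource metadata carried by HTTP but absent from a raw stream:
# media type and charset, size, cache lifetime, and a cache validator.
headers = {
    "Content-Type": "text/plain; charset=utf-8",   # declared encoding
    "Content-Length": "11",                        # effective length
    "Cache-Control": "max-age=3600",               # lifetime management
    "ETag": '"v1-0a1b2c"',                         # validator for caching
}

# A client can recover the declared charset without guessing:
charset = headers["Content-Type"].split("charset=")[1]
print(charset)   # utf-8
```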
At that point, we will wonder even more what "plain text" really is.
For now, Linux/Unix still has limited support for text files: it is
realized only in user-space libraries, and not enforced by the OS,
which exposes too many things that are modifiable independently,
without any control. Things like the internal encoding of text files,
or the encoding of end-of-lines, are then managed by software in a
not-completely-interoperable way, only because the requirements are
neither checked nor enforced. We still live in a world where whole
binary streams are interchanged, and are difficult to interpret simply
because the necessary metadata are not checked or not transported as a
requirement. With an object-oriented design, based on APIs rather than
unstructured streams, we could even get more performance (over
long-distance WAN links with high latency), by avoiding the transport
of lots of unqualified things that everyone interprets and implements
as they please.
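That user-space handling of encodings and end-of-lines is visible in,
for example, Python's I/O layer, where both are parameters of open()
rather than properties checked by the OS. A minimal sketch:

```python
# Newline translation and character encoding are applied by the
# user-space I/O layer, not the kernel: binary mode sees raw bytes.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

with open(path, "w", encoding="utf-8", newline="\r\n") as f:
    f.write("line one\nline two\n")        # "\n" rewritten as "\r\n"

with open(path, "rb") as f:                # binary mode: no translation
    raw = f.read()
assert raw == b"line one\r\nline two\r\n"

with open(path, "r", encoding="utf-8") as f:   # universal newlines
    assert f.read() == "line one\nline two\n"

os.remove(path)
```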
And even if we are seeing a recent surge of "cloud" solutions, this is
not really the solution we need for the long term: it is still the
old-fashioned, unbalanced model of clients and servers, instead of
peer-to-peer systems where peers cooperate transparently to offer the
service in a virtual host that can be located anywhere; where devices
can connect and add their own computing power to the system, working
transparently on data belonging to separate virtual spaces; and where
things like redundancy, backups, resilience to hardware failures, and
demands for more power can be met transparently by the rest of the
network, with automatic optimizations to reduce latency through
automatic caching and distributed validation of the data. The code
managing this data would migrate from one node to another
transparently in the background, on demand. The only way to interact
with such a system would be through objects exposing their APIs and
their security requirements, and managing the identities of actors and
their access rights.
Things that are old-fashioned in the "stream" approach are, for
example, file positions. Ideally, texts are just enumerations of
objects like paragraphs, themselves structured as enumerations of
lower-level units, down to the lowest level, which is the code point.
Code units (including surrogate artefacts), bytes, encodings, and data
compression do not belong to the definition of plain text (which
should be transparently convertible to whatever a user needs to
handle, show, or transform in the way most convenient for them). And
there's no reason why we should be able to interact with texts only at
the current "file" level, under a single security realm and with a
single owner of the stream, when that stream could just be a private
view on larger objects that are managed collectively and do not expose
the same thing to everyone (a user, a group, a security domain, an
application or service, or another object used to create distinct
views on any of them for specific needs).
Some day we will even forget what UTF-8 is. And maybe the correct
minimum level for handling text will be the grapheme cluster,
represented as an object through its own local API. There will be a
complete separation between text input, text storage, text
interchange between computing nodes, text transforms, and text
output. Programmers will no longer write programs working at the
stream level (that level being defined only within the black box of
the underlying OS connecting users with their shared applications and
data, accessible over a worldwide network from all kinds of devices).
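The distinction between grapheme clusters, code points, and code units
can be seen with Python's built-in strings, which model text as a
sequence of code points (notably, the standard library offers no
grapheme-cluster API at all, which rather proves the point):

```python
# One user-perceived character ("é") can be two code points, and one
# code point outside the BMP becomes a surrogate pair in UTF-16.
s = "e\u0301"                    # base letter + combining acute accent
assert len(s) == 2               # two code points, one grapheme cluster
assert len(s.encode("utf-8")) == 3

g = "\U0001D11E"                 # MUSICAL SYMBOL G CLEF (U+1D11E)
assert len(g) == 1                        # one code point
assert len(g.encode("utf-16-le")) == 4    # two 16-bit code units
assert len(g.encode("utf-8")) == 4        # four bytes in UTF-8
```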

This is a dream. We are far from this level: full network-based
peer-to-peer OSes still do not exist, we are still working too near
the hardware level, and software is still not perceived as a
location-independent and hardware-independent service. As a
consequence, we now have billions of computing devices connected to
the net that spend 95% of their powered-on time with nothing to do,
the impossibility of harnessing the extra computing power that is
available almost everywhere except when we need it locally, plus
gigatons of devices recomputing the same things, wasting lots of
energy and hardware, and polluting our environment.

2012/7/29 John W Kennedy <jwkenne_at_attglobal.net>:
> On Jul 28, 2012, at 11:52 AM, Doug Ewell <doug_at_ewellic.org> wrote:
>> ^Z as an EOF marker for text files was part of the MS-DOS legacy from
>> CP/M, where all files were written to a multiple of the disk block size
>> (I think 128 for CP/M and 512 for MS-DOS 1.x), and there had to be some
>> way to tell where the real text content ended. New stream-based I/O
>> calls in MS-DOS 2.0 made this mechanism unnecessary. Unix systems had no
>> legacy from CP/M, so they never had this problem.
>
> Worse than that, actually. Actual MS-DOS APIs from 1.0 on were able to handle the situation, but the MS-DOS BASIC language and interpreter, with CP/M roots, assumed the 128-byte sector, and therefore demanded the ^Z. It was fixed as early as 1.1, I think, but the malady lingers on.
>
>
>
>>> I.e., this is why we do have this messy text OR binary file I/O
>>> distinction like O_BINARY (for open(2)), "b" (for fopen(3)) or
>>> binmode (perl(1)). Because without those a text file will see
>>> End-Of-File at the ^Z, not at the real end of the file.
>>
>> The reason for the text/binary distinction on DOS and Windows is
>> conversion between Unix-standard LF and Windows (DOS, CP/M)-standard
>> CRLF. It might be true that library calls to read a file in text mode
>> will stop at ^Z, but Notepad and Wordpad don't. I know the library
>> doesn't automatically write ^Z. Almost nobody in the MS world uses the
>> ^Z convention on purpose any more; many don't even know about it.
>>
>>> (Which rises the immediate question why the Microsoft programmers did
>>> not embed the meta information in this section at the end of the file.
>>> But i don't really want to know.)
>>
>> See above. The intent of ^Z was never to distinguish data from metadata,
>> as with the Mac data and resource forks.
>>
>> But of course none of this has anything to do with U+FEFF.
>>
>>> So do the programmers have to face the same conditions? I don't
>>> really think so. They prefer driving plain text readers up the wall.
>>> Successfully.
>>
>> Again, we don't really have this kind of evil intent, though it's often
>> fun and convenient for people to imagine we do.
>>
>> --
>> Doug Ewell | Thornton, Colorado, USA
>> http://www.ewellic.org | @DougEwell
>>
>
> --
> John W Kennedy
> "Give up vows and dogmas, and fixed things, and you may grow like That. ...you may come to think a blow bad, because it hurts, and not because it humiliates. You may come to think murder wrong, because it is violent, and not because it is unjust."
> -- G. K. Chesterton. "The Ball and the Cross"
>
>
>
>
>
Received on Sun Jul 29 2012 - 07:40:18 CDT

This archive was generated by hypermail 2.2.0 : Sun Jul 29 2012 - 07:40:19 CDT