[off] XML. And RAM

From: Theodore H. Smith (delete@elfdata.com)
Date: Mon Aug 11 2003 - 06:49:31 EDT

    This is a really long email, I realised, so I've broken it into sections.

    Sorry about it being off topic. My ElfData plugin does make extensive
    use of Unicode, though. You can see for yourself at
    www.elfdata.com/plugin/ , and XML is Unicode too. I just thought that,
    seeing as IBM people are here, and they're usually good at large-scale
    solutions, it was worth asking...

    __aim summary__

    I'm thinking about formalising the RAM management in my ElfData
    (string processing) plugin, to let it process, and write out, files
    larger than what is kept in RAM. Let's say I want to process a 2GB
    file, but I want to do it with only 1MB allocated in RAM. That kind of
    thing.

    __answers please!__

    My question is: does anyone know of information on this kind of
    problem domain, on the internet? Or has anyone dealt with this before?

    We all know of programs on OS 9 that can handle huge amounts of data
    with little RAM. They do this by only keeping a small piece of the
    data in RAM at a time.
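    Roughly, something like this, in C++ (the names and the stub are made
    up just to illustrate the windowing idea, they aren't my plugin's
    API):

        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Stand-in for the real work; it only looks at the bytes given.
        static void processChunk(const char* bytes, std::size_t count)
        {
            (void)bytes;
            (void)count;
        }

        // Process a file of any size within a fixed RAM budget, by
        // holding only one window of it in memory at a time.
        void processInWindows(const char* path, std::size_t windowSize)
        {
            std::vector<char> window(windowSize); // e.g. 1MB for a 2GB file
            std::FILE* f = std::fopen(path, "rb");
            if (!f) return;

            std::size_t got;
            while ((got = std::fread(window.data(), 1, window.size(), f)) > 0)
                processChunk(window.data(), got); // never more than windowSize resident

            std::fclose(f);
        }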

    __What should my plugin do, why, and what is "its place"__

    Now, I've pretty much come to the conclusion that it's not my plugin's
    place to do the file system interaction, or the RAM management. I just
    can't make it do enough for everyone, or even for myself. It's usually
    better to let some code be good at one thing, instead of trying to be
    "everything to everyone".

    However, I do think it's my plugin's place to offer hooks that let the
    RB code do the RAM management.
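    Something with the shape of this, maybe (made-up names, just to show
    what I mean by hooks):

        #include <cstddef>

        // The host (the RB code) supplies these callbacks, and the plugin
        // routes its memory decisions through them instead of deciding
        // for itself what stays in RAM.
        struct ElfDataMemoryHooks {
            void* (*acquire)(std::size_t bytes, void* hostContext); // host allocates, or pages in
            void  (*release)(void* block, void* hostContext);       // host frees, or pages out
            void  (*lowMemory)(std::size_t bytesNeeded, void* hostContext); // host can purge caches
            void* hostContext; // opaque pointer handed back to every callback
        };

    That way the plugin stays good at one thing (strings), and whoever
    embeds it decides what lives in RAM and what goes back to disk.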

    Currently, I have a kind of crude setup for managing "families", which
    is a consequence of splitting strings by reference (instead of
    splitting by copy, which is what RB does).
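    For anyone who hasn't seen the idea, split-by-reference looks roughly
    like this (illustrative C++ only, not the actual ElfData internals):

        #include <cstddef>
        #include <memory>
        #include <string>
        #include <vector>

        // A piece is just an offset/length into a shared parent buffer.
        struct Slice {
            std::shared_ptr<const std::string> parent; // keeps the buffer alive
            std::size_t offset = 0;
            std::size_t length = 0;
        };

        std::vector<Slice> splitByReference(
            std::shared_ptr<const std::string> text, char sep)
        {
            std::vector<Slice> pieces;
            std::size_t start = 0;
            for (std::size_t i = 0; i <= text->size(); ++i) {
                if (i == text->size() || (*text)[i] == sep) {
                    pieces.push_back({text, start, i - start});
                    start = i + 1;
                }
            }
            // Every Slice shares ownership of the parent: that is the
            // "family". One surviving one-byte Slice pins the whole buffer.
            return pieces;
        }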

    I'm wondering if anyone has any kind of advice here? Or even some
    thoughts about how it should be done?

    __Here are some design requirements:__

    * I'd like to make my XML Engine capable of processing gigabytes of
    XML, with only a few MB of RAM allocated.
    * I'd like to make my XML Editor do the same, except that it also
    needs to display a tree structure in its editor.
    * Obviously, my ElfData string processing plugin will need some
    improvements to allow this.
    * I don't want to slow down my ElfData plugin noticeably, if possible,
    or make it over-complex.

    This has been done before! I'm sure that programs like gcc and
    CodeWarrior compile files larger than can be stored in RAM, right? Or
    am I wrong?

    __Speculative solutions__

    Here are my thoughts on the issue, some tentative design speculations.
    They're probably incomplete and will need refinement.

    So, basically, it's about resources. Resources need to be formalised.
    Instead of just having some loosely defined system of allocating data
    and disposing of it when there are no references, I should formalise
    my access to resources. Currently, it's hard to get an idea of what is
    allocated where. A one-byte reference to an ElfData object can stop a
    multi-megabyte block of RAM from being deallocated.

    Let's say I have a resource that's a 2GB XML file. I want to do stuff
    to this XML file:

    1) Parse, and validate well-formedness
    2) Validate via DTD or schema
    3) Manipulate, and view graphically
    4) Edit as a text file
    5) Save back to the hard disk

    Currently, I have two options: RB's way (split by copy), and my way
    (split by reference). The ElfData plugin can split by copy too... I
    just want to avoid that.

    __Managing RAM storage and disposal__

    Here's a hypothetical design flaw. I don't know of a real design flaw
    like this, because if I did, I'd fix it, but I want my app to manage
    even the cases where I do miss a piece here and there...

    Now, let's say I store something from a file somewhere in my app. Say
    my design flaw was copying to an internal clipboard. Now, if the user
    closes the file, the 2GB of data will still be allocated, because of
    one reference to a tiny part of it.

    I'm thinking that the best way to deal with that is by making some
    kind of "ElfDataResource" class. This one would manage splitting, and
    keep references to ElfData objects. So basically, each time I split, I
    append a reference to that object in my ElfDataResource (EDR) class.
    Then, when an ElfData object is disposed, it should tell the EDR class
    to remove its reference. Doing that fast enough not to interfere with
    splitting is already a bit of a technical problem.

    When the EDR class is told to close the resource, it should update all
    the ElfData objects that haven't yet been disposed, so that they
    contain a copy of their data instead of a reference to the original.
    The EDR class should also give me statistics on RAM management. That
    could really help.

    That's what I mean by "formalising" my RAM usage.
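    In rough C++ terms, something like this (a sketch with made-up names;
    the real ElfData objects are more involved):

        #include <cstddef>
        #include <memory>
        #include <set>
        #include <string>

        struct ElfDataObject {
            std::shared_ptr<const std::string> parent; // shared source buffer
            std::size_t offset = 0, length = 0;
            std::string ownCopy; // used after the resource is closed

            const char* bytes() const {
                return parent ? parent->data() + offset : ownCopy.data();
            }
        };

        class ElfDataResource {
        public:
            explicit ElfDataResource(std::shared_ptr<const std::string> source)
                : source_(std::move(source)) {}

            // Called on every split: remember who references this resource.
            // (A std::set keeps the sketch simple; an intrusive linked list
            // in each object would make this O(1), which matters if
            // splitting is hot.)
            void registerObject(ElfDataObject* obj) { objects_.insert(obj); }

            // Called when an ElfData object is disposed.
            void unregisterObject(ElfDataObject* obj) { objects_.erase(obj); }

            // Closing the resource: survivors get private copies, so the
            // big source buffer can finally be deallocated.
            void close() {
                for (ElfDataObject* obj : objects_) {
                    obj->ownCopy.assign(source_->data() + obj->offset,
                                        obj->length);
                    obj->parent.reset();
                }
                objects_.clear();
                source_.reset();
            }

            // The statistics part: how many live references pin this
            // resource right now?
            std::size_t liveReferences() const { return objects_.size(); }

        private:
            std::shared_ptr<const std::string> source_;
            std::set<ElfDataObject*> objects_;
        };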

    But then, that still doesn't deal with how to process only a tiny
    part of my data in RAM, keeping most of it on disk...

    __Speculative: My own application's/library's practical application__

    I'm not sure how best to go about this. My guess is this is more of a
    "database" kind of thing, i.e. fast access to a disk. Let's say I
    wanted to use something like Valentina (rumoured to be very fast) to
    do most of my work, so that I don't have to reinvent the wheel. That
    would only be for stuff like storing my graphical editor's interface,
    mind, not the text.

    Another idea might be to have a different kind of paradigm for
    browsing my XML. Maybe something more like the filesystem? The Mac's
    file system doesn't need to hold a whole hard disk in RAM to let me
    navigate its files, so maybe my XML could take a similar approach?
    It's an idea...
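    As a sketch of what I mean (names invented, and the parsing is left as
    a stand-in), a tree node could load its children only when expanded,
    like opening a folder in the Finder:

        #include <string>
        #include <vector>

        struct LazyXmlNode {
            std::string name;     // element name, read when the node is made
            long fileOffset = 0;  // where this element starts on disk
            bool childrenLoaded = false;
            std::vector<LazyXmlNode> children;

            void expand() {
                if (childrenLoaded) return;
                // Stand-in: re-open the file at fileOffset and parse just
                // this element's direct children into `children`.
                childrenLoaded = true;
            }

            void collapse() {
                // Throw the children away; they can always be re-read, so
                // RAM tracks what is visible, not the size of the file.
                children.clear();
                childrenLoaded = false;
            }
        };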

    Reading the data in from the hard disk, without putting the whole file
    in RAM, is another problem. I'm not sure how to do it, really... I
    think with XML it's not so big a problem. Everything is designed
    around tags. One tag itself is almost ALWAYS small (I've never seen
    even a 1KB tag). The bits in between tags may be very large; I'm sure
    there could be 1MB or more of text inside an element. Elements may
    contain more elements, but that's not a problem. So I guess I could
    break my problem down along those lines. It shouldn't be too hard.
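    Broken down along those lines, the reading loop might look roughly
    like this (illustrative C++, with stand-in handlers; not my parser):

        #include <cstddef>
        #include <cstdio>
        #include <string>

        static void handleTag(const std::string& tag) { (void)tag; }  // stand-in
        static void handleText(const char* text, std::size_t n)       // stand-in
        { (void)text; (void)n; }

        void scanXml(std::FILE* f)
        {
            std::string tag;
            char buf[4096]; // upper bound on RAM used for text between tags
            int c;
            while ((c = std::fgetc(f)) != EOF) {
                if (c == '<') {
                    // A whole tag is almost always small: buffer it whole.
                    tag.assign(1, '<');
                    while ((c = std::fgetc(f)) != EOF) {
                        tag += (char)c;
                        if (c == '>') break;
                    }
                    handleTag(tag);
                } else {
                    // Text between tags may be huge: hand it over in
                    // bounded chunks, never all at once.
                    std::size_t n = 0;
                    buf[n++] = (char)c;
                    while (n < sizeof(buf) &&
                           (c = std::fgetc(f)) != EOF && c != '<')
                        buf[n++] = (char)c;
                    handleText(buf, n);
                    if (c == '<') std::ungetc('<', f);
                }
            }
        }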

    Another question is how to store my XML objects... Should I put the
    data back into a Valentina-like database? Or read them directly from
    the file?

    __Is it worth it!!__

    At some point I have to ask: is it really worth it? Are being fast,
    and handling gigabyte XML, mutually compatible goals? Who is my
    target? What audiences are there, and which is best to choose?

    Perhaps people who want to edit gigabyte XML files are dreaming?
    Validating multi-gigabyte files can be done, as a separate code-base;
    that's no problem. But what if there is an error? Do I expect users to
    edit their gigabyte XML files with a text editor, or with my graphical
    editor?

    Maybe I should make a separate "large file mode"? That way, I can
    concentrate on the task at hand, and not try to make one thing be two
    things. I can refactor my existing code to handle parsing without too
    much of a problem at all. I'd just need to write another editing mode,
    that's all. Or maybe I'd just suggest they use a different text
    editor... I'm not sure, really, about trying to write a text editor
    that can handle gigabyte files!

    __Just thinking out loud__

    Once again, this is really just me thinking aloud. Even if no one
    answers, just writing this down, with the aim of having people
    understand it, helps me a lot in getting my thoughts straight!

    --
         Theodore H. Smith - Macintosh Consultant / Contractor.
         My website: <www.elfdata.com/>
    

