From: Theodore H. Smith (delete@elfdata.com)
Date: Mon Aug 11 2003 - 06:49:31 EDT
This is a really long email, I realised, so I'll section it.
Sorry about it being off topic. My ElfData plugin does use Unicode 
extensively, though. You can see for yourself at www.elfdata.com/plugin/ , 
and XML is Unicode also. I just thought I'd ask, seeing as IBM people are 
here, and they are usually good at large-scale solutions...
__aim summary__
I'm thinking about formalising the RAM management in my ElfData (string 
processing) plugin, to let it handle processing and writing files 
larger than what is kept in RAM. Let's say I want to process a 2GB file, 
but I want to do it with only 1MB allocated in RAM. That kind of thing.
__answers please!__
My question is: does anyone know of information about this kind of domain 
on the internet? Or has anyone dealt with this before?
We all know programs on OS 9 that can handle huge amounts of data 
with little RAM. They do this by keeping only a small piece of it in 
RAM at a time.
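
To make that concrete, here is a minimal sketch of the windowed-reading 
idea, written in C++ rather than RB; kWindowSize, ProcessChunk, and 
ScanLargeFile are hypothetical names for illustration, not anything from 
the plugin:

#include <cstddef>
#include <cstdio>
#include <vector>

const std::size_t kWindowSize = 1 << 20; // keep only ~1MB resident at a time

// Placeholder for whatever work is done on each piece of the file.
void ProcessChunk(const char* data, std::size_t length) {
    (void)data; (void)length;
}

void ScanLargeFile(const char* path) {
    std::FILE* f = std::fopen(path, "rb");
    if (!f) return;
    std::vector<char> window(kWindowSize);
    std::size_t got;
    // Each pass reuses the same window, so RAM use stays constant
    // no matter how big the file is.
    while ((got = std::fread(window.data(), 1, window.size(), f)) > 0)
        ProcessChunk(window.data(), got);
    std::fclose(f);
}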
__What should my plugin do, why, and what is "its place"__
Now, I've pretty much come to the conclusion that it's not my plugin's 
place to do the file system interaction or RAM management. I just 
can't make it do enough for everyone, or even for myself. It's usually 
better to let some code be good at one thing, instead of trying to be 
"everything to everyone".
However, I do think it's my plugin's place to offer hooks, to let the 
RB code do the RAM management.
Currently, I have a kind of crude setup for managing "families", which 
is a consequence of splitting strings by reference (instead of 
splitting by copy, which is what RB does).
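
To show what I mean by the two kinds of splitting, here is a minimal 
sketch assuming a shared backing buffer; ElfSlice, Mid, and MidCopy are 
hypothetical stand-ins, not the plugin's real layout:

#include <cstddef>
#include <memory>
#include <string>

struct ElfSlice {
    std::shared_ptr<std::string> buffer; // the shared "family" buffer
    std::size_t offset;
    std::size_t length;

    // Split by reference: the new slice points into the SAME buffer, so
    // no bytes are copied -- but the whole buffer stays alive while any
    // slice from the family still references it.
    ElfSlice Mid(std::size_t start, std::size_t len) const {
        return ElfSlice{buffer, offset + start, len};
    }

    // Split by copy (the RB way): the new string owns its own bytes.
    std::string MidCopy(std::size_t start, std::size_t len) const {
        return buffer->substr(offset + start, len);
    }
};

The catch with Mid is exactly the "family" problem: the whole buffer 
stays allocated while any slice still points into it.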
I'm wondering if anyone has any kind of advice here? Or even some 
thoughts about how it should be done?
__Here are some design requirements:__
* Would like to make my XML Engine capable of processing gigabytes of 
XML, with only a few MB of RAM allocated.
* Would like to make my XML Editor do the same, except that it also 
will need to display a tree structure in its editor.
* Obviously, my ElfData string processing plugin will need some 
improvements to allow this.
* I don't want to slow down my ElfData plugin noticeably, if possible, 
or make it overly complex.
This has been done before! I am sure that programs like gcc and 
CodeWarrior compile files larger than can be stored in RAM, 
right??? Or am I wrong?
__Speculative solutions__
Here are my thoughts on the issue: some tentative design speculations, 
probably incomplete and in need of refinement.
So, basically, it's about resources. Resources need to be formalised. 
Instead of just having some loosely defined system of allocating data 
and disposing of it when there are no references, I should formalise my 
access to resources. Currently, it's hard to get an idea of what is 
allocated where. A one-byte reference to an ElfData object can stop a 
multi-megabyte block of RAM from being deallocated.
Let's say I have a resource that's a 2GB XML file. I want to do stuff to 
this XML file:
1) Parse, and validate well-formedness
2) Validate via DTD or schema
3) Manipulate, and view graphically
4) Edit as a text file
5) Save back to the hard disk
Currently, I have two options: RB's way (split by copy) and my way 
(split by reference). The ElfData plugin can split by copy also... I 
just want to avoid this.
__Managing RAM storage and disposal__
Here's a hypothetical design flaw. I don't know of a real design flaw 
like this (because if I did, I'd fix it), but I want my app to manage 
even the cases where I do miss a piece here and there...
Now, let's say I store something from a file somewhere in my app. I'll 
say my design flaw was copying to an internal clipboard. So now, if the 
user closes the file, the 2GB of data will still be allocated, because 
of one reference to a tiny part of it.
I'm thinking that the best way to deal with that is by making some 
kind of "ElfDataResource" class. This one would manage splitting, and 
save references to ElfData objects. So basically, each time I split, 
I append a reference to that object in my ElfDataResource (EDR) class. 
Then, when my ElfData is disposed, it should tell the EDR class to remove 
its reference. Doing that fast enough to not interfere with 
splitting is already a bit of a technical problem.
When the EDR class is told to close the resource, it should update all 
the ElfData objects that haven't been disposed yet, to contain a 
copy of the data instead of the original. The EDR class should also 
give me statistics on RAM management. That can really help.
That's what I mean by "formalising" my RAM usage.
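
As a rough illustration of that EDR idea, here is a minimal sketch in 
C++; Resource, Slice, Split, Detach, and Close are all hypothetical 
names, and std::shared_ptr stands in for whatever reference counting the 
plugin really uses:

#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Slice {
    std::shared_ptr<std::string> buffer;
    std::size_t offset = 0, length = 0;

    // Detach: swap the reference to the big buffer for a private copy of
    // just these bytes, so the big buffer no longer has to stay alive.
    void Detach() {
        buffer = std::make_shared<std::string>(buffer->substr(offset, length));
        offset = 0;
    }
};

class Resource {
    std::shared_ptr<std::string> data_;
    std::vector<std::weak_ptr<Slice>> live_; // slices split from this resource
public:
    explicit Resource(std::string bytes)
        : data_(std::make_shared<std::string>(std::move(bytes))) {}

    // Split by reference, and register the slice so Close() can find it.
    std::shared_ptr<Slice> Split(std::size_t off, std::size_t len) {
        auto s = std::make_shared<Slice>(Slice{data_, off, len});
        live_.push_back(s);
        return s;
    }

    // Close: any slice still alive gets its own private copy, and then
    // the big backing buffer can finally be deallocated.
    void Close() {
        for (auto& w : live_)
            if (auto s = w.lock())
                s->Detach();
        live_.clear();
        data_.reset();
    }
};

Note that using weak references here means a disposed slice never has to 
eagerly unregister itself, which sidesteps the "fast enough to not 
interfere with splitting" worry, at the cost of skipping dead entries at 
Close() time.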
But that still doesn't deal with how to process only a tiny part 
of my data in RAM, and keep most of it on disk...
__Speculative: My own application's/library's practical application__
I'm not sure how best to go about this. My guess is this is more of a 
"database" kind of thing, i.e., fast access to a disk. Let's say I wanted 
to use something like Valentina (rumoured to be very fast) to do most 
of my work, so that I don't have to reinvent the wheel. That would only 
be for stuff like storing my graphical editor's interface, mind, not 
the text. Another idea might be to have a different kind of paradigm 
for browsing my XML. Maybe something more like the filesystem? The 
Mac's file system doesn't need to hold a whole hard disk in RAM to 
let me navigate its files, so maybe my XML could take a similar 
approach? It's an idea...
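
If I did go the filesystem-like route, it might look something like this 
minimal sketch: a tree node that only parses its children from disk when 
it is first expanded. Node, Expand, and the offset bookkeeping are 
hypothetical, and the actual parsing is elided:

#include <cstdio>
#include <memory>
#include <string>
#include <vector>

struct Node {
    std::string name;
    long fileOffset = 0;  // where this element's body starts on disk
    bool expanded = false;
    std::vector<std::unique_ptr<Node>> children;

    // Expand on demand, like opening a folder: parse ONLY the immediate
    // children and record their offsets; grandchildren stay on disk.
    void Expand(std::FILE* f) {
        if (expanded) return;
        expanded = true;
        std::fseek(f, fileOffset, SEEK_SET);
        // ... parse one level of child elements into `children` ...
    }
};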
For reading the data in from the hard disk without putting the whole 
file in RAM, that's another problem. I'm not sure how to do it, 
really... I think with XML, it's not so big a problem. Everything is 
designed with tags. One tag itself is almost ALWAYS small (I've never 
seen even a 1KB tag). The bits in between tags may be very large; I'm 
sure there could be 1MB or more of text inside an element. Elements may 
contain more elements, but that's not a problem. So I guess I could 
break down my problem along those lines. It shouldn't be too hard.
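
Breaking the problem down along those lines might look like this minimal 
sketch: buffer a tag whole (since tags are small), but stream the 
possibly huge text between tags out in pieces. OnTag and OnText are 
hypothetical callbacks, and a real scanner would also have to deal with 
comments, CDATA, and quoted characters:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder handlers: here they just echo what was scanned.
void OnTag(const std::string& tag) {
    std::fwrite(tag.data(), 1, tag.size(), stdout);
}
void OnText(const char* text, std::size_t len) {
    std::fwrite(text, 1, len, stdout);
}

void ScanXML(std::FILE* f) {
    std::vector<char> chunk(1 << 20); // 1MB window of the file
    std::string tag;   // small carry buffer, used only while inside a tag
    bool inTag = false;
    std::size_t got;
    while ((got = std::fread(chunk.data(), 1, chunk.size(), f)) > 0) {
        std::size_t textStart = 0;
        for (std::size_t i = 0; i < got; ++i) {
            if (!inTag && chunk[i] == '<') {
                if (i > textStart) // flush the text seen so far, piecewise
                    OnText(chunk.data() + textStart, i - textStart);
                inTag = true;
                tag = "<";
            } else if (inTag) {
                tag += chunk[i];
                if (chunk[i] == '>') { // tag complete, even across chunks
                    OnTag(tag);
                    inTag = false;
                    textStart = i + 1;
                }
            }
        }
        // Flush any trailing text in this chunk; an unfinished tag simply
        // stays in `tag` and continues into the next chunk.
        if (!inTag && textStart < got)
            OnText(chunk.data() + textStart, got - textStart);
    }
}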
Another requirement is how to store my XML objects... Should I put the 
data back into a Valentina-like database? Or read them directly from 
the file?
__Is it worth it!!__
At some point I have to think: is it really worth it? Are being fast 
and processing gigabyte XML mutually compatible goals? Who is 
my target? What audiences are there, and which is best to choose?
Perhaps people who want to edit gigabyte XML files are dreaming? 
Validating multi-gigabyte files can be done as a separate 
code-base; that's no problem. But what if there is an error? Do I 
expect users to edit their gigabyte XML files with a text editor, or 
with my graphical editor?
Maybe I should make a separate "large file mode"? That way, I can 
concentrate on the task at hand, and not try to make one thing be two 
things. I can refactor my existing code to handle parsing without too 
much of a problem at all. I'd just need to write another editing mode, 
that's all. Or maybe just suggest they use a different text editor... 
I'm not sure, really, about trying to write a text editor that can handle 
gigabyte files!
__Just thinking out loud__
Once again, this is really just me thinking aloud. Even if no one 
answers, just writing this out, aiming for people to understand it, 
helps me a lot in getting my thoughts straight!
--
     Theodore H. Smith - Macintosh Consultant / Contractor.
     My website: <www.elfdata.com/>