[off] XML. And RAM

From: Theodore H. Smith (delete@elfdata.com)
Date: Mon Aug 11 2003 - 06:49:31 EDT

    This is a really long email, I realised, so I've broken it into sections.

    Sorry about it being off topic. My ElfData plugin does make extensive
    use of Unicode, though. You can see for yourself at
    www.elfdata.com/plugin/ , and XML is Unicode too. I just thought that,
    seeing as IBM people are here, and they're usually good at large-scale
    solutions, it was worth asking...

    __aim summary__

    I'm thinking about formalising the RAM management in my ElfData
    (string processing) plugin, to let it process, and write out, files
    larger than what is kept in RAM. Let's say I want to process a 2GB
    file, but I want to do it with only 1MB allocated in RAM. That kind of
    thing.

    __answers please!__

    My question is: does anyone know of information on this kind of
    problem domain, on the internet? Or has anyone dealt with this before?

    We all know of programs on OS 9 that can handle huge amounts of data
    with little RAM. They do this by only keeping a small piece of the
    data in RAM at a time.
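    Roughly, something like this, in C++ (the names and the stub are made
    up just to illustrate the windowing idea, they aren't my plugin's
    API):

        #include <cstddef>
        #include <cstdio>
        #include <vector>

        // Stand-in for the real work; it only looks at the bytes given.
        static void processChunk(const char* bytes, std::size_t count)
        {
            (void)bytes;
            (void)count;
        }

        // Process a file of any size within a fixed RAM budget, by
        // holding only one window of it in memory at a time.
        void processInWindows(const char* path, std::size_t windowSize)
        {
            std::vector<char> window(windowSize); // e.g. 1MB for a 2GB file
            std::FILE* f = std::fopen(path, "rb");
            if (!f) return;

            std::size_t got;
            while ((got = std::fread(window.data(), 1, window.size(), f)) > 0)
                processChunk(window.data(), got); // never more than windowSize resident

            std::fclose(f);
        }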

    __What should my plugin do, why, and what is "its place"__

    Now, I've pretty much come to the conclusion that it's not my plugin's
    place to do the file system interaction, or the RAM management. I just
    can't make it do enough for everyone, or even for myself. It's usually
    better to let some code be good at one thing, instead of trying to be
    "everything to everyone".

    However, I do think it's my plugin's place to offer hooks that let the
    RB code do the RAM management.
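    Something with the shape of this, maybe (made-up names, just to show
    what I mean by hooks):

        #include <cstddef>

        // The host (the RB code) supplies these callbacks, and the plugin
        // routes its memory decisions through them instead of deciding
        // for itself what stays in RAM.
        struct ElfDataMemoryHooks {
            void* (*acquire)(std::size_t bytes, void* hostContext); // host allocates, or pages in
            void  (*release)(void* block, void* hostContext);       // host frees, or pages out
            void  (*lowMemory)(std::size_t bytesNeeded, void* hostContext); // host can purge caches
            void* hostContext; // opaque pointer handed back to every callback
        };

    That way the plugin stays good at one thing (strings), and whoever
    embeds it decides what lives in RAM and what goes back to disk.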

    Currently, I have a kind of crude setup for managing "families", which
    is a consequence of splitting strings by reference (instead of
    splitting by copy, which is what RB does).
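    For anyone who hasn't seen the idea, split-by-reference looks roughly
    like this (illustrative C++ only, not the actual ElfData internals):

        #include <cstddef>
        #include <memory>
        #include <string>
        #include <vector>

        // A piece is just an offset/length into a shared parent buffer.
        struct Slice {
            std::shared_ptr<const std::string> parent; // keeps the buffer alive
            std::size_t offset = 0;
            std::size_t length = 0;
        };

        std::vector<Slice> splitByReference(
            std::shared_ptr<const std::string> text, char sep)
        {
            std::vector<Slice> pieces;
            std::size_t start = 0;
            for (std::size_t i = 0; i <= text->size(); ++i) {
                if (i == text->size() || (*text)[i] == sep) {
                    pieces.push_back({text, start, i - start});
                    start = i + 1;
                }
            }
            // Every Slice shares ownership of the parent: that is the
            // "family". One surviving one-byte Slice pins the whole buffer.
            return pieces;
        }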

    I'm wondering if anyone has any kind of advice here? Or even some
    thoughts about how it should be done?

    __Here are some design requirements:__

    * I'd like to make my XML Engine capable of processing gigabytes of
    XML, with only a few MB of RAM allocated.
    * I'd like to make my XML Editor do the same, except that it also
    needs to display a tree structure in its editor.
    * Obviously, my ElfData string processing plugin will need some
    improvements to allow this.
    * I don't want to slow down my ElfData plugin noticeably, if possible,
    or make it over-complex.

    This has been done before! I'm sure that programs like gcc and
    CodeWarrior compile files larger than can be stored in RAM, right? Or
    am I wrong?

    __Speculative solutions__

    Here are my thoughts on the issue, some tentative design speculations.
    They're probably incomplete and will need refinement.

    So, basically, it's about resources. Resources need to be formalised.
    Instead of just having some loosely defined system of allocating data
    and disposing of it when there are no references, I should formalise
    my access to resources. Currently, it's hard to get an idea of what is
    allocated where. A one-byte reference to an ElfData object can stop a
    multi-megabyte block of RAM from being deallocated.

    Let's say I have a resource that's a 2GB XML file. I want to do stuff
    to this XML file:

    1) Parse, and validate well-formedness
    2) Validate via DTD or schema
    3) Manipulate, and view graphically
    4) Edit as a text file
    5) Save back to the hard disk

    Currently, I have two options: RB's way (split by copy), and my way
    (split by reference). The ElfData plugin can split by copy too... I
    just want to avoid that.

    __Managing RAM storage and disposal__

    Here's a hypothetical design flaw. I don't know of a real design flaw
    like this, because if I did, I'd fix it, but I want my app to manage
    even the cases where I do miss a piece here and there...

    Now, let's say I store something from a file somewhere in my app. Say
    my design flaw was copying to an internal clipboard. Now, if the user
    closes the file, the 2GB of data will still be allocated, because of
    one reference to a tiny part of it.

    I'm thinking that the best way to deal with that is by making some
    kind of "ElfDataResource" class. This one would manage splitting, and
    keep references to ElfData objects. So basically, each time I split, I
    append a reference to that object in my ElfDataResource (EDR) class.
    Then, when an ElfData object is disposed, it should tell the EDR class
    to remove its reference. Doing that fast enough not to interfere with
    splitting is already a bit of a technical problem.

    When the EDR class is told to close the resource, it should update all
    the ElfData objects that haven't yet been disposed, so that they
    contain a copy of their data instead of a reference to the original.
    The EDR class should also give me statistics on RAM management. That
    could really help.

    That's what I mean by "formalising" my RAM usage.
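    In rough C++ terms, something like this (a sketch with made-up names;
    the real ElfData objects are more involved):

        #include <cstddef>
        #include <memory>
        #include <set>
        #include <string>

        struct ElfDataObject {
            std::shared_ptr<const std::string> parent; // shared source buffer
            std::size_t offset = 0, length = 0;
            std::string ownCopy; // used after the resource is closed

            const char* bytes() const {
                return parent ? parent->data() + offset : ownCopy.data();
            }
        };

        class ElfDataResource {
        public:
            explicit ElfDataResource(std::shared_ptr<const std::string> source)
                : source_(std::move(source)) {}

            // Called on every split: remember who references this resource.
            // (A std::set keeps the sketch simple; an intrusive linked list
            // in each object would make this O(1), which matters if
            // splitting is hot.)
            void registerObject(ElfDataObject* obj) { objects_.insert(obj); }

            // Called when an ElfData object is disposed.
            void unregisterObject(ElfDataObject* obj) { objects_.erase(obj); }

            // Closing the resource: survivors get private copies, so the
            // big source buffer can finally be deallocated.
            void close() {
                for (ElfDataObject* obj : objects_) {
                    obj->ownCopy.assign(source_->data() + obj->offset,
                                        obj->length);
                    obj->parent.reset();
                }
                objects_.clear();
                source_.reset();
            }

            // The statistics part: how many live references pin this
            // resource right now?
            std::size_t liveReferences() const { return objects_.size(); }

        private:
            std::shared_ptr<const std::string> source_;
            std::set<ElfDataObject*> objects_;
        };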

    But then, that still doesn't deal with how to process only a tiny
    part of my data in RAM, keeping most of it on disk...

    __Speculative: My own application's/library's practical application__

    I'm not sure how best to go about this. My guess is this is more of a
    "database" kind of thing, i.e. fast access to a disk. Let's say I
    wanted to use something like Valentina (rumoured to be very fast) to
    do most of my work, so that I don't have to reinvent the wheel. That
    would only be for stuff like storing my graphical editor's interface,
    mind, not the text.

    Another idea might be to have a different kind of paradigm for
    browsing my XML. Maybe something more like the filesystem? The Mac's
    file system doesn't need to hold a whole hard disk in RAM to let me
    navigate its files, so maybe my XML could take a similar approach?
    It's an idea...
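    As a sketch of what I mean (names invented, and the parsing is left as
    a stand-in), a tree node could load its children only when expanded,
    like opening a folder in the Finder:

        #include <string>
        #include <vector>

        struct LazyXmlNode {
            std::string name;     // element name, read when the node is made
            long fileOffset = 0;  // where this element starts on disk
            bool childrenLoaded = false;
            std::vector<LazyXmlNode> children;

            void expand() {
                if (childrenLoaded) return;
                // Stand-in: re-open the file at fileOffset and parse just
                // this element's direct children into `children`.
                childrenLoaded = true;
            }

            void collapse() {
                // Throw the children away; they can always be re-read, so
                // RAM tracks what is visible, not the size of the file.
                children.clear();
                childrenLoaded = false;
            }
        };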

    Reading the data in from the hard disk, without putting the whole file
    in RAM, is another problem. I'm not sure how to do it, really... I
    think with XML it's not so big a problem. Everything is designed
    around tags. One tag itself is almost ALWAYS small (I've never seen
    even a 1KB tag). The bits in between tags may be very large; I'm sure
    there could be 1MB or more of text inside an element. Elements may
    contain more elements, but that's not a problem. So I guess I could
    break my problem down along those lines. It shouldn't be too hard.
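    Broken down along those lines, the reading loop might look roughly
    like this (illustrative C++, with stand-in handlers; not my parser):

        #include <cstddef>
        #include <cstdio>
        #include <string>

        static void handleTag(const std::string& tag) { (void)tag; }  // stand-in
        static void handleText(const char* text, std::size_t n)       // stand-in
        { (void)text; (void)n; }

        void scanXml(std::FILE* f)
        {
            std::string tag;
            char buf[4096]; // upper bound on RAM used for text between tags
            int c;
            while ((c = std::fgetc(f)) != EOF) {
                if (c == '<') {
                    // A whole tag is almost always small: buffer it whole.
                    tag.assign(1, '<');
                    while ((c = std::fgetc(f)) != EOF) {
                        tag += (char)c;
                        if (c == '>') break;
                    }
                    handleTag(tag);
                } else {
                    // Text between tags may be huge: hand it over in
                    // bounded chunks, never all at once.
                    std::size_t n = 0;
                    buf[n++] = (char)c;
                    while (n < sizeof(buf) &&
                           (c = std::fgetc(f)) != EOF && c != '<')
                        buf[n++] = (char)c;
                    handleText(buf, n);
                    if (c == '<') std::ungetc('<', f);
                }
            }
        }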

    Another question is how to store my XML objects... Should I put the
    data back into a Valentina-like database? Or read them directly from
    the file?

    __Is it worth it!!__

    At some point I have to ask: is it really worth it? Are being fast,
    and handling gigabyte XML, mutually compatible goals? Who is my
    target? What audiences are there, and which is best to choose?

    Perhaps people who want to edit gigabyte XML files are dreaming?
    Validating multi-gigabyte files can be done, as a separate code-base;
    that's no problem. But what if there is an error? Do I expect users to
    edit their gigabyte XML files with a text editor, or with my graphical
    editor?

    Maybe I should make a separate "large file mode"? That way, I can
    concentrate on the task at hand, and not try to make one thing be two
    things. I can refactor my existing code to handle parsing without too
    much of a problem at all. I'd just need to write another editing mode,
    that's all. Or maybe I'd just suggest they use a different text
    editor... I'm not sure, really, about trying to write a text editor
    that can handle gigabyte files!

    __Just thinking out loud__

    Once again, this is really just me thinking aloud. Even if no one
    answers, just writing this down, with the aim of having people
    understand it, helps me a lot in getting my thoughts straight!

    --
         Theodore H. Smith - Macintosh Consultant / Contractor.
         My website: <www.elfdata.com/>
    

