Storage by the TUNES LLL

[ Up | LLL | Sitemap | CLiki ]

The current specific mappings being worked on are for the i386 and O'TOP subprojects.

Goals and Requirements

Storage in a high-level extensible distributed persistent system is a very tricky problem: This problem is actually a specific instance of the problems tackled in the Migration subproject. But here we document the choices made in our LLL.

Rough Draft

Persitency is really a tough thing to do well. Because we want secure, resiliant persistency, this means we have to synchronize objects: if objects A and B depend on each other, matching versions must be saved before together so the system can reload from store in case of crash. This may look simple, because it is trivial on a single-threaded machine. Now, TUNES is a distributed, parallel system, and this makes it quite harder.

Because we want the system's distribution to be as large as possible, the simple algorithm "synchronize everything to a central clock" is not feasible: there would always be a crash somewhere before the whole world synchronizes. Such an algorithm is always local. Thus we shall use such perfect synchronization when we know it's possible for some local set of synchronizing objects. For larger sets of objects, we must use conservative means.

Actually, this problem is exactly a garbage collecting problem. Just that objects are (logically) version-tagged so that indeed we are garbage collecting a view in a space of constants.

Storage in a high-level extensible distributed persistent system is a very tricky problem: This problem is actually a specific instance of the problems tackled in the Migration subproject. But here we document the choices made in our LLL.

To begin with, we'll use memory zones loaded at constant place, e.g. a 512MB virtual zone @ 0x40000000 on i386 and OTOP/i386 systems (may vary on other systems). Garbage collection will be very simple, with a unique heap of objects with fixed encoding, with a simple escape mechanism: integers and various kinds of pointers will be differentiated from their lowest and highest bits; to check the "type" of a cell relatively to the gc, you just need check parity, sign, and overflow (shifting left and right could be usual ways to adjust pointers). An overflow manager could resolve special pointers.

To make things simple, we'll use cooperative scheduling, using the stack pointer as a heap pointer, with the convention that at schedule time, the stack is clean and well-framed, and consistent for other threads to use. Before to allocate more than X pages of heap memory, it would be the responsibility of the caller to touch pages down there. Or more probably, this would be done by the allocating routine. Half the (real or virtual) memory will be used by the heap. Once the half limit is almost reached, we use some stop and copy GC method. Another, non-gc heap grows on the other side for real-time and special

For this simple implementation, we shouldn't rely strongly on memory being virtual, just offer some small optimizations when possible (e.g. remapping everything back to the initial address instead of moving the logical address or copying the physical data). This way, we are much more portable; and posixish virtual memory sucks anyway: no efficient mmaping around without large tables, slow syscall for each operation, slow non-standard SIGTRAP recovery, etc. We can always enhance later.

When TUNES is fully bootstrapped, we can meta-encode everything in nicer and more efficient or portable ways. Meta-encode means we can find generic ways to encode, and try them in many combinations, until we find which suits our application best.

Because objects may be migrated, it is most important to keep a canonical representation of objects, or be able to generate one at any moment. Of course, the first and most simple way will be used for the moment. Particularly, we must distinguish in an object which links are essential, and which link are incidental. For example, and characteristically, when I consider a unique object that I visualize, the visualizing device is not essential, and may be migrated; but particularly if the object contains real-time animations (e.g. video game), I do want it to be specialized for the particular current device, to achieve respectable speed; only I want this specialization to be reversible.

Synchronization will be done by actually creating a new object for each version, and remembering it until we're sure a "better" object is known (that is, an object more evaluated, and fully synchronized). The algorithm is simple: each group of co-synchronizing object chooses a "leader" as the object with most simple ID (there must be world-wide conventions for such a criterion).

One way to do things would be to use two mmap() files instead than one. That would be neater but much slower, as we'd have to accept a multi-MB copying delay at checkpoints, because files can't share blocks, and blocks cannot be moved along a file's block mappings.

So the right way to do things is to maintain ourselves a list of mappings between blocks and memory, not really using the underlying system's virtual memory thingy if any (well, it's used transparently so a TUNES process can coexist with other processes). It seems that somehow, we'll reimplement virtual memory ourselves.

Then, two files is only a way to trade efficiency for simplicity, and if we're doing things seriously, we'll have to maintain a list of block<->memory mappings, and because of POSIX deficiencies, we'll use read() and write() instead of mmap() (which isn't fine-grained), and we won't be able to share pages (at least, we'll need to run even if sharing is not available).

Sketch for the initial implementation

Garbage Collection



OTOP tricks for persistence:

Multithreading and Locking

Resource-tracking and quotas

To Do

This document last modified on Sunday, 29-Oct-2006 13:02:08 PST. See the Changelog