Storage by the TUNES LLL

The current specific mappings being worked on are for the i386 and O'TOP subprojects.

Goals and Requirements

Storage in a high-level extensible distributed persistent system is a very tricky problem: This problem is actually a specific instance of the problems tackled in the Migration subproject. But here we document the choices made in our LLL.

Rough Draft

Persitency is really a tough thing to do well. Because we want secure, resiliant persistency, this means we have to synchronize objects: if objects A and B depend on each other, matching versions must be saved before together so the system can reload from store in case of crash. This may look simple, because it is trivial on a single-threaded machine. Now, TUNES is a distributed, parallel system, and this makes it quite harder.

Because we want the system's distribution to be as large as possible, the simple algorithm "synchronize everything to a central clock" is not feasible: there would always be a crash somewhere before the whole world synchronizes. Such an algorithm is always local. Thus we shall use such perfect synchronization when we know it's possible for some local set of synchronizing objects. For larger sets of objects, we must use conservative means.

Actually, this problem is exactly a garbage collecting problem. Just that objects are (logically) version-tagged so that indeed we are garbage collecting a view in a space of constants.

To begin with, we'll use memory zones loaded at constant place, e.g. a 512MB virtual zone @ 0x40000000 on i386 and OTOP/i386 systems (may vary on other systems). Garbage collection will be very simple, with a unique heap of objects with fixed encoding, with a simple escape mechanism: integers and various kinds of pointers will be differentiated from their lowest and highest bits; to check the "type" of a cell relatively to the gc, you just need check parity, sign, and overflow (shifting left and right could be usual ways to adjust pointers). An overflow manager could resolve special pointers.

To make things simple, we'll use cooperative scheduling, using the stack pointer as a heap pointer, with the convention that at schedule time, the stack is clean and well-framed, and consistent for other threads to use. Before to allocate more than X pages of heap memory, it would be the responsibility of the caller to touch pages down there. Or more probably, this would be done by the allocating routine. Half the (real or virtual) memory will be used by the heap. Once the half limit is almost reached, we use some stop and copy GC method. Another, non-gc heap grows on the other side for real-time and special

For this simple implementation, we shouldn't rely strongly on memory being virtual, just offer some small optimizations when possible (e.g. remapping everything back to the initial address instead of moving the logical address or copying the physical data). This way, we are much more portable; and posixish virtual memory sucks anyway: no efficient mmaping around without large tables, slow syscall for each operation, slow non-standard SIGTRAP recovery, etc. We can always enhance later.

When TUNES is fully bootstrapped, we can meta-encode everything in nicer and more efficient or portable ways. Meta-encode means we can find generic ways to encode, and try them in many combinations, until we find which suits our application best.

Because objects may be migrated, it is most important to keep a canonical representation of objects, or be able to generate one at any moment. Of course, the first and most simple way will be used for the moment. Particularly, we must distinguish in an object which links are essential, and which link are incidental. For example, and characteristically, when I consider a unique object that I visualize, the visualizing device is not essential, and may be migrated; but particularly if the object contains real-time animations (e.g. video game), I do want it to be specialized for the particular current device, to achieve respectable speed; only I want this specialization to be reversible.

Synchronization will be done by actually creating a new object for each version, and remembering it until we're sure a "better" object is known (that is, an object more evaluated, and fully synchronized). The algorithm is simple: each group of co-synchronizing object chooses a "leader" as the object with most simple ID (there must be world-wide conventions for such a criterion).

One way to do things would be to use two mmap() files instead than one. That would be neater but much slower, as we'd have to accept a multi-MB copying delay at checkpoints, because files can't share blocks, and blocks cannot be moved along a file's block mappings.

So the right way to do things is to maintain ourselves a list of mappings between blocks and memory, not really using the underlying system's virtual memory thingy if any (well, it's used transparently so a TUNES process can coexist with other processes). It seems that somehow, we'll reimplement virtual memory ourselves.

Then, two files is only a way to trade efficiency for simplicity, and if we're doing things seriously, we'll have to maintain a list of block<->memory mappings, and because of POSIX deficiencies, we'll use read() and write() instead of mmap() (which isn't fine-grained), and we won't be able to share pages (at least, we'll need to run even if sharing is not available).

Sketch for the initial implementation

Garbage Collection

Objects will be grouped by page-aligned "segments" of similarly sized objects. rounding the size up to keep only 2-5 significant bits, as a compromise to limit the space wasted as padding while not overmultiplying the number of groups, which would increase both lookup time and swiss-cheese syndrom.
Orthogonally to size grouping, objects will be grouped as being read-only or not, linear or not, containing pointers or not, having a destructor or not. Some of these attributes might be perobject meta information instead.
Grouping objects is believed to solve most swiss-cheese syndrom related problems without the need to compact memory. Still, compacting can be done during "major GCs", once a day.
Meta-data can be kept as one byte per object, unless we choose treadmill like methods.
Meta-information might be kept offpage, apart from data, so as not to uselessly fill cache with random data during GC, depending on the the nature of the objects; this is particularly effective when people allocate whole pages, btw. It can be
A "first generation" heap will be done as a stack, perhaps using the hbaker92CONS LazyAlloc paper ideas, or just the standard Appel thingy. Hence, allocation of short-lived objects will be fast, and can be done purely on registers, without going through all the hassle implied by above plans that needs lots of metadata memory access.
We could require all objects to be read-only and/or linear, but for a special reference type; this would ease object many things (object identity, write barrier, and many other things).
Particularly, back-pointers from non-linear objects to linear ones can be done only through a special set of such reference objects anyway.

Notes:

destructors, weak pointers, migration handlers, are all particular cases of special semantics to execute at special memory management events, like reclaiming of object space, writing the checkpoint, restoring from checkpoint, migrating in or out. They should be provided a uniform interface, but I don't see how it can be made but in ad-hoc ways.
A copying GC would color destructible objects, and activate destructors if they were not triggered since one flip/run/etc.
Generic functions to access objects might be very costly, due to various read or write barriers. However, low-level code can be specialized to jump over the barrier, or pay part of the fee only, when the high-level language (or low-level hacker) could prove that the (whole) barrier isn't needed.
There are lots of things whose average case is good while the worst case wastes a lot of space. Such space should be *reserved* in virtual memory, so as to ensure the system won't crash even on worst-case conditions; but the system will be tuned to work best on average conditions.
the exact details of the above plan, as suit most our programs, can only be determined through experimentation. Particularly note that it is independent from the GC method being used (mark&sweep, mark&don't sweep, stop©, hbaker rtgc, etc).
The format for the stack (==first generation) could very well be used as a portable format for manipulating objects, in association with a table for objects violating the well-ordering. Or perhaps doing like postscript and having a stack-language to generate would suffice; this language would be devoid of any loop (but the structures it creates can be recursive), and have drastic limitations as for ways to create violations of the stack order.

Persistence

Persistence can be done at the page level.
Checkpoints will be triggered manually or by the clock; when triggered, a checkpoint will wait for next minor GC. a checkpoint can also be forced, in which case it will be done at next safe point (that is, almost in real-time), with or without triggering GC; a timeout will transform triggered checkpoint into forced checkpoint, perhaps by forcing a GC.
A checkpoint logs the modified pages; it first saves the metadata for the pages, then writes a the contents of the pages compressed with an appropriate algorithm, that uses the metadata as a hint, but doesn't waste too much time at that, either, so that checkpointing performance stays disk-driven.
If the above code is well written, a major GC could be achieved by restoring a checkpoint just after saving it!!!
Checkpointing can be done concurrently with running if we can control paging and hide the pages with copy-on-write. OTOP checkpointing must block the process :( :( :(

OTOP tricks for persistence:

If system dirty page bits are not available (e.g. OTOP), then they should be manually emulated by software, with a write barrier :( :(
BEFORE a checkpoint, the previous checkpoint will be committed in case it hasn't been yet. If it wasn't committed, then the checkpoint before it was still valid, so everything's fine. This allows the process to continue to run after checkpoint writes were scheduled, unless we *really* need to stop.
The problem is that we need two fsync() to commit the changes: one for all the data, one for atomically committing the super-block after everything's done. What we may do, if not satisfied with the previous checkpoint, is to schedule the fsync() after we let some time to the unix kernel, so that we can hope that fsync() won't stop the process.

Multithreading and Locking

Thread states are cheap continuations.
A lock is actually a unique resource server thread that executes routines that are given to it.
Such locks as higher-order functions provide a clean semantic framework to solve problems of "priority inversion" in threads.

Resource-tracking and quotas

No need to have multiple physical address spaces to achieve multiple logical address spaces. To achieve quotas in resource usage, you just need keep track of how much resources are used by the current resource user(s).
An efficient implementation would:
- keep track of usage synchronously for seldom used resources;
- just update global counters at user-switch time for resources that are constantly evolving
- When there are recursive users, either they might all be updated at user-switch time, or only the deepest users are, and the counters for the others are (implicitly or explicitly) invalidated.
All this can be done by reflectively modifying the GC/threading code, without any special OS support besides generic reflection.

To Do

Do implement.
Document the current implementation.
Think about the the compatibility of real-time objects and securely synchronized objects: real-time generally means that we use fixed-size buffers that we update in place. But then, we must keep a synchronized version (perhaps several of them), so we must have multiple copies of the object, and be sure that copying is atomic (*ouch*).
Point to the GC-FAQ, etc (see page Review/Languages.html)