An *(essay) by _(Fare) covering an area of the _(LLL concepts| http://tunes.org/new/LLL/).

In traditional systems (as well as in emulation boxes for them), all software is so completely and deeply unsecure that only uncooperative preemption is possible for the system not to crash too often (as examples of often-crashing cooperative non-preemptive system are Windows 3.1 and MacOS 6.x).

Now, TUNES is a secure system, and we can and shall use cooperation whenever possible (=almost always): we shall require standard compilers to generate cooperative code, that <tt>PAUSE</tt>s every now and then.

Of course, real-time threads need some kind of preemption,
and untrusted environments (which emulation boxes are) as well;
but as long as these use cannot access the preempted thread's data,
or any resource shared by the preempted thread
(like garbage collected memory), everything is fine.

Actually, threads are interrupted and signaled by more priviledged threads,
but not preempted by equally-priviledged threads.
(they might be preempted by foreign threads, to which they can't
compare priviledge).
This is why we rather call that interruptible cooperative threads
instead of preemptible threads.

Now, a common problem with cooperation is that <tt>PAUSE</tt>s are often too
irregular, and while having them too far away for each other yields poor
response time, having them to near yields too many context switches and
related overhead (even in the best case, lots of cache misses). We can
solve this problem by requiring some fine <tt>PAUSE</tt> resolution, but do actual context switch only when the timer says so (by modifying the code
of a <tt>PAUSE</tt> code to just a <tt>RET</tt>/<tt>NEXT</tt> until it's time).

This is perfect, but may slow tight loops significantly.
One way is to unroll tight loops enough so that this cost is reduced.
If this still isn't satisfactory,
a more elaborate way involves a "it's time to switch" signal.
Before to enter a tight loop, some "interrupt recovery" routine
is registered as the signal handler (and is unregistered afterwards);
if an interrupt happens while in the tight loop,
then the recovery routine is called,
which has the responsibility to put the system
in a state that follows the usual calling conventions.

Typically, the routine will continue normal execution of the loop
until some modified point which will jump back
into the remaining of the recovery routine,
which can finish the work
because it knows the exact point of execution in the loop
(which could have been at any point inside the loop).

This modified loop might be achieved through safely self-modifying code,
or safely runtime-generated code,
if the CPU architectures doesn't make it excessively costly
by e.g. requiring to flush all of a big instruction cache afterward;
in the case where expensive cache-flushing
makes runtime code-generation/modification too expensive,
a pre-modified copy of the code could be used,
which trades memory to gain speed
(usually, whoever can afford bloated hardware
with large instruction caches can afford some more memory, too,
and this only needs be done for tight loops).

I realize that there might be problems in some cases,
like page faults, when the recovery routine just can't complete
until the interrupt handler returns.
This means that either the handler (and all its dependencies)
must live in a lower-level space
and doesn't need the GC invariants provided by interruptible code,
or the recovery routine must be able to roll-back
rather than push-forward execution,
or critical routines must lock pages in before
to enter recoverable mode,
or be in the same page as the code being recovered
(so that whenever the latter is run present in-core, so is the former)
or any combination of the previous.

In all cases, this makes the interrupt handling more complicated,
because it has to determine the recovery routine and jump in it.
But that's no run-time performance problem
since interrupts are a most infrequent case
as compared to normal execution.
Because of the space/time cost of recovery routines,
they shall only be used for time-critical loops and recursive code,
where inserting <tt>PAUSE</tt>s would be too costly.
Whereever a tight loop involves a significant amount of CPU work,
the added cost for such a signal handler is negligible;
when the loop doesn't involve significant CPU work,
then it can be coded in a way that includes a check for context-swich time,
so recovery is just toggling a flag, and jumping back to the
unmodified original code, that will do everything perfectly.

This isn't a development-time hassle either,
because all this code can be automatically produced by the compiler.
Only the compiler-writer and low-level programmer have to worry about it
- and at worst, the latter could still resort to traditional
virtual machine protection, if it really were more suited
to their particular case.

Currently, even the Linux kernel uses
a restricted version of this technique,
developed independently by Linus since version 2.1.8:
Software checks for the validity of user-memory access from the kernel
are replaced by hardware check with the exception handler
using a simple code recovery technique.
And of course, most of the complex code is automatically generated
from a few high-level macros.

Machines with lots of registers or register banks
may have disjoint registers/banks for real-time and normal threads.

Note that this way to handle thread scheduling
doesn't interfere anyhow with critical real-time interrupt handling,
that wouldn't wait for the scheduled thread to recover,
but would take place before the recovery routine is called.
Of course this means that there must be multiple slightly-different
code-generation strategies, depending on the real-time nature
and other constraints of some code. But this is all to be handled
automatically -- including the automatic mirroring or migration
of library routines used in several different contexts.

Hence, there could be a hierarchy of threads,
some thread being able to preempt some other,
while these other threads cooperate with each other.
A thread could be made of subthreads,
that would cooperate with each other or not, etc.

Let us give a simple argument why some kind of cooperation or other
is <em>always</em> needed:
there are always lots of extensions and resources
that appear and disappear dynamically,
every thread using few of them.
A preempter that would pretend to keep track of them all
when preemptively saving the context
would either do lots of costly useless work
(e.g. saving the state of the graphics cards),
or fail to manage resources.
The preempted thread knows better what needs to be saved or not
("drop that screen -- I'll redraw it faster than you can dump&save it,
if I count all the page misses and swapping you'll obtain!").
Surely, lazy saving and shallow-binding techniques can (and should) be used
to reduce the overhead of thread switching.
But as in other domains where it is used,
lazy evaluation is known to be always less efficient
whenever the evaluation is know to be strictly required.
Shallow binding involves a permanent run-time overhead.
And all these techniques won't solve the problem of atomicity
with respect to GC operations, etc.

It could be said that the above design naturally stems
from the very first principle of cybernetics:
information is what allows to take better decisions;
hence decisions are better taken by those who do have information
(efficient state saving by threads themselves),
and the general system design should encourage
cooperation between agents and publication of information
over blind law enforcement among uncooperating agents.
