Preemption and CooperationAn essay by Fare covering an area of the LLL concepts.
In traditional systems (as well as in emulation boxes for them), all software is so completely and deeply unsecure that only uncooperative preemption is possible for the system not to crash too often (as examples of often-crashing cooperative non-preemptive system are Windows 3.1 and MacOS 6.x).
Now, TUNES is a secure system, and we can and shall use cooperation whenever possible (=almost always): we shall require standard compilers to generate cooperative code, that
PAUSEs every now and then.
Of course, real-time threads need some kind of preemption, and untrusted environments (which emulation boxes are) as well; but as long as these use cannot access the preempted thread's data, or any resource shared by the preempted thread (like garbage collected memory), everything is fine.
Actually, threads are interrupted and signaled by more priviledged threads, but not preempted by equally-priviledged threads. (they might be preempted by foreign threads, to which they can't compare priviledge). This is why we rather call that interruptible cooperative threads instead of preemptible threads.
Now, a common problem with cooperation is that
PAUSEs are often too irregular, and while having them too far away for each other yields poor response time, having them to near yields too many context switches and related overhead (even in the best case, lots of cache misses). We can solve this problem by requiring some fine
PAUSE resolution, but do actual context switch only when the timer says so (by modifying the code of a
PAUSE code to just a
NEXT until it's time).
This is perfect, but may slow tight loops significantly. One way is to unroll tight loops enough so that this cost is reduced. If this still isn't satisfactory, a more elaborate way involves a "it's time to switch" signal. Before to enter a tight loop, some "interrupt recovery" routine is registered as the signal handler (and is unregistered afterwards); if an interrupt happens while in the tight loop, then the recovery routine is called, which has the responsibility to put the system in a state that follows the usual calling conventions.
Typically, the routine will continue normal execution of the loop until some modified point which will jump back into the remaining of the recovery routine, which can finish the work because it knows the exact point of execution in the loop (which could have been at any point inside the loop).
This modified loop might be achieved through safely self-modifying code, or safely runtime-generated code, if the CPU architectures doesn't make it excessively costly by e.g. requiring to flush all of a big instruction cache afterward; in the case where expensive cache-flushing makes runtime code generation/modification too expensive, a pre-modified copy of the code could be used, which trades memory to gain speed (usually, whoever can afford bloated hardware with large instruction caches can afford some more memory, too, and this only needs be done for tight loops).
I realize that there might be problems in some cases, like page faults, when the recovery routine just can't complete until the interrupt handler returns. This means that either the handler (and all its dependencies) must live in a lower-level space and doesn't need the GC invariants provided by interruptible code, or the recovery routine must be able to roll-back rather than push-forward execution, or critical routines must lock pages in before to enter recoverable mode, or be in the same page as the code being recovered (so that whenever the latter is run present in-core, so is the former) or any combination of the previous.
In all cases, this makes the interrupt handling more complicated, because it has to determine the recovery routine and jump in it. But that's no run-time performance problem since interrupts are a most infrequent case as compared to normal execution. Because of the space/time cost of recovery routines, they shall only be used for time-critical loops and recursive code, where inserting
PAUSEs would be too costly. Whereever a tight loop involves a significant amount of CPU work, the added cost for such a signal handler is negligible; when the loop doesn't involve significant CPU work, then it can be coded in a way that includes a check for context-swich time, so recovery is just toggling a flag, and jumping back to the unmodified original code, that will do everything perfectly.
This isn't a development-time hassle either, because all this code can be automatically produced by the compiler. Only the compiler-writer and low-level programmer have to worry about it - and at worst, the latter could still resort to traditional virtual machine protection, if it really were more suited to their particular case.
Currently, even the Linux kernel uses a restricted version of this technique, developed independently by Linus since version 2.1.8: Software checks for the validity of user-memory access from the kernel are replaced by hardware check with the exception handler using a simple code recovery technique. And of course, most of the complex code is automatically generated from a few high-level macros.
Machines with lots of registers or register banks may have disjoint registers/banks for real-time and normal threads.
Note that this way to handle thread scheduling doesn't interfere anyhow with critical real-time interrupt handling, that wouldn't wait for the scheduled thread to recover, but would take place before the recovery routine is called. Of course this means that there must be multiple slightly-different code generation strategies, depending on the real-time nature and other constraints of some code. But this is all to be handled automatically -- including the automatic mirroring or migration of library routines used in several different contexts.
Hence, there could be a hierarchy of threads, some thread being able to preempt some other, while these other threads cooperate with each other. A thread could be made of subthreads, that would cooperate with each other or not, etc.
Let us give a simple argument why some kind of cooperation or other is always needed: there are always lots of extensions and resources that appear and disappear dynamically, every thread using few of them. A preempter that would pretend to keep track of them all when preemptively saving the context would either do lots of costly useless work (e.g. saving the state of the graphics cards), or fail to manage resources. The preempted thread knows better what needs to be saved or not ("drop that screen -- I'll redraw it faster than you can dump&save it, if I count all the page misses and swapping you'll obtain!"). Surely, lazy saving and shallow-binding techniques can (and should) be used to reduce the overhead of thread switching. But as in other domains where it is used, lazy evaluation is known to be always less efficient whenever the evaluation is know to be strictly required. Shallow binding involves a permanent run-time overhead. And all these techniques won't solve the problem of atomicity with respect to GC operations, etc.
It could be said that the above design naturally stems from the very first principle of cybernetics: information is what allows to take better decisions; hence decisions are better taken by those who do have information (efficient state saving by threads themselves), and the general system design should encourage cooperation between agents and publication of information over blind law enforcement among uncooperating agents.
This page is linked from: No-Kernel