Following the pattern of several other multi-threaded UNIX® kernels, FreeBSD deals with interrupt handlers by giving them their own thread context. Providing a context for interrupt handlers allows them to block on locks. To help avoid latency, however, interrupt threads run at real-time kernel priority. Thus, interrupt handlers should not execute for very long to avoid starving other kernel threads. In addition, since multiple handlers may share an interrupt thread, interrupt handlers should not sleep or use a sleepable lock to avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to as heavyweight interrupt threads. They are called this because switching to an interrupt thread involves a full context switch. In the initial implementation, the kernel was not preemptive and thus interrupts that interrupted a kernel thread would have to wait until the kernel thread blocked or returned to userland before they would have an opportunity to run.
To deal with the latency problems, the kernel in FreeBSD has been made preemptive. Currently, we only preempt a kernel thread when we release a sleep mutex or when an interrupt comes in. However, the plan is to make the FreeBSD kernel fully preemptive as described below.
Not all interrupt handlers execute in a thread context. Instead, some handlers execute directly in primary interrupt context. These interrupt handlers are currently misnamed “fast” interrupt handlers since the INTR_FAST flag used in earlier versions of the kernel is used to mark these handlers. The only interrupts which currently use these types of interrupt handlers are clock interrupts and serial I/O device interrupts. Since these handlers do not have their own context, they may not acquire blocking locks and thus may only use spin mutexes.
Finally, there is one optional optimization that can be added in MD code called lightweight context switches. Since an interrupt thread executes in a kernel context, it can borrow the vmspace of any process. Thus, in a lightweight context switch, the switch to the interrupt thread does not switch vmspaces but borrows the vmspace of the interrupted thread. In order to ensure that the vmspace of the interrupted thread does not disappear out from under us, the interrupted thread is not allowed to execute until the interrupt thread is no longer borrowing its vmspace. This can happen when the interrupt thread either blocks or finishes. If an interrupt thread blocks, then it will use its own context when it is made runnable again. Thus, it can release the interrupted thread.
The cons of this optimization are that it is very machine specific and complex, and thus only worth the effort if there is a large performance improvement. At this point it is probably too early to tell, and in fact the optimization will probably hurt performance, as almost all interrupt handlers will immediately block on Giant and require a thread fix-up when they block. Also, an alternative method of interrupt handling has been proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate which runs in primary interrupt context and a handler which runs in its own thread context.
If an interrupt handler has a predicate, then when an interrupt is triggered, the predicate is run. If the predicate returns true then the interrupt is assumed to be fully handled and the kernel returns from the interrupt. If the predicate returns false or there is no predicate, then the threaded handler is scheduled to run.
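The predicate/handler split described above can be modelled in a few lines of userland C. This is only a sketch of the dispatch rule, not FreeBSD's interrupt API: every name prefixed with model_ is hypothetical, and "scheduling" the threaded handler is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef bool (*model_predicate_t)(void *arg);
typedef void (*model_handler_t)(void *arg);

struct model_intr {
    model_predicate_t predicate; /* optional; runs in primary interrupt context */
    model_handler_t   handler;   /* runs later in its own thread context */
    void             *arg;
    int               scheduled; /* stand-in for "threaded handler is runnable" */
};

/*
 * Called when the interrupt triggers.  Returns true if the predicate
 * fully handled the interrupt, false if the threaded handler was scheduled.
 */
static bool
model_intr_dispatch(struct model_intr *src)
{
    if (src->predicate != NULL && src->predicate(src->arg))
        return (true);
    src->scheduled++;
    return (false);
}

/* Stand-in for the interrupt thread body: run the handler once per request. */
static void
model_ithread_run(struct model_intr *src)
{
    assert(src->handler != NULL);
    while (src->scheduled > 0) {
        src->scheduled--;
        src->handler(src->arg);
    }
}

/* Example device: arg points to a count of pending work items. */
static bool
model_dev_predicate(void *arg)
{
    return (*(int *)arg == 0); /* nothing pending: interrupt fully handled */
}

static void
model_dev_handler(void *arg)
{
    *(int *)arg = 0; /* drain all pending work */
}
```

In this model a source with no predicate always falls through to the threaded handler, which matches the "or there is no predicate" case in the description.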
Fitting lightweight context switches into this scheme might prove rather complicated. Since we may want to change to this scheme at some point in the future, it is probably best to defer work on lightweight context switches until we have settled on the final interrupt handling architecture and determined how lightweight context switches might or might not fit into it.
Kernel preemption is fairly simple. The basic idea is that a CPU should always be doing the highest priority work available. Well, that is the ideal at least. There are a couple of cases where the expense of achieving the ideal is not worth being perfect.
Implementing full kernel preemption is very straightforward: when you schedule a thread to be executed by putting it on a run queue, you check to see if its priority is higher than the currently executing thread. If so, you initiate a context switch to that thread.
While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe. For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute. Also, some code, such as the code to assign an address space number for a process during exec on the Alpha, must not be preempted because it supports the actual context switch code. Preemption is disabled for these code sections by using a critical section.
The responsibility of the critical section API is to prevent context switches inside of a critical section. With a fully preemptive kernel, every setrunqueue of a thread other than the current thread is a preemption point. One implementation is for critical_enter to set a per-thread flag that is cleared by its counterpart. If setrunqueue is called with this flag set, it does not preempt regardless of the priority of the new thread relative to the current thread. However, since critical sections are used in spin mutexes to prevent context switches and multiple spin mutexes can be acquired, the critical section API must support nesting. For this reason the current implementation uses a nesting count instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a critical section are deferred rather than dropped. If a thread that would normally preempt the current thread is made runnable while the current thread is in a critical section, then a per-thread flag is set to indicate that there is a pending preemption. When the outermost critical section is exited, the flag is checked. If the flag is set, then the current thread is preempted to allow the higher priority thread to run.
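The interaction between the nesting count and deferred preemption can be illustrated with a small single-CPU model in userland C. It is a sketch under several assumptions, not the kernel's implementation: the model_ functions, the field names, and the single model_deferred slot are all hypothetical, a lower numeric priority means a higher priority, and a "context switch" is reduced to changing model_curthread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified per-thread state; the field names are assumptions. */
struct model_thread {
    int  priority;        /* lower value means higher priority */
    int  crit_nesting;    /* critical section nesting count */
    bool preempt_pending; /* a preemption was deferred */
};

static struct model_thread *model_curthread; /* thread running on the one CPU */
static struct model_thread *model_deferred;  /* best thread waiting to preempt */
static int model_switches;                   /* number of context switches */

static void
model_switch_to(struct model_thread *td)
{
    model_curthread = td;
    model_switches++;
}

static void
model_critical_enter(void)
{
    model_curthread->crit_nesting++;
}

static void
model_critical_exit(void)
{
    struct model_thread *cur = model_curthread;

    assert(cur->crit_nesting > 0);
    /* Only leaving the outermost section acts on a deferred preemption. */
    if (--cur->crit_nesting == 0 && cur->preempt_pending) {
        struct model_thread *td = model_deferred;

        cur->preempt_pending = false;
        model_deferred = NULL;
        model_switch_to(td);
    }
}

/* Make td runnable; every call for another thread is a preemption point. */
static void
model_setrunqueue(struct model_thread *td)
{
    struct model_thread *cur = model_curthread;

    if (td == cur || td->priority >= cur->priority)
        return; /* no preemption; a real kernel would just enqueue td */
    if (cur->crit_nesting > 0) {
        /* Defer rather than drop the preemption. */
        cur->preempt_pending = true;
        if (model_deferred == NULL || td->priority < model_deferred->priority)
            model_deferred = td;
        return;
    }
    model_switch_to(td);
}
```

With two nested critical sections, a higher priority thread made runnable inside them only takes over the CPU when the second model_critical_exit call drops the count to zero.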
Interrupts pose a problem with regards to spin mutexes. If a low-level interrupt handler needs a lock, it must not interrupt any code that needs that lock, to avoid possible data structure corruption. Currently, providing this mechanism is piggybacked onto the critical section API by means of the cpu_critical_enter and cpu_critical_exit functions. Currently this API disables and re-enables interrupts on all of FreeBSD's current platforms. This approach may not be purely optimal, but it is simple to understand and simple to get right. Theoretically, this second API need only be used for spin mutexes that are used in primary interrupt context. However, to make the code simpler, it is used for all spin mutexes and even all critical sections. It may be desirable to split out the MD API from the MI API and only use it in conjunction with the MI API in the spin mutex implementation. If this approach is taken, then the MD API likely would need a rename to show that it is a separate API.
As mentioned earlier, a couple of trade-offs have been made to sacrifice cases where perfect preemption may not always provide the best performance.
The first trade-off is that the preemption code does not take other CPUs into account. Suppose we have two CPUs, A and B, with the priority of A's thread as 4 and the priority of B's thread as 2. If CPU B makes a thread with priority 1 runnable, then in theory, we want CPU A to switch to the new thread so that we will be running the two highest priority runnable threads. However, the cost of determining which CPU to enforce a preemption on, as well as actually signaling that CPU via an IPI along with the synchronization that would be required, would be enormous. Thus, the current code would instead force CPU B to switch to the higher priority thread. Note that this still puts the system in a better position as CPU B is executing a thread of priority 1 rather than a thread of priority 2.
The second trade-off limits immediate kernel preemption to real-time priority kernel threads. In the simple case of preemption defined above, a thread is always preempted immediately (or as soon as a critical section is exited) if a higher priority thread is made runnable. However, many threads executing in the kernel only execute in a kernel context for a short time before either blocking or returning to userland. Thus, if the kernel preempts these threads to run another non-realtime kernel thread, the kernel may switch out the executing thread just before it is about to sleep or return to userland. The cache on the CPU must then adjust to the new thread. When the kernel returns to the preempted thread, it must refill all the cache information that was lost. In addition, two extra context switches are performed that could be avoided if the kernel deferred the preemption until the first thread blocked or returned to userland. Thus, by default, the preemption code will only preempt immediately if the higher priority thread is a real-time priority thread.
Turning on full kernel preemption for all kernel threads has value as a debugging aid since it exposes more race conditions. It is especially useful on UP systems where many races are hard to simulate otherwise. Thus, there is a kernel option FULL_PREEMPTION to enable preemption for all kernel threads that can be used for debugging purposes.
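The second trade-off amounts to one extra test in the preemption decision. The following userland C sketch shows that decision as a pure function. The cutoff MODEL_PRI_MAX_REALTIME and its value are made up for illustration, the full_preemption parameter stands in for building the kernel with the FULL_PREEMPTION option, and lower numeric values mean higher priority.

```c
#include <stdbool.h>

/* Hypothetical cutoff: priorities at or below this value count as real-time. */
#define MODEL_PRI_MAX_REALTIME 47

/*
 * Decide whether a newly runnable thread should preempt the current
 * thread immediately.  Returning false means the switch is left until
 * the current thread blocks or returns to userland.
 */
static bool
model_preempt_now(int cur_pri, int new_pri, bool full_preemption)
{
    if (new_pri >= cur_pri)
        return (false); /* new thread is not higher priority */
    if (full_preemption)
        return (true);  /* debugging mode: always preempt */
    return (new_pri <= MODEL_PRI_MAX_REALTIME);
}
```

Under these assumptions, a thread at priority 80 does not immediately preempt one at priority 100 by default, but does when full_preemption is true.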
Simply put, a thread migrates when it moves from one CPU to another. In a non-preemptive kernel this can only happen at well-defined points such as when calling msleep or returning to userland. However, in the preemptive kernel, an interrupt can force a preemption and possible migration at any time. This can have negative effects on per-CPU data since, with the exception of curthread and curpcb, the data can change whenever you migrate. Since you can potentially migrate at any time, this renders unprotected per-CPU data access rather useless. Thus it is desirable to be able to disable migration for sections of code that need per-CPU data to be stable.
Critical sections currently prevent migration since they do not allow context switches. However, this may be too strong of a requirement to enforce in some cases since a critical section also effectively blocks interrupt threads on the current processor. As a result, another API has been provided to allow the current thread to indicate that if it is preempted it should not migrate to another CPU.
This API is known as thread pinning and is provided by the scheduler. The API consists of two functions: sched_pin and sched_unpin. These functions manage a per-thread nesting count td_pinned. A thread is pinned when its nesting count is greater than zero and a thread starts off unpinned with a nesting count of zero. Each scheduler implementation is required to ensure that pinned threads are only executed on the CPU that they were executing on when sched_pin was first called. Since the nesting count is only written to by the thread itself and is only read by other threads when the pinned thread is not executing but while sched_lock is held, td_pinned does not need any locking. The sched_pin function increments the nesting count and sched_unpin decrements the nesting count. Note that these functions only operate on the current thread and bind the current thread to the CPU it is executing on at the time. To bind an arbitrary thread to a specific CPU, the sched_bind and sched_unbind functions should be used instead.
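The nesting-count behaviour of pinning can be shown with a userland C model. This is a sketch, not the scheduler's code: the model_ functions take the thread as an explicit argument, whereas the functions described above operate on the current thread, and the td_cpu field and model_pick_cpu helper are assumptions standing in for whatever a scheduler consults when choosing a CPU.

```c
#include <assert.h>

/* Simplified thread; only td_pinned is named in the text above. */
struct model_thread {
    int td_pinned; /* pin nesting count; zero means unpinned */
    int td_cpu;    /* CPU the thread last ran on (assumed field) */
};

static void
model_sched_pin(struct model_thread *td)
{
    td->td_pinned++;
}

static void
model_sched_unpin(struct model_thread *td)
{
    assert(td->td_pinned > 0); /* unbalanced unpin is a bug */
    td->td_pinned--;
}

/*
 * Stand-in for the scheduler's CPU choice when resuming td: a pinned
 * thread must stay on its CPU, an unpinned one may go to preferred_cpu.
 */
static int
model_pick_cpu(const struct model_thread *td, int preferred_cpu)
{
    return (td->td_pinned > 0 ? td->td_cpu : preferred_cpu);
}
```

Because the count nests, a thread pinned twice stays pinned after one unpin and only becomes eligible for migration after the second.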
The timeout kernel facility permits kernel services to register functions for execution as part of the softclock software interrupt. Events are scheduled based on a desired number of clock ticks, and callbacks to the consumer-provided function will occur at approximately the right time.
The global list of pending timeout events is protected by a global spin mutex, callout_lock; all access to the timeout list must be performed with this mutex held. When softclock is woken up, it scans the list of pending timeouts for those that should fire. In order to avoid lock order reversal, the softclock thread will release the callout_lock mutex when invoking the provided timeout callback function.
If the CALLOUT_MPSAFE flag was not set during registration, then Giant will be grabbed before invoking the callout, and then released afterwards. The callout_lock mutex will be re-grabbed before proceeding. The softclock code is careful to leave the list in a consistent state while releasing the mutex. If DIAGNOSTIC is enabled, then the time taken to execute each function is measured, and a warning is generated if it exceeds a threshold.
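The locking order followed by softclock can be traced with a single-threaded userland C model. It is a sketch under loud assumptions: the two locks are plain boolean flags rather than real mutexes, the pending list is a fixed array, and every model_ and MODEL_ name (including the flag value) is hypothetical. It only demonstrates the sequence: mark the entry as no longer pending, drop the list lock, take Giant if the callout is not MP-safe, call the function, then re-take the list lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_CALLOUT_MPSAFE 0x1 /* hypothetical flag value */

struct model_callout {
    int   c_time;    /* tick at which the callout should fire */
    int   c_flags;
    bool  c_pending;
    void (*c_func)(void *);
    void *c_arg;
};

/* Flags standing in for callout_lock and Giant; not real locks. */
static bool model_callout_lock_held;
static bool model_giant_held;

/* Fire every pending callout due at or before now; return how many ran. */
static int
model_softclock(struct model_callout *list, size_t n, int now)
{
    int fired = 0;

    model_callout_lock_held = true;
    for (size_t i = 0; i < n; i++) {
        struct model_callout *c = &list[i];
        bool need_giant;

        if (!c->c_pending || c->c_time > now)
            continue;
        /* Leave the list consistent before dropping the lock. */
        c->c_pending = false;
        need_giant = (c->c_flags & MODEL_CALLOUT_MPSAFE) == 0;
        model_callout_lock_held = false;
        if (need_giant)
            model_giant_held = true;
        c->c_func(c->c_arg);
        if (need_giant)
            model_giant_held = false;
        model_callout_lock_held = true; /* re-grab before proceeding */
        fired++;
    }
    model_callout_lock_held = false;
    return (fired);
}

/* Example callback that records which locks were held when it ran. */
struct model_probe {
    int  calls;
    bool saw_callout_lock;
    bool saw_giant;
};

static void
model_probe_cb(void *arg)
{
    struct model_probe *p = arg;

    p->calls++;
    p->saw_callout_lock = model_callout_lock_held;
    p->saw_giant = model_giant_held;
}
```

The probe callback makes the ordering observable: it never sees the list lock held, and it sees Giant held only for callouts registered without the MP-safe flag.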