Following the pattern of several other multi-threaded UNIX® kernels, FreeBSD deals with interrupt handlers by giving them their own thread context. Providing a context for interrupt handlers allows them to block on locks. To help avoid latency, however, interrupt threads run at real-time kernel priority. Thus, interrupt handlers should not execute for very long to avoid starving other kernel threads. In addition, since multiple handlers may share an interrupt thread, interrupt handlers should not sleep or use a sleepable lock to avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to as heavyweight interrupt threads. They are called this because switching to an interrupt thread involves a full context switch. In the initial implementation, the kernel was not preemptive and thus interrupts that interrupted a kernel thread would have to wait until the kernel thread blocked or returned to userland before they would have an opportunity to run.
To deal with the latency problems, the kernel in FreeBSD has been made preemptive. Currently, we only preempt a kernel thread when we release a sleep mutex or when an interrupt comes in. However, the plan is to make the FreeBSD kernel fully preemptive as described below.
Not all interrupt handlers execute in a thread context. Instead, some handlers execute directly in primary interrupt context. These interrupt handlers are currently misnamed “fast” interrupt handlers since the INTR_FAST flag used in earlier versions of the kernel is used to mark these handlers. The only interrupts which currently use this type of interrupt handler are clock interrupts and serial I/O device interrupts. Since these handlers do not have their own context, they may not acquire blocking locks and thus may only use spin mutexes.
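As an illustration of that constraint, a handler running in primary interrupt context could protect state shared with the rest of its driver using only a spin mutex, roughly as sketched below. This is not taken from any real driver; the foo names are hypothetical.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    /* Hypothetical state shared between the handler and the rest of the driver. */
    static struct mtx foo_intr_mtx;
    static int foo_pending_events;

    static void
    foo_init(void)
    {
        /* A spin mutex is the only kind of lock a "fast" handler may take. */
        mtx_init(&foo_intr_mtx, "foo intr", NULL, MTX_SPIN);
    }

    /* Runs in primary interrupt context, so it must never block or sleep. */
    static void
    foo_fast_intr(void *arg)
    {
        mtx_lock_spin(&foo_intr_mtx);
        foo_pending_events++;        /* record work for later processing */
        mtx_unlock_spin(&foo_intr_mtx);
    }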
Finally, there is one optional optimization that can be added in MD code called lightweight context switches. Since an interrupt thread executes in a kernel context, it can borrow the vmspace of any process. Thus, in a lightweight context switch, the switch to the interrupt thread does not switch vmspaces but borrows the vmspace of the interrupted thread. In order to ensure that the vmspace of the interrupted thread does not disappear out from under us, the interrupted thread is not allowed to execute until the interrupt thread is no longer borrowing its vmspace. This can happen when the interrupt thread either blocks or finishes. If an interrupt thread blocks, then it will use its own context when it is made runnable again. Thus, it can release the interrupted thread.
The cons of this optimization are that it is very machine specific and complex, and thus only worth the effort if there is a large performance improvement. At this point it is probably too early to tell, and in fact the optimization will probably hurt performance, as almost all interrupt handlers will immediately block on Giant and require a thread fix-up when they block. Also, an alternative method of interrupt handling has been proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate which runs in primary interrupt context and a handler which runs in its own thread context.
If an interrupt handler has a predicate, then when an interrupt is triggered, the predicate is run. If the predicate returns true then the interrupt is assumed to be fully handled and the kernel returns from the interrupt. If the predicate returns false or there is no predicate, then the threaded handler is scheduled to run.
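A rough sketch of the dispatch logic for this proposed scheme appears below. None of these names correspond to an existing kernel interface; the structure and helper are purely illustrative.

    /*
     * Purely illustrative sketch of the proposed two-part interrupt scheme.
     */
    struct proposed_handler {
        int    (*predicate)(void *arg);    /* runs in primary interrupt context */
        void   (*handler)(void *arg);      /* runs in its own thread context */
        void   *arg;
    };

    /* Hypothetical helper that schedules the threaded half of the handler. */
    static void schedule_threaded_handler(struct proposed_handler *ph);

    static void
    proposed_dispatch(struct proposed_handler *ph)
    {
        /*
         * If a predicate exists and reports that the interrupt was fully
         * handled, the kernel can return from the interrupt immediately.
         */
        if (ph->predicate != NULL && ph->predicate(ph->arg))
            return;

        /* Otherwise hand the work off to the threaded handler. */
        schedule_threaded_handler(ph);
    }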
Fitting lightweight context switches into this scheme might prove rather complicated. Since we may want to change to this scheme at some point in the future, it is probably best to defer work on lightweight context switches until we have settled on the final interrupt handling architecture and determined how lightweight context switches might or might not fit into it.
Kernel preemption is fairly simple. The basic idea is that a CPU should always be doing the highest priority work available. Well, that is the ideal at least. There are a couple of cases where the expense of achieving the ideal is not worth being perfect.
Implementing full kernel preemption is very straightforward: when you schedule a thread to be executed by putting it on a run queue, you check to see if its priority is higher than the currently executing thread. If so, you initiate a context switch to that thread.
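In outline, that check might look like the sketch below. put_on_run_queue is a stand-in for the scheduler's real queue-insertion code, locking is omitted, and lower numeric priority values denote higher priority.

    /* Hypothetical stand-in for the scheduler's queue-insertion code. */
    static void put_on_run_queue(struct thread *td);

    static void
    setrunqueue_sketch(struct thread *td)
    {
        put_on_run_queue(td);        /* make the thread runnable */

        /*
         * If the newly runnable thread is higher priority than the thread
         * currently running on this CPU, context switch to it right away.
         */
        if (td->td_priority < curthread->td_priority)
            mi_switch();
    }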
While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe. For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute. Also, some code, such as the code to assign an address space number for a process during exec on the Alpha, needs to not be preempted as it supports the actual context switch code. Preemption is disabled for these code sections by using a critical section.
The responsibility of the critical section API is to prevent context switches inside of a critical section. With a fully preemptive kernel, every setrunqueue of a thread other than the current thread is a preemption point. One implementation is for critical_enter to set a per-thread flag that is cleared by its counterpart. If setrunqueue is called with this flag set, it does not preempt regardless of the priority of the new thread relative to the current thread. However, since critical sections are used in spin mutexes to prevent context switches and multiple spin mutexes can be acquired, the critical section API must support nesting. For this reason the current implementation uses a nesting count instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a critical section are deferred rather than dropped. If a thread that would normally be preempted to is made runnable while the current thread is in a critical section, then a per-thread flag is set to indicate that there is a pending preemption. When the outermost critical section is exited, the flag is checked. If the flag is set, then the current thread is preempted to allow the higher priority thread to run.
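Putting the last two paragraphs together, a simplified sketch of the critical section entry and exit paths is shown below. The field names td_critnest and td_owepreempt are chosen to match the description above, and the necessary locking is omitted; treat this as an illustration rather than the actual implementation.

    void
    critical_enter(void)
    {
        struct thread *td = curthread;

        td->td_critnest++;        /* nesting count, not a single flag */
    }

    void
    critical_exit(void)
    {
        struct thread *td = curthread;

        td->td_critnest--;
        /*
         * When leaving the outermost critical section, honor any
         * preemption that was deferred while the section was held.
         */
        if (td->td_critnest == 0 && td->td_owepreempt) {
            td->td_owepreempt = 0;
            mi_switch();        /* switch to the higher priority thread */
        }
    }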
Interrupts pose a problem with regard to spin mutexes. If a low-level interrupt handler needs a lock, it must not interrupt any code needing that lock, in order to avoid possible data structure corruption. Currently, this mechanism is piggybacked onto the critical section API by means of the cpu_critical_enter and cpu_critical_exit functions. This API currently disables and re-enables interrupts on all of FreeBSD's platforms. This approach may not be purely optimal, but it is simple to understand and simple to get right. Theoretically, this second API need only be used for spin mutexes that are used in primary interrupt context. However, to make the code simpler, it is used for all spin mutexes and even all critical sections. It may be desirable to split out the MD API from the MI API and only use it in conjunction with the MI API in the spin mutex implementation. If this approach is taken, then the MD API would likely need a rename to show that it is a separate API.
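On a typical platform the MD half of this arrangement amounts to saving and restoring the interrupt state, roughly as sketched below; the exact function and field names vary by architecture and kernel version, and md_savecrit here is illustrative.

    static __inline void
    cpu_critical_enter_sketch(struct thread *td)
    {
        /* Save the current interrupt state and mask interrupts. */
        td->td_md.md_savecrit = intr_disable();
    }

    static __inline void
    cpu_critical_exit_sketch(struct thread *td)
    {
        /* Restore whatever interrupt state was in effect before. */
        intr_restore(td->td_md.md_savecrit);
    }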
As mentioned earlier, a couple of trade-offs have been made to sacrifice cases where perfect preemption may not always provide the best performance.
The first trade-off is that the preemption code does not take other CPUs into account. Suppose we have two CPUs, A and B, where the priority of A's thread is 4 and the priority of B's thread is 2. If CPU B makes a thread with priority 1 runnable, then in theory we want CPU A to switch to the new thread so that we will be running the two highest priority runnable threads. However, the cost of determining which CPU to enforce a preemption on, as well as actually signaling that CPU via an IPI, along with the synchronization that would be required, would be enormous. Thus, the current code instead forces CPU B to switch to the higher priority thread. Note that this still puts the system in a better position as CPU B is executing a thread of priority 1 rather than a thread of priority 2.
The second trade-off limits immediate kernel preemption to real-time priority kernel threads. In the simple case of preemption defined above, a thread is always preempted immediately (or as soon as a critical section is exited) if a higher priority thread is made runnable. However, many threads executing in the kernel only execute in a kernel context for a short time before either blocking or returning to userland. Thus, if the kernel preempts these threads to run another non-realtime kernel thread, the kernel may switch out the executing thread just before it is about to sleep or return to userland. The cache on the CPU must then adjust to the new thread. When the kernel returns to the preempted thread, it must refill all the cache information that was lost. In addition, two extra context switches are performed that could be avoided if the kernel deferred the preemption until the first thread blocked or returned to userland. Thus, by default, the preemption code will only preempt immediately if the higher priority thread is a real-time priority thread.
Turning on full kernel preemption for all kernel threads has value as a debugging aid since it exposes more race conditions. It is especially useful on UP systems where many races are hard to simulate otherwise. Thus, there is a kernel option FULL_PREEMPTION that enables preemption for all kernel threads and can be used for debugging purposes.
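The policy described in the last two paragraphs can be summarized in code roughly as follows. This is an illustrative sketch only: the priority cutoff macro is a placeholder rather than the kernel's real priority ranges, and the real check lives inside the scheduler.

    #define PRI_REALTIME_CUTOFF_SKETCH    64    /* placeholder priority cutoff */

    static int
    should_preempt_sketch(struct thread *newtd, struct thread *curtd)
    {
        /* Only ever preempt to a strictly higher priority thread. */
        if (newtd->td_priority >= curtd->td_priority)
            return (0);
    #ifndef FULL_PREEMPTION
        /*
         * By default, only interrupt and real-time priority kernel threads
         * cause an immediate preemption.
         */
        if (newtd->td_priority > PRI_REALTIME_CUTOFF_SKETCH)
            return (0);
    #endif
        return (1);
    }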
Simply put, a thread migrates when it moves from one CPU to another. In a non-preemptive kernel this can only happen at well-defined points such as when calling msleep or returning to userland. However, in a preemptive kernel, an interrupt can force a preemption and possible migration at any time. This can have negative effects on per-CPU data since, with the exception of curthread and curpcb, the data can change whenever you migrate. Since you can potentially migrate at any time, this renders unprotected per-CPU data access rather useless. Thus it is desirable to be able to disable migration for sections of code that need per-CPU data to be stable.
Critical sections currently prevent migration since they do not allow context switches. However, this may be too strong a requirement to enforce in some cases since a critical section also effectively blocks interrupt threads on the current processor. As a result, another API has been provided to allow the current thread to indicate that if it is preempted it should not migrate to another CPU.
This API is known as thread pinning and is provided by the scheduler. The API consists of two functions: sched_pin and sched_unpin. These functions manage a per-thread nesting count, td_pinned. A thread is pinned when its nesting count is greater than zero, and a thread starts off unpinned with a nesting count of zero. Each scheduler implementation is required to ensure that pinned threads are only executed on the CPU that they were executing on when sched_pin was first called. Since the nesting count is only written to by the thread itself and is only read by other threads when the pinned thread is not executing but while sched_lock is held, td_pinned does not need any locking. The sched_pin function increments the nesting count and sched_unpin decrements the nesting count. Note that these functions only operate on the current thread and bind the current thread to the CPU it is executing on at the time. To bind an arbitrary thread to a specific CPU, the sched_bind and sched_unbind functions should be used instead.
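For example, code that briefly consults per-CPU data could bracket the access with the pinning functions along these lines; the per-CPU statistics structure and the foo helpers are hypothetical.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>
    #include <sys/proc.h>
    #include <sys/sched.h>

    /* Hypothetical per-CPU statistics and helpers, for illustration only. */
    struct foo_pcpu_stats;
    struct foo_pcpu_stats *foo_get_pcpu_stats(int cpu);
    void foo_update_stats(struct foo_pcpu_stats *stats);

    static void
    foo_touch_pcpu_data(void)
    {
        sched_pin();        /* stay on this CPU until sched_unpin() */
        foo_update_stats(foo_get_pcpu_stats(curcpu));
        sched_unpin();      /* migration is allowed again */
    }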
The timeout kernel facility permits kernel services to register functions for execution as part of the softclock software interrupt. Events are scheduled based on a desired number of clock ticks, and callbacks to the consumer-provided function will occur at approximately the right time.
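As a usage sketch, a consumer of this facility might arrange for a callback to run roughly once a second as shown below, using the callout interface of this era; the foo names are hypothetical.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/callout.h>

    /* Hypothetical driver state carrying its own callout. */
    struct foo_softc {
        struct callout sc_callout;
    };

    static void
    foo_tick(void *arg)
    {
        struct foo_softc *sc = arg;

        /* ... periodic work would go here ... */

        /* Reschedule ourselves to run again in about one second. */
        callout_reset(&sc->sc_callout, hz, foo_tick, sc);
    }

    static void
    foo_start(struct foo_softc *sc)
    {
        /* CALLOUT_MPSAFE means Giant need not be held around foo_tick(). */
        callout_init(&sc->sc_callout, CALLOUT_MPSAFE);
        callout_reset(&sc->sc_callout, hz, foo_tick, sc);
    }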
The global list of pending timeout events is protected by a global spin mutex, callout_lock; all access to the timeout list must be performed with this mutex held. When softclock is woken up, it scans the list of pending timeouts for those that should fire. In order to avoid lock order reversal, the softclock thread will release the callout_lock mutex when invoking the provided timeout callback function. If the CALLOUT_MPSAFE flag was not set during registration, then Giant will be grabbed before invoking the callout, and then released afterwards. The callout_lock mutex will be re-grabbed before proceeding. The softclock code is careful to leave the list in a consistent state while releasing the mutex. If DIAGNOSTIC is enabled, then the time taken to execute each function is measured, and a warning is generated if it exceeds a threshold.
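The locking described above can be summarized in a simplified sketch of the softclock loop. This is illustrative only: next_expired_callout is a hypothetical helper, and the bookkeeping the real code performs to keep the list consistent (as well as the DIAGNOSTIC timing) is omitted.

    extern struct mtx callout_lock;

    /* Hypothetical helper returning the next timeout that is due to fire. */
    static struct callout *next_expired_callout(void);

    static void
    softclock_sketch(void)
    {
        struct callout *c;

        mtx_lock_spin(&callout_lock);
        while ((c = next_expired_callout()) != NULL) {
            /* Drop the spin lock before calling out to avoid lock order reversal. */
            mtx_unlock_spin(&callout_lock);
            if (!(c->c_flags & CALLOUT_MPSAFE))
                mtx_lock(&Giant);        /* non-MPSAFE callouts run under Giant */
            c->c_func(c->c_arg);
            if (!(c->c_flags & CALLOUT_MPSAFE))
                mtx_unlock(&Giant);
            mtx_lock_spin(&callout_lock);    /* re-grab before scanning further */
        }
        mtx_unlock_spin(&callout_lock);
    }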