This section talks about machine independent part of the Linuxulator. It covers the emulation infrastructure needed for Linux® 2.6 emulation, the thread local storage (TLS) implementation (on i386) and futexes. Then we talk briefly about some syscalls.
One of the major areas of progress in development of Linux 2.6 was threading. Prior to 2.6, the Linux threading support was implemented in the linuxthreads library. The library was a partial
implementation of POSIX® threading. The threading
was implemented using separate processes for each thread using the clone
syscall to let them share the address space (and other
things). The main weaknesses of this approach was that every thread had a different
PID, signal handling was broken (from the pthreads perspective), etc. Also the
performance was not very good (use of SIGUSR signals for
threads synchronization, kernel resource consumption, etc.) so to overcome
these problems a new threading system was developed and named NPTL.
The NPTL library focused on two things but a third thing came along so it is usually considered a part of NPTL. Those two things were embedding of threads into a process structure and futexes. The additional third thing was TLS, which is not directly required by NPTL but the whole NPTL userland library depends on it. Those improvements yielded in much improved performance and standards conformance. NPTL is a standard threading library in Linux systems these days.
The FreeBSD Linuxulator implementation approaches the NPTL in three main areas. The TLS, futexes and PID mangling, which is meant to simulate the Linux threads. Further sections describe each of these areas.
These sections deal with the way Linux threads are managed and how we simulate that in FreeBSD.
The Linux emulation layer in FreeBSD supports runtime setting of the emulated version. This is done via sysctl(8), namely compat.linux.osrelease, which is set to 2.4.2 by default (as of April 2007) and with all Linux versions up to 2.6 it just determined what uname(1) outputs. It is different with 2.6 emulation where setting this sysctl(8) affects runtime behaviour of the emulation layer. When set to 2.6.x it sets the value of linux_use_linux26 while setting to something else keeps it unset. This variable (plus per-prison variables of the very same kind) determines whether 2.6 infrastructure (mainly PID mangling) is used in the code or not. The version setting is done system-wide and this affects all Linux processes. The sysctl(8) should not be changed when running any Linux binary as it might harm things.
The semantics of Linux threading are a little confusing and uses entirely different nomenclature to FreeBSD. A process in Linux consists of a struct task embedding two identifier fields - PID and TGID. PID is not a process ID but it is a thread ID. The TGID identifies a thread group in other words a process. For single-threaded process the PID equals the TGID.
The thread in NPTL is just an ordinary process that happens to have TGID not equal to PID and have a group leader not equal to itself (and shared VM etc. of course). Everything else happens in the same way as to an ordinary process. There is no separation of a shared status to some external structure like in FreeBSD. This creates some duplication of information and possible data inconsistency. The Linux kernel seems to use task -> group information in some places and task information elsewhere and it is really not very consistent and looks error-prone.
Every NPTL thread is created by a call to the clone
syscall with a specific set of flags (more in the
next subsection). The NPTL implements strict 1:1 threading.
In FreeBSD we emulate NPTL threads with ordinary FreeBSD processes that share VM space, etc. and the PID gymnastic is just mimicked in the emulation specific structure attached to the process. The structure attached to the process looks like:
struct linux_emuldata { pid_t pid; int *child_set_tid; /* in clone(): Child.s TID to set on clone */ int *child_clear_tid;/* in clone(): Child.s TID to clear on exit */ struct linux_emuldata_shared *shared; int pdeath_signal; /* parent death signal */ LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */ };
The PID is used to identify the FreeBSD process that attaches this structure.
The child_se_tid
and child_clear_tid
are used for TID address copyout when a
process exits and is created. The shared
pointer
points to a structure shared among threads. The pdeath_signal
variable identifies the parent death signal
and the threads
pointer is used to link this structure
to the list of threads. The linux_emuldata_shared
structure looks like:
struct linux_emuldata_shared { int refs; pid_t group_pid; LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */ };
The refs
is a reference counter being used to
determine when we can free the structure to avoid memory leaks. The group_pid
is to identify PID ( = TGID) of the whole process
( = thread group). The threads
pointer is the head of
the list of threads in the process.
The linux_emuldata structure can be obtained from the
process using em_find
. The prototype of the function
is:
struct linux_emuldata *em_find(struct proc *, int locked);
Here, proc
is the process we want the emuldata
structure from and the locked parameter determines whether we want to lock or not.
The accepted values are EMUL_DOLOCK and EMUL_DOUNLOCK. More about locking later.
Because of the described different view knowing what a process ID and thread ID is between FreeBSD and Linux we have to translate the view somehow. We do it by PID mangling. This means that we fake what a PID (=TGID) and TID (=PID) is between kernel and userland. The rule of thumb is that in kernel (in Linuxulator) PID = PID and TGID = shared -> group pid and to userland we present PID = shared -> group_pid and TID = proc -> p_pid. The PID member of linux_emuldata structure is a FreeBSD PID.
The above affects mainly getpid, getppid, gettid syscalls. Where we use PID/TGID
respectively. In copyout of TIDs in child_clear_tid
and child_set_tid
we copy out FreeBSD PID.
The clone
syscall is the way threads are created
in Linux. The syscall prototype looks like this:
int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy, void * child_tidptr);
The flags
parameter tells the syscall how exactly
the processes should be cloned. As described above, Linux
can create processes sharing various things independently, for example two
processes can share file descriptors but not VM, etc. Last byte of the flags
parameter is the exit signal of the newly created
process. The stack
parameter if non-NULL tells, where the thread stack is and if it is NULL we are supposed to copy-on-write the calling process
stack (i.e. do what normal fork(2) routine does).
The parent_tidptr
parameter is used as an address for
copying out process PID (i.e. thread id) once the process is sufficiently
instantiated but is not runnable yet. The dummy
parameter is here because of the very strange calling convention of this syscall on
i386. It uses the registers directly and does not let the compiler do it what
results in the need of a dummy syscall. The child_tidptr
parameter is used as an address for copying out
PID once the process has finished forking and when the process exits.
The syscall itself proceeds by setting corresponding flags depending on the
flags passed in. For example, CLONE_VM maps to RFMEM
(sharing of VM), etc. The only nit here is CLONE_FS and
CLONE_FILES because FreeBSD does not allow setting
this separately so we fake it by not setting RFFDG (copying of fd table and other
fs information) if either of these is defined. This does not cause any problems,
because those flags are always set together. After setting the flags the process is
forked using the internal fork1
routine, the process
is instrumented not to be put on a run queue, i.e. not to be set runnable.
After the forking is done we possibly reparent the newly created process to emulate
CLONE_PARENT semantics. Next part is creating the
emulation data. Threads in Linux does not signal
their parents so we set exit signal to be 0 to disable this. After that setting of
child_set_tid
and child_clear_tid
is performed enabling the functionality
later in the code. At this point we copy out the PID to the address specified by
parent_tidptr
. The setting of process stack is done by
simply rewriting thread frame %esp
register (%rsp
on amd64). Next part is setting up TLS for the newly
created process. After this vfork(2) semantics
might be emulated and finally the newly created process is put on a run queue and
copying out its PID to the parent process via clone
return value is done.
The clone
syscall is able and in fact is used for
emulating classic fork(2) and vfork(2) syscalls.
Newer glibc in a case of 2.6 kernel uses clone
to implement fork(2) and vfork(2) syscalls.
The locking is implemented to be per-subsystem because we do not expect a lot of contention on these. There are two locks: emul_lock used to protect manipulating of linux_emuldata and emul_shared_lock used to manipulate linux_emuldata_shared. The emul_lock is a nonsleepable blocking mutex while emul_shared_lock is a sleepable blocking sx_lock. Because of the per-subsystem locking we can coalesce some locks and that is why the em find offers the non-locking access.
This section deals with TLS also known as thread local storage.
Threads in computer science are entities within a process that can be scheduled
independently from each other. The threads in the process share process wide data
(file descriptors, etc.) but also have their own stack for their own data.
Sometimes there is a need for process-wide data specific to a given thread. Imagine
a name of the thread in execution or something like that. The traditional
UNIX® threading API, pthreads provides a way to do it via pthread_key_create(3),
pthread_setspecific(3)
and pthread_getspecific(3)
where a thread can create a key to the thread local data and using pthread_getspecific(3)
or pthread_getspecific(3)
to manipulate those data. You can easily see that this is not the most
comfortable way this could be accomplished. So various producers of C/C++ compilers
introduced a better way. They defined a new modifier keyword thread that specifies
that a variable is thread specific. A new method of accessing such variables was
developed as well (at least on i386). The pthreads
method tends to be implemented in userspace as a trivial lookup table. The
performance of such a solution is not very good. So the new method uses (on i386)
segment registers to address a segment, where TLS area is stored so the actual
accessing of a thread variable is just appending the segment register to the
address thus addressing via it. The segment registers are usually %gs
and %fs
acting like segment
selectors. Every thread has its own area where the thread local data are stored and
the segment must be loaded on every context switch. This method is very fast
and used almost exclusively in the whole i386 UNIX world.
Both FreeBSD and Linux implement this approach and
it yields very good results. The only drawback is the need to reload the segment on
every context switch which can slowdown context switches. FreeBSD tries to
avoid this overhead by using only 1 segment descriptor for this while Linux uses 3. Interesting thing is that almost nothing uses
more than 1 descriptor (only Wine seems to use 2)
so Linux pays this unnecessary price for context
switches.
The i386 architecture implements the so called segments. A segment is a
description of an area of memory. The base address (bottom) of the memory area, the
end of it (ceiling), type, protection, etc. The memory described by a segment can
be accessed using segment selector registers (%cs
,
%ds
, %ss
, %es
, %fs
, %gs
). For example let us suppose we have a segment which
base address is 0x1234 and length and this code:
mov %edx,%gs:0x10
This will load the content of the %edx
register
into memory location 0x1244. Some segment registers have a special use, for example
%cs
is used for code segment and %ss
is used for stack segment but %fs
and %gs
are generally
unused. Segments are either stored in a global GDT table or in a local LDT table.
LDT is accessed via an entry in the GDT. The LDT can store more types of segments.
LDT can be per process. Both tables define up to 8191 entries.
There are two main ways of setting up TLS in Linux. It
can be set when cloning a process using the clone
syscall or it can call set_thread_area
. When a
process passes CLONE_SETTLS flag to clone
, the kernel expects the memory pointed to by the
%esi
register a Linux user
space representation of a segment, which gets translated to the machine
representation of a segment and loaded into a GDT slot. The GDT slot can be
specified with a number or -1 can be used meaning that the system itself should
choose the first free slot. In practice, the vast majority of programs use only one
TLS entry and does not care about the number of the entry. We exploit this in the
emulation and in fact depend on it.
Loading of TLS for the current thread happens by calling set_thread_area
while loading TLS for a second process in
clone
is done in the separate block in clone
. Those two functions are very similar. The only
difference being the actual loading of the GDT segment, which happens on the next
context switch for the newly created process while set_thread_area
must load this directly. The code basically
does this. It copies the Linux form segment
descriptor from the userland. The code checks for the number of the descriptor but
because this differs between FreeBSD and Linux we
fake it a little. We only support indexes of 6, 3 and -1. The 6 is genuine Linux number, 3 is genuine FreeBSD one and -1 means
autoselection. Then we set the descriptor number to constant 3 and copy out this to
the userspace. We rely on the userspace process using the number from the
descriptor but this works most of the time (have never seen a case where this did
not work) as the userspace process typically passes in 1. Then we convert the
descriptor from the Linux form to a machine
dependant form (i.e. operating system independent form) and copy this to the
FreeBSD defined segment descriptor. Finally we can load it. We assign the
descriptor to threads PCB (process control block) and load the %gs
segment using load_gs
.
This loading must be done in a critical section so that nothing can interrupt us.
The CLONE_SETTLS case works exactly like this just
the loading using load_gs
is not performed. The
segment used for this (segment number 3) is shared for this use between FreeBSD
processes and Linux processes so the Linux emulation layer does not add any overhead over plain
FreeBSD.
The amd64 implementation is similar to the i386 one but there was initially no 32bit segment descriptor used for this purpose (hence not even native 32bit TLS users worked) so we had to add such a segment and implement its loading on every context switch (when a flag signaling use of 32bit is set). Apart from this the TLS loading is exactly the same just the segment numbers are different and the descriptor format and the loading differs slightly.
Threads need some kind of synchronization and POSIX provides some of them: mutexes for mutual exclusion, read-write locks for mutual exclusion with biased ratio of reads and writes and condition variables for signaling a status change. It is interesting to note that POSIX threading API lacks support for semaphores. Those synchronization routines implementations are heavily dependant on the type threading support we have. In pure 1:M (userspace) model the implementation can be solely done in userspace and thus be very fast (the condition variables will probably end up being implemented using signals, i.e. not fast) and simple. In 1:1 model, the situation is also quite clear - the threads must be synchronized using kernel facilities (which is very slow because a syscall must be performed). The mixed M:N scenario just combines the first and second approach or rely solely on kernel. Threads synchronization is a vital part of thread-enabled programming and its performance can affect resulting program a lot. Recent benchmarks on FreeBSD operating system showed that an improved sx_lock implementation yielded 40% speedup in ZFS (a heavy sx user), this is in-kernel stuff but it shows clearly how important the performance of synchronization primitives is.
Threaded programs should be written with as little contention on locks as possible. Otherwise, instead of doing useful work the thread just waits on a lock. Because of this, the most well written threaded programs show little locks contention.
Linux implements 1:1 threading, i.e. it has to use in-kernel synchronization primitives. As stated earlier, well written threaded programs have little lock contention. So a typical sequence could be performed as two atomic increase/decrease mutex reference counter, which is very fast, as presented by the following example:
pthread_mutex_lock(&mutex); .... pthread_mutex_unlock(&mutex);
1:1 threading forces us to perform two syscalls for those mutex calls, which is very slow.
The solution Linux 2.6 implements is called futexes. Futexes implement the check for contention in userspace and call kernel primitives only in a case of contention. Thus the typical case takes place without any kernel intervention. This yields reasonably fast and flexible synchronization primitives implementation.
The futex syscall looks like this:
int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);
In this example uaddr
is an address of the mutex in
userspace, op
is an operation we are about to perform
and the other parameters have per-operation meaning.
Futexes implement the following operations:
FUTEX_WAIT
FUTEX_WAKE
FUTEX_FD
FUTEX_REQUEUE
FUTEX_CMP_REQUEUE
FUTEX_WAKE_OP
This operation verifies that on address uaddr
the
value val
is written. If not, EWOULDBLOCK is returned, otherwise the thread is queued on the
futex and gets suspended. If the argument timeout
is
non-zero it specifies the maximum time for the sleeping, otherwise the
sleeping is infinite.
This operation takes a futex at uaddr
and wakes up
val
first futexes queued on this futex.
This operations associates a file descriptor with a given futex.
This operation takes val
threads queued on futex at
uaddr
, wakes them up, and takes val2
next threads and requeues them on futex at uaddr2
.
This operation does the same as FUTEX_REQUEUE but it
checks that val3
equals to val
first.
This operation performs an atomic operation on val3
(which contains coded some other value) and uaddr
.
Then it wakes up val
threads on futex at uaddr
and if the atomic operation returned a positive number
it wakes up val2
threads on futex at uaddr2
.
The operations implemented in FUTEX_WAKE_OP:
FUTEX_OP_SET
FUTEX_OP_ADD
FUTEX_OP_OR
FUTEX_OP_AND
FUTEX_OP_XOR
Note: There is no
val2
parameter in the futex prototype. Theval2
is taken from thestruct timespec *timeout
parameter for operations FUTEX_REQUEUE, FUTEX_CMP_REQUEUE and FUTEX_WAKE_OP.
The futex emulation in FreeBSD is taken from NetBSD and further extended by us. It is placed in linux_futex.c and linux_futex.h files. The futex structure looks like:
struct futex { void *f_uaddr; int f_refcount; LIST_ENTRY(futex) f_list; TAILQ_HEAD(lf_waiting_paroc, waiting_proc) f_waiting_proc; };
And the structure waiting_proc is:
struct waiting_proc { struct thread *wp_t; struct futex *wp_new_futex; TAILQ_ENTRY(waiting_proc) wp_list; };
A futex is obtained using the futex_get
function,
which searches a linear list of futexes and returns the found one or creates a new
futex. When releasing a futex from the use we call the futex_put
function, which decreases a reference counter of
the futex and if the refcount reaches zero it is released.
When a futex queues a thread for sleeping it creates a working_proc structure and puts this structure to the list
inside the futex structure then it just performs a tsleep(9) to suspend
the thread. The sleep can be timed out. After tsleep(9) returns (the
thread was woken up or it timed out) the working_proc
structure is removed from the list and is destroyed. All this is done in the
futex_sleep
function. If we got woken up from futex_wake
we have wp_new_futex
set so we sleep on it. This way the actual
requeueing is done in this function.
Waking up a thread sleeping on a futex is performed in the futex_wake
function. First in this function we mimic the
strange Linux behaviour, where it wakes up N threads
for all operations, the only exception is that the REQUEUE operations are performed
on N+1 threads. But this usually does not make any difference as we are waking up
all threads. Next in the function in the loop we wake up n threads, after this we
check if there is a new futex for requeueing. If so, we requeue up to n2
threads on the new futex. This cooperates with futex_sleep
.
The FUTEX_WAKE_OP operation is quite complicated. First
we obtain two futexes at addresses uaddr
and uaddr2
then we perform the atomic operation using val3
and uaddr2
. Then val
waiters on the first futex is woken up and if the atomic
operation condition holds we wake up val2
(i.e.
timeout
) waiter on the second futex.
The atomic operation takes two parameters encoded_op
and uaddr
. The
encoded operation encodes the operation itself, comparing value, operation
argument, and comparing argument. The pseudocode for the operation is like this
one:
oldval = *uaddr2 *uaddr2 = oldval OP oparg
And this is done atomically. First a copying in of the number at uaddr
is performed and the operation is done. The code
handles page faults and if no page fault occurs oldval
is compared to cmparg
argument with cmp
comparator.
Futex implementation uses two lock lists protecting sx_lock
and global locks (either Giant or another sx_lock
). Every operation is performed locked from the
start to the very end.
In this section I am going to describe some smaller syscalls that are worth mentioning because their implementation is not obvious or those syscalls are interesting from other point of view.
During development of Linux 2.6.16 kernel, the *at
syscalls were added. Those syscalls (openat
for
example) work exactly like their at-less counterparts with the slight
exception of the dirfd
parameter. This parameter
changes where the given file, on which the syscall is to be performed, is. When the
filename
parameter is absolute dirfd
is ignored but when the path to the file is relative,
it comes to the play. The dirfd
parameter is a
directory relative to which the relative pathname is checked. The dirfd
parameter is a file descriptor of some directory or
AT_FDCWD. So for example the openat
syscall can be like this:
file descriptor 123 = /tmp/foo/, current working directory = /tmp/ openat(123, /tmp/bah\, flags, mode) /* opens /tmp/bah */ openat(123, bah\, flags, mode) /* opens /tmp/foo/bah */ openat(AT_FDWCWD, bah\, flags, mode) /* opens /tmp/bah */ openat(stdio, bah\, flags, mode) /* returns error because stdio is not a directory */
This infrastructure is necessary to avoid races when opening files outside the
working directory. Imagine that a process consists of two threads, thread A and
thread B. Thread A issues open(./tmp/foo/bah., flags,
mode) and before returning it gets preempted and thread B runs. Thread B
does not care about the needs of thread A and renames or removes /tmp/foo/. We got a race. To avoid this we can open /tmp/foo and use it as dirfd
for
openat
syscall. This also enables user to implement
per-thread working directories.
Linux family of *at syscalls contains: linux_openat
, linux_mkdirat
,
linux_mknodat
, linux_fchownat
, linux_futimesat
, linux_fstatat64
, linux_unlinkat
, linux_renameat
, linux_linkat
,
linux_symlinkat
, linux_readlinkat
, linux_fchmodat
and linux_faccessat
. All these are implemented using the
modified namei(9) routine and
simple wrapping layer.
The implementation is done by altering the namei(9) routine
(described above) to take additional parameter dirfd
in its nameidata structure, which specifies the
starting point of the pathname lookup instead of using the current working
directory every time. The resolution of dirfd
from
file descriptor number to a vnode is done in native *at syscalls. When dirfd
is AT_FDCWD the dvp
entry in nameidata structure is
NULL but when dirfd
is a
different number we obtain a file for this file descriptor, check whether this file
is valid and if there is vnode attached to it then we get a vnode. Then we
check this vnode for being a directory. In the actual namei(9) routine we
simply substitute the dvp
vnode for dp
variable in the namei(9) function,
which determines the starting point. The namei(9) is not used
directly but via a trace of different functions on various levels. For
example the openat
goes like this:
openat() --> kern_openat() --> vn_open() -> namei()
For this reason kern_open
and vn_open
must be altered to incorporate the additional dirfd
parameter. No compat layer is created for those
because there are not many users of this and the users can be easily converted.
This general implementation enables FreeBSD to implement their own *at syscalls.
This is being discussed right now.
The ioctl interface is quite fragile due to its generality. We have to bear in
mind that devices differ between Linux and FreeBSD
so some care must be applied to do ioctl emulation work right. The ioctl handling
is implemented in linux_ioctl.c, where linux_ioctl
function is defined. This function simply
iterates over sets of ioctl handlers to find a handler that implements a given
command. The ioctl syscall has three parameters, the file descriptor, command and
an argument. The command is a 16-bit number, which in theory is divided into high
8 bits determining class of the ioctl command and low 8 bits, which are the
actual command within the given set. The emulation takes advantage of this
division. We implement handlers for each set, like sound_handler
or disk_handler
. Each handler has a maximum command and a
minimum command defined, which is used for determining what handler is used. There
are slight problems with this approach because Linux
does not use the set division consistently so sometimes ioctls for a different set
are inside a set they should not belong to (SCSI generic ioctls inside cdrom set,
etc.). FreeBSD currently does not implement many Linux ioctls (compared to NetBSD, for example) but the
plan is to port those from NetBSD. The trend is to use Linux ioctls even in the native FreeBSD drivers because of
the easy porting of applications.
Every syscall should be debuggable. For this purpose we introduce a small infrastructure. We have the ldebug facility, which tells whether a given syscall should be debugged (settable via a sysctl). For printing we have LMSG and ARGS macros. Those are used for altering a printable string for uniform debugging messages.