This section describes the machine-independent part of the Linuxulator. It covers the emulation infrastructure needed for Linux® 2.6 emulation, the thread local storage (TLS) implementation (on i386) and futexes. Then we talk briefly about some syscalls.
One of the major areas of progress in the development of Linux 2.6 was threading. Prior to 2.6, Linux threading support was implemented in the linuxthreads library. The library was a partial implementation of POSIX® threading. The threading was implemented using separate processes for each thread, using the clone syscall to let them share the address space (and other things). The main weaknesses of this approach were that every thread had a different PID, signal handling was broken (from the pthreads perspective), etc. The performance was also not very good (use of SIGUSR signals for thread synchronization, kernel resource consumption, etc.), so to overcome these problems a new threading system was developed and named NPTL.
The NPTL library focused on two things, but a third thing came along so it is usually considered a part of NPTL. Those two things were the embedding of threads into the process structure and futexes. The additional third thing was TLS, which is not directly required by NPTL but the whole NPTL userland library depends on it. Those improvements yielded much improved performance and standards conformance. NPTL is the standard threading library on Linux systems these days.
The FreeBSD Linuxulator implementation approaches NPTL in three main areas: TLS, futexes and PID mangling, which is meant to simulate Linux threads. Further sections describe each of these areas.
These sections deal with the way Linux threads are managed and how we simulate that in FreeBSD.
The Linux emulation layer in FreeBSD supports runtime setting of the emulated version. This is done via sysctl(8), namely compat.linux.osrelease, which is set to 2.4.2 by default (as of April 2007). With all Linux versions up to 2.6 it just determines what uname(1) outputs. It is different with 2.6 emulation, where setting this sysctl(8) affects the runtime behaviour of the emulation layer. When set to 2.6.x it sets the value of linux_use_linux26, while setting it to something else keeps it unset. This variable (plus per-prison variables of the very same kind) determines whether the 2.6 infrastructure (mainly PID mangling) is used in the code or not. The version setting is done system-wide and affects all Linux processes. The sysctl(8) should not be changed while any Linux binary is running as it might harm things.
The semantics of Linux threading are a little confusing and use entirely different nomenclature from FreeBSD. A process in Linux consists of a struct task embedding two identifier fields - PID and TGID. PID is not a process ID but a thread ID. The TGID identifies a thread group, in other words a process. For a single-threaded process the PID equals the TGID.
A thread in NPTL is just an ordinary process that happens to have a TGID not equal to its PID and a group leader not equal to itself (and a shared VM, etc. of course). Everything else happens in the same way as with an ordinary process. There is no separation of shared status into some external structure like in FreeBSD. This creates some duplication of information and possible data inconsistency. The Linux kernel seems to use the task -> group information in some places and the task information elsewhere; it is really not very consistent and looks error-prone.
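As a small userland illustration of this naming (a minimal sketch for a Linux system, not part of the Linuxulator): every thread reports the same getpid() (the TGID) but its own gettid() (the per-thread PID).

#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *
worker(void *arg)
{
	(void)arg;
	/* same getpid() (the TGID) as the leader, but a different gettid() */
	printf("thread: getpid() = %d, gettid() = %ld\n",
	    getpid(), (long)syscall(SYS_gettid));
	return (NULL);
}

int
main(void)
{
	pthread_t t;

	printf("leader: getpid() = %d, gettid() = %ld\n",
	    getpid(), (long)syscall(SYS_gettid));
	pthread_create(&t, NULL, worker, NULL);
	pthread_join(t, NULL);
	return (0);
}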
Every NPTL thread is created by a call to the clone syscall with a specific set of flags (more in the next subsection). The NPTL implements strict 1:1 threading.
In FreeBSD we emulate NPTL threads with ordinary FreeBSD processes that share VM space, etc., and the PID gymnastics is just mimicked in the emulation-specific structure attached to the process. The structure attached to the process looks like:
struct linux_emuldata {
  pid_t pid;
  int *child_set_tid; /* in clone(): Child's TID to set on clone */
  int *child_clear_tid;/* in clone(): Child's TID to clear on exit */
  struct linux_emuldata_shared *shared;
  int pdeath_signal; /* parent death signal */
  LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */
};
The PID is used to identify the FreeBSD process that attaches this structure. The child_set_tid and child_clear_tid are used for TID address copyout when a process exits and is created. The shared pointer points to a structure shared among threads. The pdeath_signal variable identifies the parent death signal and the threads pointer is used to link this structure into the list of threads. The linux_emuldata_shared structure looks like:
struct linux_emuldata_shared {
  int refs;
  pid_t group_pid;
  LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */
};
The refs is a reference counter used to determine when we can free the structure, to avoid memory leaks. The group_pid identifies the PID (= TGID) of the whole process (= thread group). The threads pointer is the head of the list of threads in the process.
The linux_emuldata structure can be obtained from the process using em_find. The prototype of the function is:
struct linux_emuldata *em_find(struct proc *, int locked);
Here, proc is the process we want the emuldata structure from and the locked parameter determines whether we want to lock it or not. The accepted values are EMUL_DOLOCK and EMUL_DOUNLOCK. More about locking later.
Because FreeBSD and Linux have different views of what a process ID and a thread ID are, we have to translate these views somehow. We do it by PID mangling. This means that we fake what a PID (= TGID) and a TID (= PID) are between the kernel and userland. The rule of thumb is that in the kernel (in the Linuxulator) PID = PID and TGID = shared->group_pid, while to userland we present PID = shared->group_pid and TID = proc->p_pid. The PID member of the linux_emuldata structure is a FreeBSD PID.
The above affects mainly the getpid, getppid and gettid syscalls, where we return the PID or TGID respectively. In the copyout of TIDs via child_clear_tid and child_set_tid we copy out the FreeBSD PID.
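A rough sketch (not the exact Linuxulator source) of how this mangling shows up in the getpid and gettid paths; linux26_enabled() and emul_unlock() are placeholder names for the real version check and unlock:

static int
linux_getpid_sketch(struct thread *td)
{
	struct linux_emuldata *em;

	if (linux26_enabled()) {
		em = em_find(td->td_proc, EMUL_DOLOCK);
		td->td_retval[0] = em->shared->group_pid;  /* PID = TGID */
		emul_unlock(em);
	} else
		td->td_retval[0] = td->td_proc->p_pid;
	return (0);
}

static int
linux_gettid_sketch(struct thread *td)
{
	/* the TID presented to userland is simply the FreeBSD PID */
	td->td_retval[0] = td->td_proc->p_pid;
	return (0);
}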
The clone syscall is the way threads are created in Linux. The syscall prototype looks like this:
int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy, void * child_tidptr);
The flags parameter tells the syscall how exactly the processes should be cloned. As described above, Linux can create processes sharing various things independently; for example two processes can share file descriptors but not VM, etc. The last byte of the flags parameter is the exit signal of the newly created process. The stack parameter, if non-NULL, tells where the thread stack is; if it is NULL we are supposed to copy-on-write the calling process stack (i.e. do what the normal fork(2) routine does). The parent_tidptr parameter is used as an address for copying out the process PID (i.e. thread ID) once the process is sufficiently instantiated but not runnable yet. The dummy parameter is here because of the very strange calling convention of this syscall on i386: it uses the registers directly and does not let the compiler do it, which results in the need for a dummy syscall. The child_tidptr parameter is used as an address for copying out the PID once the process has finished forking and when the process exits.
The syscall itself proceeds by setting the corresponding flags depending on the flags passed in. For example, CLONE_VM maps to RFMEM (sharing of VM), etc. The only nit here is CLONE_FS and CLONE_FILES, because FreeBSD does not allow setting these separately, so we fake it by not setting RFFDG (copying of the fd table and other fs information) if either of them is defined. This does not cause any problems, because those flags are always set together. After setting the flags the process is forked using the internal fork1 routine, and the process is instrumented not to be put on a run queue, i.e. not to be set runnable. After the forking is done we possibly reparent the newly created process to emulate CLONE_PARENT semantics. The next part is creating the emulation data. Threads in Linux do not signal their parents, so we set the exit signal to 0 to disable this. After that, the setting of child_set_tid and child_clear_tid is performed, enabling this functionality later in the code. At this point we copy out the PID to the address specified by parent_tidptr. The setting of the process stack is done by simply rewriting the thread frame %esp register (%rsp on amd64). The next part is setting up TLS for the newly created process. After this, vfork(2) semantics might be emulated and finally the newly created process is put on a run queue and its PID is copied out to the parent process via the clone return value.
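To make the flag translation concrete, here is a simplified sketch. The RF* flags are real rfork(2)/fork1() flags; the LINUX_CLONE_* names and the exact fork1() signature are assumptions based on the era described here and may differ from the actual sources:

struct proc *p2;
int error, ff;

ff = RFPROC | RFSTOPPED;		/* new process, not runnable yet */
if (flags & LINUX_CLONE_VM)
	ff |= RFMEM;			/* share the address space */
if (flags & LINUX_CLONE_SIGHAND)
	ff |= RFSIGSHARE;		/* share signal handlers */
if (!(flags & (LINUX_CLONE_FILES | LINUX_CLONE_FS)))
	ff |= RFFDG;			/* copy the descriptor table and fs info */

error = fork1(td, ff, 0, &p2);		/* p2 is left off the run queue */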
The clone syscall can be, and in fact is, used for emulating the classic fork(2) and vfork(2) syscalls. Newer glibc, in the case of a 2.6 kernel, uses clone to implement the fork(2) and vfork(2) syscalls.
The locking is implemented per-subsystem because we do not expect a lot of contention on these. There are two locks: emul_lock used to protect manipulation of linux_emuldata and emul_shared_lock used to protect manipulation of linux_emuldata_shared. The emul_lock is a nonsleepable blocking mutex while emul_shared_lock is a sleepable blocking sx_lock. Because of the per-subsystem locking we can coalesce some locks, which is why em_find offers non-locking access.
This section deals with TLS also known as thread local storage.
Threads in computer science are entities within a process that can be scheduled independently from each other. The threads in the process share process-wide data (file descriptors, etc.) but also have their own stack for their own data. Sometimes there is a need for process-wide data specific to a given thread - imagine the name of the thread in execution or something like that. The traditional UNIX® threading API, pthreads, provides a way to do it via pthread_key_create(3), pthread_setspecific(3) and pthread_getspecific(3), where a thread can create a key to the thread local data and use pthread_setspecific(3) and pthread_getspecific(3) to manipulate those data. You can easily see that this is not the most comfortable way this could be accomplished, so various producers of C/C++ compilers introduced a better way. They defined a new modifier keyword, __thread, that specifies that a variable is thread specific. A new method of accessing such variables was developed as well (at least on i386). The pthreads method tends to be implemented in userspace as a trivial lookup table, and the performance of such a solution is not very good. So the new method uses (on i386) segment registers to address a segment where the TLS area is stored, so the actual accessing of a thread variable is just appending the segment register to the address, thus addressing via it. The segment registers are usually %gs and %fs acting like segment selectors. Every thread has its own area where the thread local data are stored and the segment must be loaded on every context switch. This method is very fast and used almost exclusively in the whole i386 UNIX® world.
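For example, the two approaches look like this in C (pthread_key_create(3) must be called once before the key is used; the __thread variant compiles down to a %gs-/%fs-relative access):

#include <pthread.h>

/* pthreads API: explicit key management, a lookup through the library */
static pthread_key_t name_key;		/* pthread_key_create(&name_key, NULL)
					   must run once before first use */

static void
set_thread_name(char *name)
{
	pthread_setspecific(name_key, name);
}

static char *
get_thread_name(void)
{
	return (pthread_getspecific(name_key));
}

/* compiler TLS: just a storage-class modifier */
static __thread char *thread_name;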
Both FreeBSD and Linux implement this approach and it yields very good results. The only drawback is the need to reload the segment on every context switch, which can slow context switches down. FreeBSD tries to avoid this overhead by using only 1 segment descriptor for this while Linux uses 3. An interesting thing is that almost nothing uses more than 1 descriptor (only Wine seems to use 2), so Linux pays this unnecessary price for context switches.
The i386 architecture implements so-called segments. A segment is a description of an area of memory: the base address (bottom) of the memory area, the end of it (ceiling), type, protection, etc. The memory described by a segment can be accessed using the segment selector registers (%cs, %ds, %ss, %es, %fs, %gs). For example, let us suppose we have a segment with base address 0x1234 and a sufficient length, and this code:

mov %edx,%gs:0x10

This will load the content of the %edx register into the memory location 0x1244. Some segment registers have a special use; for example %cs is used for the code segment and %ss is used for the stack segment, but %fs and %gs are generally unused. Segments are either stored in a global GDT table or in a local LDT table. The LDT is accessed via an entry in the GDT. The LDT can store more types of segments. The LDT can be per process. Both tables define up to 8191 entries.
There are two main ways of setting up TLS in Linux. It can be set when cloning a process using the clone syscall, or the process can call set_thread_area. When a process passes the CLONE_SETTLS flag to clone, the kernel expects the memory pointed to by the %esi register to be a Linux userspace representation of a segment, which gets translated to the machine representation of a segment and loaded into a GDT slot. The GDT slot can be specified with a number, or -1 can be used, meaning that the system itself should choose the first free slot. In practice, the vast majority of programs use only one TLS entry and do not care about the number of the entry. We exploit this in the emulation and in fact depend on it.
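For illustration, this is roughly what an i386 Linux userland program (or the emulated libc) hands to the kernel; struct user_desc comes from the Linux <asm/ldt.h> header and entry_number = -1 requests autoselection of a GDT slot:

#include <asm/ldt.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* install a TLS block for the calling thread (i386 Linux) */
static int
install_tls(void *tls_block, unsigned long size)
{
	struct user_desc desc;

	memset(&desc, 0, sizeof(desc));
	desc.entry_number = -1;			/* let the kernel pick a free GDT slot */
	desc.base_addr = (unsigned long)tls_block;
	desc.limit = size;			/* limit in bytes (limit_in_pages = 0) */
	desc.seg_32bit = 1;
	desc.useable = 1;

	return (syscall(SYS_set_thread_area, &desc));
}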
Loading of TLS for the current thread happens by calling set_thread_area, while loading TLS for a second process in clone is done in a separate block in clone. Those two functions are very similar. The only difference is the actual loading of the GDT segment, which happens on the next context switch for the newly created process, while set_thread_area must load it directly. The code basically does this: it copies in the Linux form of the segment descriptor from userland. The code checks the number of the descriptor, but because this differs between FreeBSD and Linux we fake it a little. We only support indexes of 6, 3 and -1. The 6 is the genuine Linux number, 3 is the genuine FreeBSD one and -1 means autoselection. Then we set the descriptor number to the constant 3 and copy it out to userspace. We rely on the userspace process using the number from the descriptor, but this works most of the time (we have never seen a case where it did not) as the userspace process typically passes in -1. Then we convert the descriptor from the Linux form to the machine dependent form (i.e. the operating system independent form) and copy it to the FreeBSD defined segment descriptor. Finally we can load it. We assign the descriptor to the thread's PCB (process control block) and load the %gs segment using load_gs. This loading must be done in a critical section so that nothing can interrupt us. The CLONE_SETTLS case works exactly like this, except the loading using load_gs is not performed. The segment used for this (segment number 3) is shared between FreeBSD processes and Linux processes, so the Linux emulation layer does not add any overhead over plain FreeBSD.
The amd64 implementation is similar to the i386 one, but there was initially no 32-bit segment descriptor used for this purpose (hence not even native 32-bit TLS users worked), so we had to add such a segment and implement its loading on every context switch (when a flag signaling use of 32-bit mode is set). Apart from this, the TLS loading is exactly the same; only the segment numbers are different, and the descriptor format and the loading differ slightly.
Threads need some kind of synchronization, and POSIX provides some of it: mutexes for mutual exclusion, read-write locks for mutual exclusion with a biased ratio of reads and writes, and condition variables for signaling a status change. It is interesting to note that the POSIX threading API lacks support for semaphores. The implementations of those synchronization routines are heavily dependent on the type of threading support we have. In a pure 1:M (userspace) model the implementation can be done solely in userspace and thus be very fast (the condition variables will probably end up being implemented using signals, i.e. not fast) and simple. In a 1:1 model the situation is also quite clear - the threads must be synchronized using kernel facilities (which is very slow because a syscall must be performed). The mixed M:N scenario just combines the first and second approach or relies solely on the kernel. Thread synchronization is a vital part of thread-enabled programming and its performance can affect the resulting program a lot. Recent benchmarks on the FreeBSD operating system showed that an improved sx_lock implementation yielded a 40% speedup in ZFS (a heavy sx user); this is in-kernel stuff but it shows clearly how important the performance of synchronization primitives is.
Threaded programs should be written with as little contention on locks as possible; otherwise, instead of doing useful work the thread just waits on a lock. Because of this, most well written threaded programs show little lock contention.
Linux implements 1:1 threading, i.e. it has to use in-kernel synchronization primitives. As stated earlier, well written threaded programs have little lock contention, so a typical sequence could be performed as two atomic increase/decrease operations on the mutex reference counter, which is very fast, as presented by the following example:
pthread_mutex_lock(&mutex);
....
pthread_mutex_unlock(&mutex);
1:1 threading forces us to perform two syscalls for those mutex calls, which is very slow.
The solution Linux 2.6 implements is called futexes. Futexes implement the check for contention in userspace and call kernel primitives only in a case of contention. Thus the typical case takes place without any kernel intervention. This yields reasonably fast and flexible synchronization primitives implementation.
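A minimal userspace sketch (modeled on the well-known futex mutex design, not on glibc's actual code) of how the uncontended path avoids the kernel entirely:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters */
static void
futex_mutex_lock(atomic_int *m)
{
	int c = 0;

	/* fast path: 0 -> 1 entirely in userspace */
	if (atomic_compare_exchange_strong(m, &c, 1))
		return;

	/* slow path: mark the mutex contended and sleep in the kernel */
	if (c != 2)
		c = atomic_exchange(m, 2);
	while (c != 0) {
		syscall(SYS_futex, m, FUTEX_WAIT, 2, NULL, NULL, 0);
		c = atomic_exchange(m, 2);
	}
}

static void
futex_mutex_unlock(atomic_int *m)
{
	/* fast path: nobody was waiting; otherwise wake one waiter up */
	if (atomic_exchange(m, 0) == 2)
		syscall(SYS_futex, m, FUTEX_WAKE, 1, NULL, NULL, 0);
}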
The futex syscall looks like this:
int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);
In this example, uaddr is the address of the mutex in userspace, op is the operation we are about to perform and the other parameters have per-operation meaning.
Futexes implement the following operations:

FUTEX_WAIT - verifies that the value val is written at the address uaddr. If not, EWOULDBLOCK is returned; otherwise the thread is queued on the futex and suspended. If the timeout argument is non-zero it specifies the maximum time for the sleeping, otherwise the sleeping is infinite.

FUTEX_WAKE - takes the futex at uaddr and wakes up the first val threads queued on this futex.

FUTEX_FD - associates a file descriptor with a given futex.

FUTEX_REQUEUE - takes val threads queued on the futex at uaddr, wakes them up, then takes the next val2 threads and requeues them on the futex at uaddr2.

FUTEX_CMP_REQUEUE - does the same as FUTEX_REQUEUE but checks that val3 equals val first.

FUTEX_WAKE_OP - performs an atomic operation on val3 (which encodes another value) and uaddr. Then it wakes up val threads on the futex at uaddr and, if the atomic operation returned a positive number, it wakes up val2 threads on the futex at uaddr2.
The operations implemented in FUTEX_WAKE_OP:
FUTEX_OP_SET
FUTEX_OP_ADD
FUTEX_OP_OR
FUTEX_OP_AND
FUTEX_OP_XOR
Note: there is no val2 parameter in the futex prototype. The val2 is taken from the struct timespec *timeout parameter for the operations FUTEX_REQUEUE, FUTEX_CMP_REQUEUE and FUTEX_WAKE_OP.
The futex emulation in FreeBSD is taken from NetBSD and further extended by us. It is placed in linux_futex.c and linux_futex.h files. The futex structure looks like:
struct futex {
  void *f_uaddr;
  int f_refcount;
  LIST_ENTRY(futex) f_list;
  TAILQ_HEAD(lf_waiting_proc, waiting_proc) f_waiting_proc;
};
And the structure waiting_proc is:
struct waiting_proc {
  struct thread *wp_t;
  struct futex *wp_new_futex;
  TAILQ_ENTRY(waiting_proc) wp_list;
};
A futex is obtained using the futex_get function, which searches a linear list of futexes and returns the found one or creates a new futex. When releasing a futex from use we call the futex_put function, which decreases the reference counter of the futex; when the refcount reaches zero it is released.
When a futex queues a thread for sleeping, it creates a waiting_proc structure, puts this structure on the list inside the futex structure and then just performs a tsleep(9) to suspend the thread. The sleep can time out. After tsleep(9) returns (the thread was woken up or it timed out) the waiting_proc structure is removed from the list and destroyed. All this is done in the futex_sleep function. If we got woken up from futex_wake we have wp_new_futex set, so we sleep on it. This way the actual requeueing is done in this function.
Waking up a thread sleeping on a futex is performed in the futex_wake function. First in this function we mimic the strange Linux behaviour, where it wakes up N threads for all operations, the only exception being that the REQUEUE operations are performed on N+1 threads. But this usually does not make any difference as we are waking up all threads. Next in the function, in a loop, we wake up n threads; after this we check if there is a new futex for requeueing. If so, we requeue up to n2 threads on the new futex. This cooperates with futex_sleep.
The FUTEX_WAKE_OP operation is quite complicated. First we obtain two futexes at the addresses uaddr and uaddr2, then we perform the atomic operation using val3 and uaddr2. Then val waiters on the first futex are woken up and, if the atomic operation condition holds, we wake up val2 (i.e. timeout) waiters on the second futex.
The atomic operation takes two parameters, encoded_op and uaddr. The encoded operation encodes the operation itself, the comparing value, the operation argument, and the comparing argument. The pseudocode for the operation is like this one:

oldval = *uaddr2
*uaddr2 = oldval OP oparg

And this is done atomically. First a copy-in of the number at uaddr2 is performed and the operation is done. The code handles page faults; if no page fault occurs, oldval is compared to the cmparg argument with the cmp comparator.
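For illustration, assuming the encoding follows the layout of Linux's FUTEX_OP() macro (operation and comparison in the top byte, arguments in the lower bits), the decode and compare could look roughly like this; the real code performs the update atomically, handles page faults and also understands the FUTEX_OP_OPARG_SHIFT variant, which is ignored here:

#include <linux/futex.h>

static int
futex_atomic_op_sketch(int encoded_op, int *uaddr2)
{
	int op = (encoded_op >> 28) & 0xf;	/* FUTEX_OP_SET, _ADD, ... */
	int cmp = (encoded_op >> 24) & 0xf;	/* FUTEX_OP_CMP_EQ, _NE, ... */
	int oparg = (encoded_op >> 12) & 0xfff;
	int cmparg = encoded_op & 0xfff;
	int oldval = *uaddr2;

	switch (op) {
	case FUTEX_OP_SET:  *uaddr2 = oparg;		break;
	case FUTEX_OP_ADD:  *uaddr2 = oldval + oparg;	break;
	case FUTEX_OP_OR:   *uaddr2 = oldval | oparg;	break;
	case FUTEX_OP_ANDN: *uaddr2 = oldval & ~oparg;	break;	/* the "AND" op listed above */
	case FUTEX_OP_XOR:  *uaddr2 = oldval ^ oparg;	break;
	}

	/* the result decides whether the val2 waiters on uaddr2 are woken */
	switch (cmp) {
	case FUTEX_OP_CMP_EQ: return (oldval == cmparg);
	case FUTEX_OP_CMP_NE: return (oldval != cmparg);
	case FUTEX_OP_CMP_LT: return (oldval < cmparg);
	case FUTEX_OP_CMP_LE: return (oldval <= cmparg);
	case FUTEX_OP_CMP_GT: return (oldval > cmparg);
	case FUTEX_OP_CMP_GE: return (oldval >= cmparg);
	}
	return (0);
}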
The futex implementation uses two lock lists, protected by an sx_lock and global locks (either Giant or another sx_lock). Every operation is performed locked from the start to the very end.
In this section I am going to describe some smaller syscalls that are worth mentioning because their implementation is not obvious or because they are interesting from another point of view.
During the development of the Linux 2.6.16 kernel, the *at syscalls were added. Those syscalls (openat for example) work exactly like their at-less counterparts, with the slight exception of the dirfd parameter. This parameter changes where the given file, on which the syscall is to be performed, is looked up. When the filename parameter is absolute, dirfd is ignored, but when the path to the file is relative, it comes into play. The dirfd parameter is the directory relative to which the relative pathname is checked. The dirfd parameter is a file descriptor of some directory or AT_FDCWD. So for example the openat syscall can be used like this:
file descriptor 123 = /tmp/foo/, current working directory = /tmp/

openat(123, "/tmp/bah", flags, mode)	/* opens /tmp/bah */
openat(123, "bah", flags, mode)		/* opens /tmp/foo/bah */
openat(AT_FDCWD, "bah", flags, mode)	/* opens /tmp/bah */
openat(stdio, "bah", flags, mode)	/* returns error because stdio is not a directory */
This infrastructure is necessary to avoid races when opening files outside the working directory. Imagine that a process consists of two threads, thread A and thread B. Thread A issues open("/tmp/foo/bah", flags, mode) and before returning it gets preempted and thread B runs. Thread B does not care about the needs of thread A and renames or removes /tmp/foo/. We got a race. To avoid this we can open /tmp/foo and use it as dirfd for the openat syscall. This also enables the user to implement per-thread working directories.
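A small illustration of this pattern (hypothetical helper, error handling trimmed):

#include <fcntl.h>
#include <unistd.h>

/* open a name relative to a pinned /tmp/foo */
static int
open_in_foo(const char *name)
{
	int dirfd, fd;

	dirfd = open("/tmp/foo", O_RDONLY | O_DIRECTORY);	/* pin the directory */
	if (dirfd < 0)
		return (-1);
	/* even if another thread renames or removes /tmp/foo now,
	   dirfd still refers to the original directory */
	fd = openat(dirfd, name, O_RDONLY);
	close(dirfd);
	return (fd);
}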
The Linux family of *at syscalls contains: linux_openat, linux_mkdirat, linux_mknodat, linux_fchownat, linux_futimesat, linux_fstatat64, linux_unlinkat, linux_renameat, linux_linkat, linux_symlinkat, linux_readlinkat, linux_fchmodat and linux_faccessat. All these are implemented using the modified namei(9) routine and a simple wrapping layer.
The implementation is done by altering the namei(9) routine (described above) to take an additional parameter dirfd in its nameidata structure, which specifies the starting point of the pathname lookup instead of using the current working directory every time. The resolution of dirfd from a file descriptor number to a vnode is done in the native *at syscalls. When dirfd is AT_FDCWD the dvp entry in the nameidata structure is NULL, but when dirfd is a different number we obtain the file for this file descriptor, check whether this file is valid, and if a vnode is attached to it we get the vnode. Then we check this vnode for being a directory. In the actual namei(9) routine we simply substitute the dvp vnode for the dp variable in the namei(9) function, which determines the starting point. The namei(9) is not used directly but via a trace of different functions on various levels. For example the openat goes like this:
openat() --> kern_openat() --> vn_open() --> namei()
For this reason kern_open and vn_open must be altered to incorporate the additional dirfd parameter. No compat layer is created for those because there are not many users of this and the users can be easily converted. This general implementation enables FreeBSD to implement its own *at syscalls. This is being discussed right now.
The ioctl interface is quite fragile due to its generality. We have to bear in mind that devices differ between Linux and FreeBSD, so some care must be applied to make ioctl emulation work right. The ioctl handling is implemented in linux_ioctl.c, where the linux_ioctl function is defined. This function simply iterates over sets of ioctl handlers to find a handler that implements a given command. The ioctl syscall has three parameters: the file descriptor, the command and an argument. The command is a 16-bit number, which in theory is divided into high 8 bits determining the class of the ioctl command and low 8 bits, which are the actual command within the given set. The emulation takes advantage of this division. We implement handlers for each set, like sound_handler or disk_handler. Each handler has a maximum command and a minimum command defined, which is used for determining what handler is used. There are slight problems with this approach because Linux does not use the set division consistently, so sometimes ioctls for a different set are inside a set they should not belong to (SCSI generic ioctls inside the cdrom set, etc.). FreeBSD currently does not implement many Linux ioctls (compared to NetBSD, for example) but the plan is to port those from NetBSD. The trend is to use Linux ioctls even in the native FreeBSD drivers because of the easy porting of applications.
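A hedged sketch of such a handler set; the structure layout follows the description above, but the handler function and the MIN/MAX constant names are given as examples and may not match the sources exactly:

struct linux_ioctl_handler {
	int	(*func)(struct thread *td, struct linux_ioctl_args *args);
	int	low;		/* lowest command of this set */
	int	high;		/* highest command of this set */
};

static struct linux_ioctl_handler sound_handler = {
	linux_ioctl_sound,	/* handles the sound command range */
	LINUX_IOCTL_SOUND_MIN,
	LINUX_IOCTL_SOUND_MAX,
};

/* linux_ioctl() walks the registered handler sets and calls the one
   whose [low, high] range contains the command. */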
Every syscall should be debuggable. For this purpose we introduce a small infrastructure. We have the ldebug facility, which tells whether a given syscall should be debugged (settable via a sysctl). For printing we have LMSG and ARGS macros. Those are used for altering a printable string for uniform debugging messages.