seL4/manual/parts/threads.tex

%
% Copyright 2014, General Dynamics C4 Systems
%
% SPDX-License-Identifier: GPL-2.0-only
%

\chapter{\label{ch:threads}Threads and Execution}

\section{Threads}
\label{sec:threads}


seL4 provides threads to represent an execution context. On MCS configurations of
the kernel, scheduling contexts are used to manage processor time. Without MCS, processor
time is also represented by the thread abstraction.
A thread is represented in seL4 by its thread control block
object (\obj{TCB}).

With MCS, a scheduling context is represented by a scheduling context object
(\obj{SCO}), and threads cannot run unless they are bound to, or receive a
scheduling context.

\subsection{Thread control blocks}

Each \obj{TCB} has an associated CSpace (see
\autoref{ch:cspace}) and VSpace (see \autoref{ch:vspace}) which
may be shared with other threads. A \obj{TCB} may also have an IPC buffer
(see  \autoref{ch:ipc}), which is used to pass extra arguments during IPC
or kernel object invocation that do not fit in the architecture-defined message
registers. While it is not compulsory that a thread has an IPC buffer,
it will not be able to perform most kernel invocations, as they require
cap transfer.
Each thread belongs to exactly one security domain (see
\autoref{sec:domains}).
%FIXME: there is much more information held in the TCB!

\subsection{Thread Creation}
\label{sec:thread_creation}

Like other objects, \obj{TCB}s are created with the
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see
\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It
is configured by setting its CSpace and VSpace with the
\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace}
or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack pointer and instruction
pointer. The thread can then be activated either by setting the
\texttt{resume\_target} parameter in the \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true
or by separately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method. Both of these methods
place the thread in a runnable state.

In non-MCS configurations of the kernel, this will result in the thread immediately being added to
the scheduler. On the MCS kernel, the thread will only begin running if it has a
scheduling context object.

In a SMP configuration of the kernel, the thread will resume on the node
corresponding to the affinity of the thread. For non-MCS configurations, the
default thread affinity is the node the thread's \obj{TCB} object was created
on, and \apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity} can be used to
explicitly set the affinity. On MCS configurations, the affinity is derived
from the scheduling context object (see \autoref{sec:sc_creation}).

\subsection{Thread Deactivation}
\label{sec:thread_deactivation}

The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread.
Suspended threads can later be resumed.
Their suspended state can be retrieved with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods.
They can also be reconfigured and
reused or left suspended indefinitely if not needed. Threads will be
automatically suspended when the last capability to their \obj{TCB} is
deleted.
% an example of which is demonstrated in \nameref{ex:second_thread}.

\subsection{Thread Feature Flags}
\label{sec:thread_flags}
Specific features can be enabled or disabled on a per TCB basis with
\apifunc{seL4\_TCB\_SetFlags}{tcb_setflags}.

If access to the Floating Point Unit is not needed, it can be disabled
for individual threads to maximise context switching speed.

\subsection{Affinity}
\label{sec:thread_affinity}

It is architecture and platform specific, how an affinity value maps to a
specific node (core, hart) on a specific platform. There is no guarantee that
affinity values are compatible across different platforms.


\subsection{Scheduling}
\label{sec:sched}

seL4 uses a preemptive, tickless scheduler with 256 priority levels (0 --- 255).  All threads have
a maximum controlled priority (MCP) and a priority, the latter being the effective priority of the
thread.
When a thread modifies another thread's priority (including itself) it must provide a
thread capability from which to use the MCP from. Threads can only set priorities and MCPs
to be less than or equal to the provided thread's MCP.
The initial task starts with an MCP and priority as the highest priority in the system (\texttt{seL4\_MaxPrio}).
Thread priority and MCP can be
set with \apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams} and
\apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority},
\apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods.


Of threads eligible for scheduling, the highest priority thread in a runnable state is chosen.

Thread priority (structure \texttt{seL4\_PrioProps\_t}) consists of two values as follows:

\begin{description} \item[Priority] the priority a thread will be scheduled with.  \item[Maximum
controlled priority (MCP)] the highest priority a thread can set itself or another thread to.
\end{description}

\subsection{MCS Scheduling}

This section only applies to configurations with MCS enabled, where threads must have
a scheduling context object available in order to be admitted to the scheduler.

\subsection{Scheduling Contexts}
\label{sec:scheduling_contexts}

Access to CPU execution time is controlled through scheduling context objects.
Scheduling contexts are configured with a tuple of
\textit{budget}$(b)$ and \textit{period} $(p)$, both in microseconds, set by
\apifunc{seL4\_SchedControl\_Configure\_Flags}{schedcontrol_configureflags} (see \autoref{sec:sc_creation}).
The tuple $(b, p)$ forms an upper bound on the thread's execution -- the kernel will not permit a
thread to run for more than $b$ out of every $p$ microseconds. However, $\frac{b}{p}$ does not
represent a lower bound on execution, as a thread must have the highest or equal highest priority
of all runnable threads to be guaranteed to be scheduled at all, and the kernel does not conduct
an admission test. As a result the set of all parameters is not necessarily schedulable. If
multiple threads have budgets available concurrently they are scheduled first-in first-out, and
round-robin scheduling is applied once the budget is expired.

A scheduling context that is eligible to be picked by the scheduler, i.e has budget available, is
referred to as \emph{active}.  Budget charging and replenishment rules are different for round-robin
and sporadic threads.  For round-robin threads, the budget is charged each time the current node's
scheduling context is changed, until it is depleted and then refilled immediately.

Threads where $b == p$ are treated as round robin threads, where $b$ acts as a timeslice.
Otherwise the kernel uses \emph{sporadic servers} to enforce temporal isolation, which enforce the
property that $\frac{b}{p}$ cannot be exceeded for all possible $p$.
In theory, sporadic servers provide temporal isolation -- preventing threads from exceeding their allocated budget -- by using the following algorithm:

\begin{itemize}
\item When a thread starts executing at current time $T$, record $T_{s}$
\item When a thread stops executing (blocks or is preempted), schedule a replenishment at $T_{s} + p$ for the
amount of time consumed ($T - T_{s}$) and subtract that from the current replenishment being used.
\end{itemize}

seL4 implements this algorithm by maintaining an ordered list of sporadic replenishments --
\texttt{refills} for brevity -- in each scheduling context. Each replenishment contains a tuple
of the time it is eligible for use (\emph{rTime}) and the amount that replenishment is for
(\texttt{rAmount}).
While a thread is executing, it constantly drains the budget from the \texttt{rAmount} at the head of the
replenishment list. If the \texttt{rTime} is in the future, the thread bound to that
scheduling context is placed in a queue of threads waiting for more budget.

Round-robin threads are treated that same as sporadic threads except regarding one aspect: how the
budget is charged. Round-robin threads have two refills only, both of which are always ready to be
used. When a round-robin thread stops executing, budget is moved from the head to the tail
replenishment. Once the head budget is consumed, the thread is removed from the scheduling queue
for its priority and appended at the tail.

Sporadic threads behave differently depending on the amount of replenishments available, which
must be bounded. Developers have two options to configure the size of the replenishment list:

\begin{itemize}
\item The maximum number of refills in a single scheduling context is determined by the size
      of the scheduling context when created by \apifunc{seL4\_Untyped\_Retype}{untyped_retype}.
\item A per scheduling context parameter, \texttt{extra\_refills} that
limits the number of refills for that specific scheduling context. This value is added to the base
value of 2 and is limited by the size of the scheduling context.
\end{itemize}

Threads that have short execution times (e.g interrupt handlers) and are not frequently preempted
should have less refills, while longer running threads with long values of $b$ should have a higher
value. Threads bound to a scheduling context with 0 extra refills will run periodically -- tasks
that use their head replenishment, or call yield, will not be scheduled again until the start of
their next period.

Given the number of replenishments is limited, if a node's SC changes and the outgoing SC does not
have enough space to store the new replenishment, space is created by removing the current
replenishment which can result in preemption if the next replenishment is not yet available.
Scheduling contexts with a higher number of configured refills will consume closer
to their whole budget, as they can be preempted and switch threads more often without filling their
replenishment queue. However, the scheduling overhead will be higher as the replenishment list is
subject to fragmentation.

Whenever a thread is executing it consumes the budget from its current scheduling context.  The
system call \apifunc{seL4\_Yield}{sel4_yield} can be used to sacrifice any remaining budget and
block until the next replenishment is ready to be used.

Threads can be bound to scheduling contexts using \apifunc{seL4\_TCB\_Configure}{tcb_configure} or
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind}, both invocations have the same effect
although \apifunc{seL4\_TCB\_Configure}{tcb_configure} allows more thread fields to be set with only
one kernel entry.  When a thread is bound to a scheduling context, if it is in a runnable state and
the scheduling context is active, it will be added to the scheduler.

\subsection{Passive Threads} \label{sec:passive}

Threads can be unbound from a scheduling context with
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}.  This is distinct from
suspending a thread, in that threads that are blocked waiting in an endpoint or notification queue
will remain in the queue and can still receive messages and signals.  However, the unbound thread
will not be schedul\-able again until it receives a scheduling context.  Threads without scheduling
contexts are referred to as \emph{passive} threads, as they cannot execute without the action of
another thread.

\subsection{Scheduling Context Creation} \label{sec:sc_creation}

Like other objects, scheduling contexts are created from untyped memory using
\apifunc{seL4\_UntypedRetype}{untyped_retype}.  On creation, scheduling contexts are empty,
representing 0\% of CPU execution time.  To populate a scheduling context with parameters, one must
invoke the appropriate \obj{SchedControl} capability, which provides access to CPU time management
on a single node.  A scheduling control cap for each node is provided to the initial task at run
time.  Threads run on the node that their scheduling context is configured for.  Scheduling context
parameters can then be set and updated using
\apifunc{seL4\_SchedControl\_ConfigureFlags}{schedcontrol_configureflags}, which allows the budget and period
to be specified along with a bitwise OR'd set of the following flags.

\begin{description}

\item[seL4\_SchedContext\_Sporadic]: constrain the execution time only according to the
sporadic server algorithm rather than to a continuous constant bandwidth.

\end{description}

The kernel does not conduct any schedulability tests, as task admission is left to user-level policy
and can be conducted online or offline, statically or dynamically or not at all.

\subsection{Scheduling Context Donation}

In addition to explicitly binding and removing scheduling contexts through
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}, scheduling contexts can move
between threads over IPC.  Scheduling contexts are donated implicitly when the system calls
\apifunc{seL4\_Call}{sel4_call} and \apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} are used to
communicate with a passive thread.  When an active thread invokes an endpoint with
\apifunc{seL4\_Call}{sel4_call} and rendezvous with a passive thread, the active thread's scheduling
context is donated to the passive thread. The generated reply cap ensures that the callee is merely
borrowing the scheduling context: when the reply cap is consumed by a reply message being sent the
scheduling context will be returned to the caller.  If the reply cap is revoked, and the callee
holds the scheduling context, the scheduling context will be returned to the caller.  However, if in
a deep call chain and a reply cap in the middle of the call chain is revoked, such that the callee
does not possess the scheduling context, the thread will be removed from the call chain and the
scheduling context will remain where it is.  If the receiver does not provide a reply object to
track the donation in (i.e uses \apifunc{seL4\_Wait}{sel4_wait} instead of
\apifunc{seL4\_Recv}{sel4_recv} scheduling context donation will not occur but the message will be
delivered. The passive receiver will be set to inactive as it does not have a scheduling context.

Consider an example where thread A calls thread B which calls thread C.  If whilst C holds the
scheduling context, B's reply cap to A is revoked, then the scheduling context will remain with C.
However, a call chain will remain between A and C, such that if C's reply cap is revoked, or
invoked, the scheduling context will return to A.

\apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} can also result in scheduling context donation.
If the non-blocking send phase of the operation results in message delivery to a passive thread, the
scheduling context will be donated to that passive thread and the thread making the system call becomes passive on the
receiving endpoint in the receive phase.  No reply capability is generated, so there
is no guarantee that the scheduling context will return, which increases book keeping complexity but allows
for data-flow like architectures rather than remote-procedure calls. Note that \apifunc{seL4\_Call}{sel4_call}
does not guarantee the return of a scheduling context: this is an inherently trusted operation as the
server could never reply and return the scheduling context.

Scheduling contexts can also be bound to notification objects using
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and unbound using
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}.  If a signal is delivered to
a notification object with a passive thread blocked waiting on it, the passive thread will receive
the scheduling context that is bound to the notification object.  The scheduling context is returned
when the thread blocks on the notification object.  This feature allows for passive servers to use
notification binding (See \autoref{sec:notification-binding}).  If a scheduling context is bound to
both a notification object and a thread, the behaviour will be the same as for a passive server:
The scheduling context will be unbound from the thread when it blocks on the bound notification object.
This is useful when launching passive servers or handling timeout exceptions.

Scheduling contexts can be unbound from all objects (notification objects and TCBs that are bound or
have received a scheduling context through donation) using
\apifunc{seL4\_SchedContext\_Unbind}{schedcontext_unbind}.

Passive threads will run on the CPU node that the scheduling context was configured with, and will
be migrated on IPC.

\subsection{Scheduling algorithm} \label{sec:mcs-sched}

Threads are only eligible for scheduling if they have an active scheduling context.
Of threads eligible for scheduling, the highest priority thread in a runnable state is chosen.

Threads of sufficient maximum controlled priority and with possession of the
appropriate scheduling context capability can manipulate the scheduler and
implement user-level schedulers using IPC.

Scheduling contexts provide access to an upper bound on execution CPU time,
however when a thread executes is determined by thread priority.  Consequently,
access to CPU is a function of thread MCPs, scheduling contexts and the
\obj{SchedControl} capability.  The kernel will enforce that threads do not
exceed the budget in their scheduling context for any given period, and that
the highest priority thread will always run, however it is up to the system
designer to make sure the entire system is schedulable.


\subsection{Exceptions}
\label{sec:exceptions}

Each thread has two associated exception-handler endpoints, a \emph{standard}
exception handler and a \emph{timeout} exception handler, where the latter is MCS
only. If the thread causes
an exception, the kernel creates an IPC message with the relevant details and
sends this to the endpoint. This thread can then take the appropriate action.
Fault IPC messages are described in \autoref{sec:faults}.  Standard exception-handler
endpoints can be set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or
\apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams} methods while Timeout exception
handlers can be set with \apifunc{seL4\_TCB\_SetTimeoutEndpoint}{tcb_settimeoutendpoint} (MCS only).
With these methods, a
capability address for the exception handler can be associated with a thread.
This address is then used to lookup the handler endpoint, and the capability to
the endpoint is installed into the threads' kernel CNode.  For threads without
an exception handler, a null capability can be used, however the consequences
are different per exception handler type.  Before raising an exception the
handler capability is validated. The kernel does not perform another lookup,
but checks that the capability is an endpoint with the correct rights.

The exception endpoint must have Write and either Grant or GrantReply rights.
Replying to the exception message restarts the thread. For certain exception
types, the contents of the reply message may be used to set the values in the
registers of the thread being restarted.  See \autoref{sec:faults} for details.

\subsubsection{Standard Exceptions}

The standard exception handler is used when a fault is triggered by a thread
which cannot be recovered without action by another thread.  For example, if a
thread raises a fault due to an unmapped virtual memory page, the thread cannot
make any more progress until the page is mapped.  If a thread experiences a
fault that would trigger the standard exception handler while it is set to a
null capability, the kernel will pause the thread and it will not run again.
This is because without action by another thread, standard exceptions cannot be
recovered from.  Consequently threads without standard exception handlers
should be trusted not to fault at all.

Standard exception handlers can be passive, in which case they will run on the
scheduling context of the faulting thread.

\subsubsection{Timeout Exceptions (MCS Only)} \label{sec:timeout-exceptions}

Timeout faults are raised when a thread attempts to run but has no available
budget, and if that thread has a valid timeout exception handler capability.
The handling of timeout faults is not compulsory: if a thread does not have a
timeout fault handler, a fault will not be raised and the thread will continue
running when it's budget is replenished.  This allows temporally sensitive
threads to handle budget overruns while other threads may ignore them.

Timeout faults are registered per thread, which means that while clients may
not have a timeout fault handler, servers may, allowing single-threaded,
time-sensitive, passive servers to use a timeout exception handler to recover
from malicious or untrusted clients whose budget expires while the server is
completing the request.  Timeout fault handlers can access server reply objects
and reply with an error to the client, then reset the server to handle the next
client request.

If a reply message is sent to a nested server and a scheduling context without
available budget returned, another timeout fault will be generated if the
nested server also has a timeout fault handler.

\subsection{Message Layout of the Read-/Write-Registers Methods}
\label{sec:read_write_registers}

The registers of a thread can be read and written with the
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods.
For some registers, the kernel will silently mask certain bits or ranges of bits off, and force them to contain certain
values to ensure that they cannot be maliciously set to values that would compromise the running system, or to respect
values that the architecture specifications have mandated to be certain values.
The register contents are transferred via the IPC buffer.

\section{Faults}
\label{sec:faults}

A thread's actions may result in a fault. Faults are delivered to the
thread's exception handler so that it can take the appropriate action.
The fault type is specified in the message label and is one of:

\begin{itemize}
\item \texttt{seL4\_Fault\_CapFault}
\item \texttt{seL4\_Fault\_VMFault}
\item \texttt{seL4\_Fault\_UnknownSyscall}
\item \texttt{seL4\_Fault\_UserException}
\item \texttt{seL4\_Fault\_TimeoutFault}
\item \texttt{seL4\_Fault\_NullFault} (indicating no fault occurred and this is a normal IPC message)
\item \texttt{seL4\_Fault\_VGICMaintenence}
\item \texttt{seL4\_Fault\_VPPIEvent}
\item \texttt{seL4\_Fault\_VCPUFault}
\item \texttt{seL4\_Fault\_DebugException}
\end{itemize}

Faults are delivered in such a way as to imitate a Call from the faulting
thread. This means that to send a fault message the fault endpoint
must have Write and either Grant or GrantReply permissions. If this is not the
case, a double fault happens (generally the thread is simply suspended).

\subsection{Capability Faults}

Capability faults may occur in two places. Firstly, a capability fault
can occur when lookup of a capability referenced by a
\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call
failed (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on
invalid capabilities silently fail). In this case, the capability
on which the fault occurred may be the capability being invoked or an
extra capability passed in the \texttt{caps} field in the IPC buffer.

Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or \apifunc{seL4\_NBRecv}{sel4_nbrecv}
is called on a capability that does not exist, is not an endpoint or notification capability or does not have
receive permissions.

Replying to the fault IPC will restart the faulting thread. The contents of the
IPC message are given in \autoref{tbl:ipc_contents}.\\

\begin{table}[htb]
\noindent\begin{tabularx}{\textwidth}{XX}
\toprule
    \textbf{Meaning} & \textbf{ IPC buffer location} \\
\midrule
    Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\
    Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\
In receive phase (1 if the fault happened during a receive system call, 0
    otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\
Lookup failure description. As described in \autoref{sec:lookup_fail_desc} &
    \ipcbloc{seL4\_CapFault\_LookupFailureType} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:ipc_contents}Contents of an IPC message.}
\end{table}

\subsection{Unknown Syscall}
\label{sec:unknown-syscall}

This fault occurs when a thread executes a system call with a syscall
number that is unknown to seL4.
The register set
of the faulting thread is passed to the thread's exception handler so that it
may, for example, emulate the system call if a thread is being
virtualised.

Replying to the fault IPC allows the thread to be restarted
and/or the thread's register set to be modified. If the reply has
a label of zero, the thread will be restarted. Additionally, if the
message length is non-zero, the faulting thread's register set will be
updated. In this case, the number of
registers updated is controlled with the length field of the message
tag.

\subsection{User Exception}

User exceptions are used to deliver architecture-defined exceptions. For
example, such an exception could occur if a user thread attempted to
divide a number by zero.

Replying to the fault IPC allows the thread to be restarted
and/or the thread's register set to be modified. If the reply has
a label of zero, the thread will be restarted. Additionally, if the
message length is non-zero, the faulting thread's register set will be
updated. In this case, the number of
registers updated is controlled with the length field of the message
tag.

\subsection{Debug Exception: Breakpoints and Watchpoints}
\label{sec:debug_exceptions}

Debug exceptions are used to deliver trace and debug related events to threads.
Breakpoints, watchpoints, trace-events and instruction-performance sampling
events are examples. These events are supported for userspace threads when the kernel
is configured to include them (when CONFIG\_HARDWARE\_DEBUG\_API is set). The hardware
debugging extensions API is supported on the following subset of the platforms that the
kernel has been ported to:

\begin{itemize}
\item PC99: IA-32 and x86\_64
\item Sabrelite (i.MX6)
\item Jetson TegraK1
\item HiSilicon Hikey
\item Raspberry Pi 3
\item Odroid-X (Exynos4)
\item Xilinx zynq7000
\end{itemize}

Information on the available hardware debugging resources is presented in the form of the following constants:

\begin{description}
\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break
registers available, of all types available on the hardware platform. On the Arm
Cortex~A7 for example, there are 6 exclusive instruction breakpoint registers,
and 4 exclusive data watchpoint registers, for a total of 10 monitor registers.
On this platform therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10.
The instruction breakpoint registers will always be assigned the lower API-IDs,
and the data watchpoints will always be assigned following them.

Additionally, \texttt{seL4\_NumExclusiveBreakpoints}, \texttt{seL4\_NumExclusiveWatchpoints}
and \texttt{seL4\_NumDualFunctionMonitors}
are defined for each target platform to reflect the number of available
hardware breakpoints/watchpoints of a certain type.

\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on instruction execution. Currently this will be
set only on Arm platforms. The API-ID of the first exclusive breakpoint is given
in \texttt{seL4\_FirstBreakpoint}. If there are no instruction-break exclusive
registers, \texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstBreakpoint} will be set to -1.

\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers
capable of generating a fault \textbf{only} on data access. Currently this will be set only
on Arm platforms. The API-ID of the first exclusive watchpoint is given
in \texttt{seL4\_FirstWatchpoint}. If there are no data-break exclusive
registers, \texttt{seL4\_NumExclusiveWatchpoints} will be set to \texttt{0} and
\texttt{seL4\_FirstWatchpoint} will be set to -1.

\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers
capable of generating a fault on either type of access -- i.e, the register
supports both instruction and data breaks. Currently this will be set only on
x86 platforms. The API-ID of the first dual-function monitor is given
in \texttt{seL4\_FirstDualFunctionMonitor}. If there are no dual-function break
registers, \texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and
\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1.

\end{description}

\begin{table}[ht]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Value sent} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\
\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\
\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:debug_exception_result}Debug fault message layout. The
register API-ID is not returned in the fault message from the kernel on
single-step faults.}
\end{table}

\subsection{Debug Exception: Single-stepping}
\label{sec:single_stepping_debug_exception}

The kernel provides support for the use of hardware single-stepping of userspace
threads when configured to do so (when CONFIG\_HARDWARE\_DEBUG\_API is set). To
this end it exposes the invocation, \texttt{seL4\_TCB\_ConfigureSingleStepping}.

The caller is expected to select an API-ID that corresponds to
an instruction breakpoint, to use when setting up the single-stepping
functionality (i.e, API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1).
However, not all hardware platforms require an actual hardware breakpoint
register to provide single-stepping functionality. If the caller's hardware platform requires the
use of a hardware breakpoint register, it will use the breakpoint register given to it in \texttt{bp\_num},
and return \texttt{true} in \texttt{bp\_was\_consumed}. If the underlying platform does not need a hardware
breakpoint to provide single-stepping, seL4 will return \texttt{false} in \texttt{bp\_was\_consumed} and
leave \texttt{bp\_num} unchanged.

If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not
attempt to re-configure \texttt{bp\_num} for Breakpoint or Watchpoint usage until
the caller has disabled single-stepping and released that register, via a subsequent
call to \texttt{seL4\_TCB\_ConfigureSingleStepping}, or a fault-reply with
\texttt{n\_instr} being 0. Setting \texttt{num\_instructions} to \texttt{0}
\textbf{disables single stepping}.

On architectures that require an actual hardware registers to be configured for
single-stepping functionality, seL4 will restrict the number of registers that
can be configured as single-steppers, to one at any given time. The register that
is currently configured (if any) for single-stepping will be the implicit
\texttt{bp\_num} argument in a single-step debug fault reply.

% The description for skipping num_instructions had an off-by-one error
% I.e. num_instruction = 1 does not skip any instructions between stops
% num_instruction = 2 "skips" 1 instruction, but executes 2 instructions before stopping

The kernel's single-stepping, also supports executing a certain number of
instructions before delivering the single-step fault message. \texttt{Num\_instructions}
should be set to \texttt{1} when single-stepping, or any non-zero integer value to execute that many
instructions before resuming single-stepping. This execution-count can also be set in
the fault-reply to a single-step debug fault.

\begin{table}[h]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\
\midrule
\reg{Breakpoint instruction address} & \texttt{num\_instructions} to execute & \ipcbloc{IPCBuffer[0]} \\
\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.}
\end{table}

\subsection{Timeout Fault (MCS only)}
\label{sec:timeout-fault}

Timeout faults are raised when a thread consumes all of its budget and has a
timeout fault handler that is not a null capability.  They allow a timeout
exception handler to take some action to restore the thread, and deliver a
message containing the scheduling context data word, as well as the amount of
time consumed since the last timeout fault occurred on this scheduling context,
or since \apifunc{seL4\_SchedContext\_YieldTo}{schedcontext_yieldto} or
\apifunc{seL4\_SchedContext\_Consumed}{schedcontext_consumed} was last called.
Timeout exception handlers can reply to a temporal fault with the registers set
in the same format as outlined in \autoref{sec:read_write_registers}.

\begin{table}[htb] \noindent\begin{tabularx}{\textwidth}{XX} \toprule
    \textbf{Meaning} & \textbf{IPC buffer location} \\ \midrule Data word from
    the scheduling context object that the thread was running on when the fault
    occurred. & \ipcbloc{seL4\_TimeoutFault\_Data} \\ Upper 32-bits of
    microseconds consumed since last reset &
    \ipcbloc{seL4\_TimeoutFault\_Consumed} \\ Lower 32-bits of microseconds
    consumed since last reset & \ipcbloc{seL4\_TimeoutFault\_Consumed\_LowBits}
    \\ \bottomrule \end{tabularx}\\ \\ \caption{\label{tbl:tf_message_32}
    Timeout fault outcome on 32-bit architectures.} \end{table}

% TODO: describe outcome for 64-bit architectures

\subsection{VM Fault}
\label{sec:vm-fault}

The thread caused a page fault. Replying to the fault IPC will restart
the thread. The contents of the IPC message are given below.\\

\begin{table}[htb]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\
Address that caused the fault. & \ipcbloc{seL4\_VMFault\_Addr} \\
    Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault}  \\
Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent. & \ipcbloc{seL4\_VMFault\_FSR} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:vm_fault_result_arm} VM Fault outcome on all architectures.}
\end{table}


\subsection{Arm Virtualisation Faults}
\label{sec:arm-virt-fault}

Arm with hypervisor support enabled can generate additional exceptions, see \autoref{sec:virt-arm}.
Replying to the fault IPC will restart the VCPU thread.
The contents of the IPC messages are given below.

\begin{table}[h!]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    List Register index, -1 when unknown. & \ipcbloc{seL4\_VGICMaintenance\_IDX} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:fault_arm_vgic} seL4\_Fault\_VGICMaintenance.}
\end{table}

\begin{table}[h!]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Virtual PPI IRQ number. & \ipcbloc{seL4\_VPPIEvent\_IRQ} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:fault_arm_vppi} seL4\_Fault\_VPPIEvent.}
\end{table}

\begin{table}[h!]
\begin{tabularx}{\textwidth}{XXX}
\toprule
\textbf{Meaning} & \textbf{IPC buffer location} \\
\midrule
    Register value of HSR for aarch32 and ESR for aarch64. & \ipcbloc{seL4\_VCPUFault\_HSR} \\
\bottomrule
\end{tabularx}
\caption{\label{tbl:fault_arm_vcpu} seL4\_Fault\_VCPUFault.}
\end{table}


\section{Domains}
\label{sec:domains}

Domains are used to isolate independent subsystems, so as to limit
information flow between them.
The kernel switches between domains according to a fixed, time-triggered
schedule.
The fixed schedule is compiled into the kernel via the constant
\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}.

A thread belongs to exactly one domain, and will only run when that domain
is active.
The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain
of a thread.
The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap.
The initial thread starts with a \obj{Domain} cap (see
\autoref{sec:messageinfo}).

\section{Virtualisation}
\label{sec:virt}

Hardware execution virtualisation is supported on specific arm and x86 platforms. The interface is exposed through a series
of kernel objects, invocations and syscalls that allow the user to take advantage of hardware
virtualisation features.

Hardware virtualisation allows for a thread to perform instructions and operations as if it were
running at a higher privilege level. As higher privilege levels typically have access to
additional machine registers and other pieces of state a \obj{VCPU} object is introduced to act
as storage for this state. For simplicity we refer to this virtualised higher privileged level as
'guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order
to provide a thread with this ability to run in higher privilege mode. See the section on
Arm or x86 for more precise details.

\obj{VCPU} objects also have additional, architecture specific, invocations for manipulating
the additional state or other virtualisation controls provided by the hardware. Binding of
a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}.

The provided objects and invocations are, generally speaking, the thinnest possible shim over
the underlying hardware primitives and operations. As a result an in depth familiarity with
the underlying architecture specific hardware mechanisms is required to use these objects, and
such familiarity is therefore assumed in description.

\subsection{Arm}
\label{sec:virt-arm}

When a \obj{TCB} has a bound \obj{VCPU} it will have access to (virtualised) system registers,
cache and TLB maintenance instructions and be able to handle some exceptions itself.
The virtual GIC will be enabled, allowing virtual interrupt delivery.

The virtualised system registers can be modified with \apifunc{seL4\_ARM\_VCPU\_WriteRegs}{arm_vcpu_write_registers}.
By configuring the mode portion of the \texttt{SPSR\_EL1} or \texttt{cpsr} register,
for ARMv8 and ARMv7 respectively, the thread can run in guest kernel mode.

Interrupts are virtualised through the virtual GIC and need to be forwarded with
\apifunc{seL4\_ARM\_VCPU\_InjectIRQ}{arm_vcpu_inject_irq}, which provides a way to manage
Virtual GIC List Registers, a queue of pending IRQs to be delivered to the guest. To help
with managing the list, the Virtual GIC will send GIC maintenance interrupts, which are
delivered as VGIC Maintenance Faults. List Register state is saved and restored on \obj{VCPU}
context switch, but there is currently no way to do that manually.

Shared Peripheral Interrupts (SPIs) can be handled like any normal IRQs and forwarded as required.

Virtual Private Peripheral Interrupts (PPI) are trapped and delivered as VPPI Event faults and need
to be acknowledged with \apifunc{seL4\_ARM\_VCPU\_AckVPPI}{arm_vcpu_acknowledge_virtual_ppi_irq}.

In addition to the above and standard exceptions, others are delivered as VCPU Faults.

Stage 2 translation is enabled when the kernel supports virtualisation. \obj{VCPUs} will have control
over stage 1 translations and stage 2 translations will be used for the rest of the system.
As stage 2 translations use VMIDs instead of ASIDs to distinguish address spaces, VMIDs will be used
to implement seL4 ASIDs. Practically this means that there is an ASID limit of 256 for all threads,
until 16-bit VMIDs are supported. If more ASIDs are needed, ASIDs will be dynamically re-used, with
the associated cache flushing and slowdowns.

\subsection{x86}

A \obj{TCB} with a bound \obj{VCPU} has two execution modes; one is the original thread just as
if there was no bound \obj{VCPU}, and the other is the guest mode execution using the
\obj{VCPU}. Switching from regular execution mode into the guest execution mode is
done by using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever
it is scheduled thereafter, to execute using the higher privileged mode controlled by the \obj{VCPU}.
Should the guest execution mode generate any kind of fault, or if a message arrives
on the \obj{TCB}s bound notification, the \obj{TCB} will be switched back to regular mode
and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return.

\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs}
and \apifunc{seL4\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations.
These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions and the kernel
merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured
to run in such a way as to violate any kernel properties. For example, it is not possible to
disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving
timer interrupts and allow the thread to monopolise CPU time.

Memory access of the guest execution mode is controlled by requiring the use of Extended
Page Tables (EPT). A series of EPT related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT})
exist and are manipulated in exactly the same manner as the objects for the regular virtual
address space. Once constructed a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot},
which serves as the VSpace root when executing in guest mode, with the VSpace root set
with \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure}
continuing to provide translation when the TCB is executing in its normal mode.

Direct access to I/O ports can be given to the privileged execution mode through the
\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation and allows the provided I/O port capability to be
linked to the VCPU, and a subset of its I/O port range to be made accessible to the \obj{VCPU}.
Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport}
invocation and a second invocation will undo the previous one. The link also means that
if the I/O port capability is deleted for any reason the access will be correspondingly removed
from the \obj{VCPU}.