forked from Imagelibrary/seL4
Add flags to tcb_t and the seL4_TCBFlag_fpuDisabled flag. Enums are signed, make TCB flags word_t to make it unsigned. Signed-off-by: Indan Zupancic <indan@nul.nu> Signed-off-by: Corey Lewis <corey.lewis@proofcraft.systems>
789 lines
42 KiB
TeX
789 lines
42 KiB
TeX
%
|
|
% Copyright 2014, General Dynamics C4 Systems
|
|
%
|
|
% SPDX-License-Identifier: GPL-2.0-only
|
|
%
|
|
|
|
\chapter{\label{ch:threads}Threads and Execution}
|
|
|
|
\section{Threads}
|
|
\label{sec:threads}
|
|
|
|
|
|
seL4 provides threads to represent an execution context. On MCS configurations of
|
|
the kernel, scheduling contexts are used to manage processor time. Without MCS, processor
|
|
time is also represented by the thread abstraction.
|
|
A thread is represented in seL4 by its thread control block
|
|
object (\obj{TCB}).
|
|
|
|
With MCS, a scheduling context is represented by a scheduling context object
|
|
(\obj{SCO}), and threads cannot run unless they are bound to, or receive a
|
|
scheduling context.
|
|
|
|
\subsection{Thread control blocks}
|
|
|
|
Each \obj{TCB} has an associated CSpace (see
|
|
\autoref{ch:cspace}) and VSpace (see \autoref{ch:vspace}) which
|
|
may be shared with other threads. A \obj{TCB} may also have an IPC buffer
|
|
(see \autoref{ch:ipc}), which is used to pass extra arguments during IPC
|
|
or kernel object invocation that do not fit in the architecture-defined message
|
|
registers. While it is not compulsory that a thread has an IPC buffer,
|
|
it will not be able to perform most kernel invocations, as they require
|
|
cap transfer.
|
|
Each thread belongs to exactly one security domain (see
|
|
\autoref{sec:domains}).
|
|
%FIXME: there is much more information held in the TCB!
|
|
|
|
\subsection{Thread Creation}
|
|
\label{sec:thread_creation}
|
|
|
|
Like other objects, \obj{TCB}s are created with the
|
|
\apifunc{seL4\_Untyped\_Retype}{untyped_retype} method (see
|
|
\autoref{sec:kernmemalloc}). A newly created thread is initially inactive. It
|
|
is configured by setting its CSpace and VSpace with the
|
|
\apifunc{seL4\_TCB\_SetSpace}{tcb_setspace}
|
|
or \apifunc{seL4\_TCB\_Configure}{tcb_configure} methods and then calling
|
|
\apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} with an initial stack pointer and instruction
|
|
pointer. The thread can then be activated either by setting the
|
|
\texttt{resume\_target} parameter in the \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} invocation to true
|
|
or by separately calling the \apifunc{seL4\_TCB\_Resume}{tcb_resume} method. Both of these methods
|
|
place the thread in a runnable state.
|
|
|
|
In non-MCS configurations of the kernel, this will result in the thread immediately being added to
|
|
the scheduler. On the MCS kernel, the thread will only begin running if it has a
|
|
scheduling context object.
|
|
|
|
In a SMP configuration of the kernel, the thread will resume on the node
|
|
corresponding to the affinity of the thread. For non-MCS configurations, the
|
|
default thread affinity is the node the thread's \obj{TCB} object was created
|
|
on, and \apifunc{seL4\_TCB\_SetAffinity}{tcb_setaffinity} can be used to
|
|
explicitly set the affinity. On MCS configurations, the affinity is derived
|
|
from the scheduling context object (see \autoref{sec:sc_creation}).
|
|
|
|
\subsection{Thread Deactivation}
|
|
\label{sec:thread_deactivation}
|
|
|
|
The \apifunc{seL4\_TCB\_Suspend}{tcb_suspend} method deactivates a thread.
|
|
Suspended threads can later be resumed.
|
|
Their suspended state can be retrieved with the
|
|
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and
|
|
\apifunc{seL4\_TCB\_CopyRegisters}{tcb_copyregisters} methods.
|
|
They can also be reconfigured and
|
|
reused or left suspended indefinitely if not needed. Threads will be
|
|
automatically suspended when the last capability to their \obj{TCB} is
|
|
deleted.
|
|
% an example of which is demonstrated in \nameref{ex:second_thread}.
|
|
|
|
\subsection{Thread Feature Flags}
|
|
\label{sec:thread_flags}
|
|
Specific features can be enabled or disabled on a per TCB basis with
|
|
\apifunc{seL4\_TCB\_SetFlags}{tcb_setflags}.
|
|
|
|
If access to the Floating Point Unit is not needed, it can be disabled
|
|
for individual threads to maximise context switching speed.
|
|
|
|
\subsection{Affinity}
|
|
\label{sec:thread_affinity}
|
|
|
|
It is architecture and platform specific, how an affinity value maps to a
|
|
specific node (core, hart) on a specific platform. There is no guarantee that
|
|
affinity values are compatible across different platforms.
|
|
|
|
|
|
\subsection{Scheduling}
|
|
\label{sec:sched}
|
|
|
|
seL4 uses a preemptive, tickless scheduler with 256 priority levels (0 --- 255). All threads have
|
|
a maximum controlled priority (MCP) and a priority, the latter being the effective priority of the
|
|
thread.
|
|
When a thread modifies another thread's priority (including itself) it must provide a
|
|
thread capability from which to use the MCP from. Threads can only set priorities and MCPs
|
|
to be less than or equal to the provided thread's MCP.
|
|
The initial task starts with an MCP and priority as the highest priority in the system (\texttt{seL4\_MaxPrio}).
|
|
Thread priority and MCP can be
|
|
set with \apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams} and
|
|
\apifunc{seL4\_TCB\_SetPriority}{tcb_setpriority},
|
|
\apifunc{seL4\_TCB\_SetMCPriority}{tcb_setmcpriority} methods.
|
|
|
|
|
|
Of threads eligible for scheduling, the highest priority thread in a runnable state is chosen.
|
|
|
|
Thread priority (structure \texttt{seL4\_PrioProps\_t}) consists of two values as follows:
|
|
|
|
\begin{description} \item[Priority] the priority a thread will be scheduled with. \item[Maximum
|
|
controlled priority (MCP)] the highest priority a thread can set itself or another thread to.
|
|
\end{description}
|
|
|
|
\subsection{MCS Scheduling}
|
|
|
|
This section only applies to configurations with MCS enabled, where threads must have
|
|
a scheduling context object available in order to be admitted to the scheduler.
|
|
|
|
\subsection{Scheduling Contexts}
|
|
\label{sec:scheduling_contexts}
|
|
|
|
Access to CPU execution time is controlled through scheduling context objects.
|
|
Scheduling contexts are configured with a tuple of
|
|
\textit{budget}$(b)$ and \textit{period} $(p)$, both in microseconds, set by
|
|
\apifunc{seL4\_SchedControl\_Configure\_Flags}{schedcontrol_configureflags} (see \autoref{sec:sc_creation}).
|
|
The tuple $(b, p)$ forms an upper bound on the thread's execution -- the kernel will not permit a
|
|
thread to run for more than $b$ out of every $p$ microseconds. However, $\frac{b}{p}$ does not
|
|
represent a lower bound on execution, as a thread must have the highest or equal highest priority
|
|
of all runnable threads to be guaranteed to be scheduled at all, and the kernel does not conduct
|
|
an admission test. As a result the set of all parameters is not necessarily schedulable. If
|
|
multiple threads have budgets available concurrently they are scheduled first-in first-out, and
|
|
round-robin scheduling is applied once the budget is expired.
|
|
|
|
A scheduling context that is eligible to be picked by the scheduler, i.e has budget available, is
|
|
referred to as \emph{active}. Budget charging and replenishment rules are different for round-robin
|
|
and sporadic threads. For round-robin threads, the budget is charged each time the current node's
|
|
scheduling context is changed, until it is depleted and then refilled immediately.
|
|
|
|
Threads where $b == p$ are treated as round robin threads, where $b$ acts as a timeslice.
|
|
Otherwise the kernel uses \emph{sporadic servers} to enforce temporal isolation, which enforce the
|
|
property that $\frac{b}{p}$ cannot be exceeded for all possible $p$.
|
|
In theory, sporadic servers provide temporal isolation -- preventing threads from exceeding their allocated budget -- by using the following algorithm:
|
|
|
|
\begin{itemize}
|
|
\item When a thread starts executing at current time $T$, record $T_{s}$
|
|
\item When a thread stops executing (blocks or is preempted), schedule a replenishment at $T_{s} + p$ for the
|
|
amount of time consumed ($T - T_{s}$) and subtract that from the current replenishment being used.
|
|
\end{itemize}
|
|
|
|
seL4 implements this algorithm by maintaining an ordered list of sporadic replenishments --
|
|
\texttt{refills} for brevity -- in each scheduling context. Each replenishment contains a tuple
|
|
of the time it is eligible for use (\emph{rTime}) and the amount that replenishment is for
|
|
(\texttt{rAmount}).
|
|
While a thread is executing, it constantly drains the budget from the \texttt{rAmount} at the head of the
|
|
replenishment list. If the \texttt{rTime} is in the future, the thread bound to that
|
|
scheduling context is placed in a queue of threads waiting for more budget.
|
|
|
|
Round-robin threads are treated that same as sporadic threads except regarding one aspect: how the
|
|
budget is charged. Round-robin threads have two refills only, both of which are always ready to be
|
|
used. When a round-robin thread stops executing, budget is moved from the head to the tail
|
|
replenishment. Once the head budget is consumed, the thread is removed from the scheduling queue
|
|
for its priority and appended at the tail.
|
|
|
|
Sporadic threads behave differently depending on the amount of replenishments available, which
|
|
must be bounded. Developers have two options to configure the size of the replenishment list:
|
|
|
|
\begin{itemize}
|
|
\item The maximum number of refills in a single scheduling context is determined by the size
|
|
of the scheduling context when created by \apifunc{seL4\_Untyped\_Retype}{untyped_retype}.
|
|
\item A per scheduling context parameter, \texttt{extra\_refills} that
|
|
limits the number of refills for that specific scheduling context. This value is added to the base
|
|
value of 2 and is limited by the size of the scheduling context.
|
|
\end{itemize}
|
|
|
|
Threads that have short execution times (e.g interrupt handlers) and are not frequently preempted
|
|
should have less refills, while longer running threads with long values of $b$ should have a higher
|
|
value. Threads bound to a scheduling context with 0 extra refills will run periodically -- tasks
|
|
that use their head replenishment, or call yield, will not be scheduled again until the start of
|
|
their next period.
|
|
|
|
Given the number of replenishments is limited, if a node's SC changes and the outgoing SC does not
|
|
have enough space to store the new replenishment, space is created by removing the current
|
|
replenishment which can result in preemption if the next replenishment is not yet available.
|
|
Scheduling contexts with a higher number of configured refills will consume closer
|
|
to their whole budget, as they can be preempted and switch threads more often without filling their
|
|
replenishment queue. However, the scheduling overhead will be higher as the replenishment list is
|
|
subject to fragmentation.
|
|
|
|
Whenever a thread is executing it consumes the budget from its current scheduling context. The
|
|
system call \apifunc{seL4\_Yield}{sel4_yield} can be used to sacrifice any remaining budget and
|
|
block until the next replenishment is ready to be used.
|
|
|
|
Threads can be bound to scheduling contexts using \apifunc{seL4\_TCB\_Configure}{tcb_configure} or
|
|
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind}, both invocations have the same effect
|
|
although \apifunc{seL4\_TCB\_Configure}{tcb_configure} allows more thread fields to be set with only
|
|
one kernel entry. When a thread is bound to a scheduling context, if it is in a runnable state and
|
|
the scheduling context is active, it will be added to the scheduler.
|
|
|
|
\subsection{Passive Threads} \label{sec:passive}
|
|
|
|
Threads can be unbound from a scheduling context with
|
|
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}. This is distinct from
|
|
suspending a thread, in that threads that are blocked waiting in an endpoint or notification queue
|
|
will remain in the queue and can still receive messages and signals. However, the unbound thread
|
|
will not be schedul\-able again until it receives a scheduling context. Threads without scheduling
|
|
contexts are referred to as \emph{passive} threads, as they cannot execute without the action of
|
|
another thread.
|
|
|
|
\subsection{Scheduling Context Creation} \label{sec:sc_creation}
|
|
|
|
Like other objects, scheduling contexts are created from untyped memory using
|
|
\apifunc{seL4\_UntypedRetype}{untyped_retype}. On creation, scheduling contexts are empty,
|
|
representing 0\% of CPU execution time. To populate a scheduling context with parameters, one must
|
|
invoke the appropriate \obj{SchedControl} capability, which provides access to CPU time management
|
|
on a single node. A scheduling control cap for each node is provided to the initial task at run
|
|
time. Threads run on the node that their scheduling context is configured for. Scheduling context
|
|
parameters can then be set and updated using
|
|
\apifunc{seL4\_SchedControl\_ConfigureFlags}{schedcontrol_configureflags}, which allows the budget and period
|
|
to be specified along with a bitwise OR'd set of the following flags.
|
|
|
|
\begin{description}
|
|
|
|
\item[seL4\_SchedContext\_Sporadic]: constrain the execution time only according to the
|
|
sporadic server algorithm rather than to a continuous constant bandwidth.
|
|
|
|
\end{description}
|
|
|
|
The kernel does not conduct any schedulability tests, as task admission is left to user-level policy
|
|
and can be conducted online or offline, statically or dynamically or not at all.
|
|
|
|
\subsection{Scheduling Context Donation}
|
|
|
|
In addition to explicitly binding and removing scheduling contexts through
|
|
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and
|
|
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}, scheduling contexts can move
|
|
between threads over IPC. Scheduling contexts are donated implicitly when the system calls
|
|
\apifunc{seL4\_Call}{sel4_call} and \apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} are used to
|
|
communicate with a passive thread. When an active thread invokes an endpoint with
|
|
\apifunc{seL4\_Call}{sel4_call} and rendezvous with a passive thread, the active thread's scheduling
|
|
context is donated to the passive thread. The generated reply cap ensures that the callee is merely
|
|
borrowing the scheduling context: when the reply cap is consumed by a reply message being sent the
|
|
scheduling context will be returned to the caller. If the reply cap is revoked, and the callee
|
|
holds the scheduling context, the scheduling context will be returned to the caller. However, if in
|
|
a deep call chain and a reply cap in the middle of the call chain is revoked, such that the callee
|
|
does not possess the scheduling context, the thread will be removed from the call chain and the
|
|
scheduling context will remain where it is. If the receiver does not provide a reply object to
|
|
track the donation in (i.e uses \apifunc{seL4\_Wait}{sel4_wait} instead of
|
|
\apifunc{seL4\_Recv}{sel4_recv} scheduling context donation will not occur but the message will be
|
|
delivered. The passive receiver will be set to inactive as it does not have a scheduling context.
|
|
|
|
Consider an example where thread A calls thread B which calls thread C. If whilst C holds the
|
|
scheduling context, B's reply cap to A is revoked, then the scheduling context will remain with C.
|
|
However, a call chain will remain between A and C, such that if C's reply cap is revoked, or
|
|
invoked, the scheduling context will return to A.
|
|
|
|
\apifunc{seL4\_NBSendRecv}{sel4_nbsendrecv} can also result in scheduling context donation.
|
|
If the non-blocking send phase of the operation results in message delivery to a passive thread, the
|
|
scheduling context will be donated to that passive thread and the thread making the system call becomes passive on the
|
|
receiving endpoint in the receive phase. No reply capability is generated, so there
|
|
is no guarantee that the scheduling context will return, which increases book keeping complexity but allows
|
|
for data-flow like architectures rather than remote-procedure calls. Note that \apifunc{seL4\_Call}{sel4_call}
|
|
does not guarantee the return of a scheduling context: this is an inherently trusted operation as the
|
|
server could never reply and return the scheduling context.
|
|
|
|
Scheduling contexts can also be bound to notification objects using
|
|
\apifunc{seL4\_SchedContext\_Bind}{schedcontext_bind} and unbound using
|
|
\apifunc{seL4\_SchedContext\_UnbindObject}{schedcontext_unbindobject}. If a signal is delivered to
|
|
a notification object with a passive thread blocked waiting on it, the passive thread will receive
|
|
the scheduling context that is bound to the notification object. The scheduling context is returned
|
|
when the thread blocks on the notification object. This feature allows for passive servers to use
|
|
notification binding (See \autoref{sec:notification-binding}). If a scheduling context is bound to
|
|
both a notification object and a thread, the behaviour will be the same as for a passive server:
|
|
The scheduling context will be unbound from the thread when it blocks on the bound notification object.
|
|
This is useful when launching passive servers or handling timeout exceptions.
|
|
|
|
Scheduling contexts can be unbound from all objects (notification objects and TCBs that are bound or
|
|
have received a scheduling context through donation) using
|
|
\apifunc{seL4\_SchedContext\_Unbind}{schedcontext_unbind}.
|
|
|
|
Passive threads will run on the CPU node that the scheduling context was configured with, and will
|
|
be migrated on IPC.
|
|
|
|
\subsection{Scheduling algorithm} \label{sec:mcs-sched}
|
|
|
|
Threads are only eligible for scheduling if they have an active scheduling context.
|
|
Of threads eligible for scheduling, the highest priority thread in a runnable state is chosen.
|
|
|
|
Threads of sufficient maximum controlled priority and with possession of the
|
|
appropriate scheduling context capability can manipulate the scheduler and
|
|
implement user-level schedulers using IPC.
|
|
|
|
Scheduling contexts provide access to an upper bound on execution CPU time,
|
|
however when a thread executes is determined by thread priority. Consequently,
|
|
access to CPU is a function of thread MCPs, scheduling contexts and the
|
|
\obj{SchedControl} capability. The kernel will enforce that threads do not
|
|
exceed the budget in their scheduling context for any given period, and that
|
|
the highest priority thread will always run, however it is up to the system
|
|
designer to make sure the entire system is schedulable.
|
|
|
|
|
|
\subsection{Exceptions}
|
|
\label{sec:exceptions}
|
|
|
|
Each thread has two associated exception-handler endpoints, a \emph{standard}
|
|
exception handler and a \emph{timeout} exception handler, where the latter is MCS
|
|
only. If the thread causes
|
|
an exception, the kernel creates an IPC message with the relevant details and
|
|
sends this to the endpoint. This thread can then take the appropriate action.
|
|
Fault IPC messages are described in \autoref{sec:faults}. Standard exception-handler
|
|
endpoints can be set with the \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or
|
|
\apifunc{seL4\_TCB\_SetSchedParams}{tcb_setschedparams} methods while Timeout exception
|
|
handlers can be set with \apifunc{seL4\_TCB\_SetTimeoutEndpoint}{tcb_settimeoutendpoint} (MCS only).
|
|
With these methods, a
|
|
capability address for the exception handler can be associated with a thread.
|
|
This address is then used to lookup the handler endpoint, and the capability to
|
|
the endpoint is installed into the threads' kernel CNode. For threads without
|
|
an exception handler, a null capability can be used, however the consequences
|
|
are different per exception handler type. Before raising an exception the
|
|
handler capability is validated. The kernel does not perform another lookup,
|
|
but checks that the capability is an endpoint with the correct rights.
|
|
|
|
The exception endpoint must have Write and either Grant or GrantReply rights.
|
|
Replying to the exception message restarts the thread. For certain exception
|
|
types, the contents of the reply message may be used to set the values in the
|
|
registers of the thread being restarted. See \autoref{sec:faults} for details.
|
|
|
|
\subsubsection{Standard Exceptions}
|
|
|
|
The standard exception handler is used when a fault is triggered by a thread
|
|
which cannot be recovered without action by another thread. For example, if a
|
|
thread raises a fault due to an unmapped virtual memory page, the thread cannot
|
|
make any more progress until the page is mapped. If a thread experiences a
|
|
fault that would trigger the standard exception handler while it is set to a
|
|
null capability, the kernel will pause the thread and it will not run again.
|
|
This is because without action by another thread, standard exceptions cannot be
|
|
recovered from. Consequently threads without standard exception handlers
|
|
should be trusted not to fault at all.
|
|
|
|
Standard exception handlers can be passive, in which case they will run on the
|
|
scheduling context of the faulting thread.
|
|
|
|
\subsubsection{Timeout Exceptions (MCS Only)} \label{sec:timeout-exceptions}
|
|
|
|
Timeout faults are raised when a thread attempts to run but has no available
|
|
budget, and if that thread has a valid timeout exception handler capability.
|
|
The handling of timeout faults is not compulsory: if a thread does not have a
|
|
timeout fault handler, a fault will not be raised and the thread will continue
|
|
running when it's budget is replenished. This allows temporally sensitive
|
|
threads to handle budget overruns while other threads may ignore them.
|
|
|
|
Timeout faults are registered per thread, which means that while clients may
|
|
not have a timeout fault handler, servers may, allowing single-threaded,
|
|
time-sensitive, passive servers to use a timeout exception handler to recover
|
|
from malicious or untrusted clients whose budget expires while the server is
|
|
completing the request. Timeout fault handlers can access server reply objects
|
|
and reply with an error to the client, then reset the server to handle the next
|
|
client request.
|
|
|
|
If a reply message is sent to a nested server and a scheduling context without
|
|
available budget returned, another timeout fault will be generated if the
|
|
nested server also has a timeout fault handler.
|
|
|
|
\subsection{Message Layout of the Read-/Write-Registers Methods}
|
|
\label{sec:read_write_registers}
|
|
|
|
The registers of a thread can be read and written with the
|
|
\apifunc{seL4\_TCB\_ReadRegisters}{tcb_readregisters} and \apifunc{seL4\_TCB\_WriteRegisters}{tcb_writeregisters} methods.
|
|
For some registers, the kernel will silently mask certain bits or ranges of bits off, and force them to contain certain
|
|
values to ensure that they cannot be maliciously set to values that would compromise the running system, or to respect
|
|
values that the architecture specifications have mandated to be certain values.
|
|
The register contents are transferred via the IPC buffer.
|
|
|
|
\section{Faults}
|
|
\label{sec:faults}
|
|
|
|
A thread's actions may result in a fault. Faults are delivered to the
|
|
thread's exception handler so that it can take the appropriate action.
|
|
The fault type is specified in the message label and is one of:
|
|
|
|
\begin{itemize}
|
|
\item \texttt{seL4\_Fault\_CapFault}
|
|
\item \texttt{seL4\_Fault\_VMFault}
|
|
\item \texttt{seL4\_Fault\_UnknownSyscall}
|
|
\item \texttt{seL4\_Fault\_UserException}
|
|
\item \texttt{seL4\_Fault\_TimeoutFault}
|
|
\item \texttt{seL4\_Fault\_NullFault} (indicating no fault occurred and this is a normal IPC message)
|
|
\item \texttt{seL4\_Fault\_VGICMaintenence}
|
|
\item \texttt{seL4\_Fault\_VPPIEvent}
|
|
\item \texttt{seL4\_Fault\_VCPUFault}
|
|
\item \texttt{seL4\_Fault\_DebugException}
|
|
\end{itemize}
|
|
|
|
Faults are delivered in such a way as to imitate a Call from the faulting
|
|
thread. This means that to send a fault message the fault endpoint
|
|
must have Write and either Grant or GrantReply permissions. If this is not the
|
|
case, a double fault happens (generally the thread is simply suspended).
|
|
|
|
\subsection{Capability Faults}
|
|
|
|
Capability faults may occur in two places. Firstly, a capability fault
|
|
can occur when lookup of a capability referenced by a
|
|
\apifunc{seL4\_Call}{sel4_call} or \apifunc{seL4\_Send}{sel4_send} system call
|
|
failed (\apifunc{seL4\_NBSend}{sel4_nbsend} calls on
|
|
invalid capabilities silently fail). In this case, the capability
|
|
on which the fault occurred may be the capability being invoked or an
|
|
extra capability passed in the \texttt{caps} field in the IPC buffer.
|
|
|
|
Secondly, a capability fault can occur when \apifunc{seL4\_Recv}{sel4_recv} or \apifunc{seL4\_NBRecv}{sel4_nbrecv}
|
|
is called on a capability that does not exist, is not an endpoint or notification capability or does not have
|
|
receive permissions.
|
|
|
|
Replying to the fault IPC will restart the faulting thread. The contents of the
|
|
IPC message are given in \autoref{tbl:ipc_contents}.\\
|
|
|
|
\begin{table}[htb]
|
|
\noindent\begin{tabularx}{\textwidth}{XX}
|
|
\toprule
|
|
\textbf{Meaning} & \textbf{ IPC buffer location} \\
|
|
\midrule
|
|
Address at which to restart execution & \ipcbloc{seL4\_CapFault\_IP} \\
|
|
Capability address & \ipcbloc{seL4\_CapFault\_Addr} \\
|
|
In receive phase (1 if the fault happened during a receive system call, 0
|
|
otherwise) & \ipcbloc{seL4\_CapFault\_InRecvPhase} \\
|
|
Lookup failure description. As described in \autoref{sec:lookup_fail_desc} &
|
|
\ipcbloc{seL4\_CapFault\_LookupFailureType} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:ipc_contents}Contents of an IPC message.}
|
|
\end{table}
|
|
|
|
\subsection{Unknown Syscall}
|
|
\label{sec:unknown-syscall}
|
|
|
|
This fault occurs when a thread executes a system call with a syscall
|
|
number that is unknown to seL4.
|
|
The register set
|
|
of the faulting thread is passed to the thread's exception handler so that it
|
|
may, for example, emulate the system call if a thread is being
|
|
virtualised.
|
|
|
|
Replying to the fault IPC allows the thread to be restarted
|
|
and/or the thread's register set to be modified. If the reply has
|
|
a label of zero, the thread will be restarted. Additionally, if the
|
|
message length is non-zero, the faulting thread's register set will be
|
|
updated. In this case, the number of
|
|
registers updated is controlled with the length field of the message
|
|
tag.
|
|
|
|
\subsection{User Exception}
|
|
|
|
User exceptions are used to deliver architecture-defined exceptions. For
|
|
example, such an exception could occur if a user thread attempted to
|
|
divide a number by zero.
|
|
|
|
Replying to the fault IPC allows the thread to be restarted
|
|
and/or the thread's register set to be modified. If the reply has
|
|
a label of zero, the thread will be restarted. Additionally, if the
|
|
message length is non-zero, the faulting thread's register set will be
|
|
updated. In this case, the number of
|
|
registers updated is controlled with the length field of the message
|
|
tag.
|
|
|
|
\subsection{Debug Exception: Breakpoints and Watchpoints}
|
|
\label{sec:debug_exceptions}
|
|
|
|
Debug exceptions are used to deliver trace and debug related events to threads.
|
|
Breakpoints, watchpoints, trace-events and instruction-performance sampling
|
|
events are examples. These events are supported for userspace threads when the kernel
|
|
is configured to include them (when CONFIG\_HARDWARE\_DEBUG\_API is set). The hardware
|
|
debugging extensions API is supported on the following subset of the platforms that the
|
|
kernel has been ported to:
|
|
|
|
\begin{itemize}
|
|
\item PC99: IA-32 and x86\_64
|
|
\item Sabrelite (i.MX6)
|
|
\item Jetson TegraK1
|
|
\item HiSilicon Hikey
|
|
\item Raspberry Pi 3
|
|
\item Odroid-X (Exynos4)
|
|
\item Xilinx zynq7000
|
|
\end{itemize}
|
|
|
|
Information on the available hardware debugging resources is presented in the form of the following constants:
|
|
|
|
\begin{description}
|
|
\item[seL4\_NumHWBreakpoints]: Defines the total number of hardware break
|
|
registers available, of all types available on the hardware platform. On the Arm
|
|
Cortex~A7 for example, there are 6 exclusive instruction breakpoint registers,
|
|
and 4 exclusive data watchpoint registers, for a total of 10 monitor registers.
|
|
On this platform therefore, \texttt{seL4\_NumHWBreakpoints} is defined as 10.
|
|
The instruction breakpoint registers will always be assigned the lower API-IDs,
|
|
and the data watchpoints will always be assigned following them.
|
|
|
|
Additionally, \texttt{seL4\_NumExclusiveBreakpoints}, \texttt{seL4\_NumExclusiveWatchpoints}
|
|
and \texttt{seL4\_NumDualFunctionMonitors}
|
|
are defined for each target platform to reflect the number of available
|
|
hardware breakpoints/watchpoints of a certain type.
|
|
|
|
\item[seL4\_NumExclusiveBreakpoints]: Defines the number of hardware registers
|
|
capable of generating a fault \textbf{only} on instruction execution. Currently this will be
|
|
set only on Arm platforms. The API-ID of the first exclusive breakpoint is given
|
|
in \texttt{seL4\_FirstBreakpoint}. If there are no instruction-break exclusive
|
|
registers, \texttt{seL4\_NumExclusiveBreakpoints} will be set to \texttt{0} and
|
|
\texttt{seL4\_FirstBreakpoint} will be set to -1.
|
|
|
|
\item[seL4\_NumExclusiveWatchpoints]: Defines the number of hardware registers
|
|
capable of generating a fault \textbf{only} on data access. Currently this will be set only
|
|
on Arm platforms. The API-ID of the first exclusive watchpoint is given
|
|
in \texttt{seL4\_FirstWatchpoint}. If there are no data-break exclusive
|
|
registers, \texttt{seL4\_NumExclusiveWatchpoints} will be set to \texttt{0} and
|
|
\texttt{seL4\_FirstWatchpoint} will be set to -1.
|
|
|
|
\item[seL4\_NumDualFunctionMonitors]: Defines the number of hardware registers
|
|
capable of generating a fault on either type of access -- i.e, the register
|
|
supports both instruction and data breaks. Currently this will be set only on
|
|
x86 platforms. The API-ID of the first dual-function monitor is given
|
|
in \texttt{seL4\_FirstDualFunctionMonitor}. If there are no dual-function break
|
|
registers, \texttt{seL4\_NumDualFunctionMonitors} will be set to \texttt{0} and
|
|
\texttt{seL4\_FirstDualFunctionMonitor} will be set to -1.
|
|
|
|
\end{description}
|
|
|
|
\begin{table}[ht]
|
|
\begin{tabularx}{\textwidth}{XXX}
|
|
\toprule
|
|
\textbf{Value sent} & \textbf{IPC buffer location} \\
|
|
\midrule
|
|
\reg{Breakpoint instruction address} & \ipcbloc{IPCBuffer[0]} \\
|
|
\reg{Exception reason} & \ipcbloc{IPCBuffer[1]} \\
|
|
\reg{Watchpoint data access address} & \ipcbloc{IPCBuffer[2]} \\
|
|
\reg{Register API-ID} & \ipcbloc{IPCBuffer[3]} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:debug_exception_result}Debug fault message layout. The
|
|
register API-ID is not returned in the fault message from the kernel on
|
|
single-step faults.}
|
|
\end{table}
|
|
|
|
\subsection{Debug Exception: Single-stepping}
|
|
\label{sec:single_stepping_debug_exception}
|
|
|
|
The kernel provides support for the use of hardware single-stepping of userspace
|
|
threads when configured to do so (when CONFIG\_HARDWARE\_DEBUG\_API is set). To
|
|
this end it exposes the invocation, \texttt{seL4\_TCB\_ConfigureSingleStepping}.
|
|
|
|
The caller is expected to select an API-ID that corresponds to
|
|
an instruction breakpoint, to use when setting up the single-stepping
|
|
functionality (i.e, API-ID from 0 to \texttt{seL4\_NumExclusiveBreakpoints} - 1).
|
|
However, not all hardware platforms require an actual hardware breakpoint
|
|
register to provide single-stepping functionality. If the caller's hardware platform requires the
|
|
use of a hardware breakpoint register, it will use the breakpoint register given to it in \texttt{bp\_num},
|
|
and return \texttt{true} in \texttt{bp\_was\_consumed}. If the underlying platform does not need a hardware
|
|
breakpoint to provide single-stepping, seL4 will return \texttt{false} in \texttt{bp\_was\_consumed} and
|
|
leave \texttt{bp\_num} unchanged.
|
|
|
|
If \texttt{bp\_was\_consumed} is \texttt{true}, the caller should not
|
|
attempt to re-configure \texttt{bp\_num} for Breakpoint or Watchpoint usage until
|
|
the caller has disabled single-stepping and released that register, via a subsequent
|
|
call to \texttt{seL4\_TCB\_ConfigureSingleStepping}, or a fault-reply with
|
|
\texttt{n\_instr} being 0. Setting \texttt{num\_instructions} to \texttt{0}
|
|
\textbf{disables single stepping}.
|
|
|
|
On architectures that require an actual hardware registers to be configured for
|
|
single-stepping functionality, seL4 will restrict the number of registers that
|
|
can be configured as single-steppers, to one at any given time. The register that
|
|
is currently configured (if any) for single-stepping will be the implicit
|
|
\texttt{bp\_num} argument in a single-step debug fault reply.
|
|
|
|
% The description for skipping num_instructions had an off-by-one error
|
|
% I.e. num_instruction = 1 does not skip any instructions between stops
|
|
% num_instruction = 2 "skips" 1 instruction, but executes 2 instructions before stopping
|
|
|
|
The kernel's single-stepping, also supports executing a certain number of
|
|
instructions before delivering the single-step fault message. \texttt{Num\_instructions}
|
|
should be set to \texttt{1} when single-stepping, or any non-zero integer value to execute that many
|
|
instructions before resuming single-stepping. This execution-count can also be set in
|
|
the fault-reply to a single-step debug fault.
|
|
|
|
\begin{table}[h]
|
|
\begin{tabularx}{\textwidth}{XXX}
|
|
\toprule
|
|
\textbf{Value sent} & \textbf{Register set by reply} & \textbf{IPC buffer location} \\
|
|
\midrule
|
|
\reg{Breakpoint instruction address} & \texttt{num\_instructions} to execute & \ipcbloc{IPCBuffer[0]} \\
|
|
\reg{Exception reason} & --- & \ipcbloc{IPCBuffer[1]} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:single_step_exception_result}Single-step fault message layout.}
|
|
\end{table}
|
|
|
|
\subsection{Timeout Fault (MCS only)}
|
|
\label{sec:timeout-fault}
|
|
|
|
Timeout faults are raised when a thread consumes all of its budget and has a
|
|
timeout fault handler that is not a null capability. They allow a timeout
|
|
exception handler to take some action to restore the thread, and deliver a
|
|
message containing the scheduling context data word, as well as the amount of
|
|
time consumed since the last timeout fault occurred on this scheduling context,
|
|
or since \apifunc{seL4\_SchedContext\_YieldTo}{schedcontext_yieldto} or
|
|
\apifunc{seL4\_SchedContext\_Consumed}{schedcontext_consumed} was last called.
|
|
Timeout exception handlers can reply to a temporal fault with the registers set
|
|
in the same format as outlined in \autoref{sec:read_write_registers}.
|
|
|
|
\begin{table}[htb] \noindent\begin{tabularx}{\textwidth}{XX} \toprule
|
|
\textbf{Meaning} & \textbf{IPC buffer location} \\ \midrule Data word from
|
|
the scheduling context object that the thread was running on when the fault
|
|
occurred. & \ipcbloc{seL4\_TimeoutFault\_Data} \\ Upper 32-bits of
|
|
microseconds consumed since last reset &
|
|
\ipcbloc{seL4\_TimeoutFault\_Consumed} \\ Lower 32-bits of microseconds
|
|
consumed since last reset & \ipcbloc{seL4\_TimeoutFault\_Consumed\_LowBits}
|
|
\\ \bottomrule \end{tabularx}\\ \\ \caption{\label{tbl:tf_message_32}
|
|
Timeout fault outcome on 32-bit architectures.} \end{table}
|
|
|
|
% TODO: describe outcome for 64-bit architectures
|
|
|
|
\subsection{VM Fault}
|
|
\label{sec:vm-fault}
|
|
|
|
The thread caused a page fault. Replying to the fault IPC will restart
|
|
the thread. The contents of the IPC message are given below.\\
|
|
|
|
\begin{table}[htb]
|
|
\begin{tabularx}{\textwidth}{XXX}
|
|
\toprule
|
|
\textbf{Meaning} & \textbf{IPC buffer location} \\
|
|
\midrule
|
|
Program counter to restart execution at. & \ipcbloc{seL4\_VMFault\_IP} \\
|
|
Address that caused the fault. & \ipcbloc{seL4\_VMFault\_Addr} \\
|
|
Instruction fault (1 if the fault was caused by an instruction fetch). & \ipcbloc{seL4\_VMFault\_PrefetchFault} \\
|
|
Fault status register (FSR). Contains information about the cause of the fault. Architecture dependent. & \ipcbloc{seL4\_VMFault\_FSR} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:vm_fault_result_arm} VM Fault outcome on all architectures.}
|
|
\end{table}
|
|
|
|
|
|
\subsection{Arm Virtualisation Faults}
|
|
\label{sec:arm-virt-fault}
|
|
|
|
Arm with hypervisor support enabled can generate additional exceptions, see \autoref{sec:virt-arm}.
|
|
Replying to the fault IPC will restart the VCPU thread.
|
|
The contents of the IPC messages are given below.
|
|
|
|
\begin{table}[h!]
|
|
\begin{tabularx}{\textwidth}{XXX}
|
|
\toprule
|
|
\textbf{Meaning} & \textbf{IPC buffer location} \\
|
|
\midrule
|
|
List Register index, -1 when unknown. & \ipcbloc{seL4\_VGICMaintenance\_IDX} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:fault_arm_vgic} seL4\_Fault\_VGICMaintenance.}
|
|
\end{table}
|
|
|
|
\begin{table}[h!]
|
|
\begin{tabularx}{\textwidth}{XXX}
|
|
\toprule
|
|
\textbf{Meaning} & \textbf{IPC buffer location} \\
|
|
\midrule
|
|
Virtual PPI IRQ number. & \ipcbloc{seL4\_VPPIEvent\_IRQ} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:fault_arm_vppi} seL4\_Fault\_VPPIEvent.}
|
|
\end{table}
|
|
|
|
\begin{table}[h!]
|
|
\begin{tabularx}{\textwidth}{XXX}
|
|
\toprule
|
|
\textbf{Meaning} & \textbf{IPC buffer location} \\
|
|
\midrule
|
|
Register value of HSR for aarch32 and ESR for aarch64. & \ipcbloc{seL4\_VCPUFault\_HSR} \\
|
|
\bottomrule
|
|
\end{tabularx}
|
|
\caption{\label{tbl:fault_arm_vcpu} seL4\_Fault\_VCPUFault.}
|
|
\end{table}
|
|
|
|
|
|
\section{Domains}
|
|
\label{sec:domains}
|
|
|
|
Domains are used to isolate independent subsystems, so as to limit
|
|
information flow between them.
|
|
The kernel switches between domains according to a fixed, time-triggered
|
|
schedule.
|
|
The fixed schedule is compiled into the kernel via the constant
|
|
\texttt{CONFIG\_NUM\_DOMAINS} and the global variable \texttt{ksDomSchedule}.
|
|
|
|
A thread belongs to exactly one domain, and will only run when that domain
|
|
is active.
|
|
The \apifunc{seL4\_DomainSet\_Set}{domainset_set} method changes the domain
|
|
of a thread.
|
|
The caller must possess a \obj{Domain} cap and the thread's \obj{TCB} cap.
|
|
The initial thread starts with a \obj{Domain} cap (see
|
|
\autoref{sec:messageinfo}).
|
|
|
|
\section{Virtualisation}
|
|
\label{sec:virt}
|
|
|
|
Hardware execution virtualisation is supported on specific arm and x86 platforms. The interface is exposed through a series
|
|
of kernel objects, invocations and syscalls that allow the user to take advantage of hardware
|
|
virtualisation features.
|
|
|
|
Hardware virtualisation allows for a thread to perform instructions and operations as if it were
|
|
running at a higher privilege level. As higher privilege levels typically have access to
|
|
additional machine registers and other pieces of state a \obj{VCPU} object is introduced to act
|
|
as storage for this state. For simplicity we refer to this virtualised higher privileged level as
|
|
'guest mode'. \obj{VCPU}s are bound in a one-to-one relationship with a \obj{TCB} in order
|
|
to provide a thread with this ability to run in higher privilege mode. See the section on
|
|
Arm or x86 for more precise details.
|
|
|
|
\obj{VCPU} objects also have additional, architecture specific, invocations for manipulating
|
|
the additional state or other virtualisation controls provided by the hardware. Binding of
|
|
a \obj{VCPU} to a \obj{TCB} is done by an invocation on the \obj{VCPU} only, and not the \obj{TCB}.
|
|
|
|
The provided objects and invocations are, generally speaking, the thinnest possible shim over
|
|
the underlying hardware primitives and operations. As a result an in depth familiarity with
|
|
the underlying architecture specific hardware mechanisms is required to use these objects, and
|
|
such familiarity is therefore assumed in description.
|
|
|
|
\subsection{Arm}
|
|
\label{sec:virt-arm}
|
|
|
|
When a \obj{TCB} has a bound \obj{VCPU} it will have access to (virtualised) system registers,
|
|
cache and TLB maintenance instructions and be able to handle some exceptions itself.
|
|
The virtual GIC will be enabled, allowing virtual interrupt delivery.
|
|
|
|
The virtualised system registers can be modified with \apifunc{seL4\_ARM\_VCPU\_WriteRegs}{arm_vcpu_write_registers}.
|
|
By configuring the mode portion of the \texttt{SPSR\_EL1} or \texttt{cpsr} register,
|
|
for ARMv8 and ARMv7 respectively, the thread can run in guest kernel mode.
|
|
|
|
Interrupts are virtualised through the virtual GIC and need to be forwarded with
|
|
\apifunc{seL4\_ARM\_VCPU\_InjectIRQ}{arm_vcpu_inject_irq}, which provides a way to manage
|
|
Virtual GIC List Registers, a queue of pending IRQs to be delivered to the guest. To help
|
|
with managing the list, the Virtual GIC will send GIC maintenance interrupts, which are
|
|
delivered as VGIC Maintenance Faults. List Register state is saved and restored on \obj{VCPU}
|
|
context switch, but there is currently no way to do that manually.
|
|
|
|
Shared Peripheral Interrupts (SPIs) can be handled like any normal IRQs and forwarded as required.
|
|
|
|
Virtual Private Peripheral Interrupts (PPI) are trapped and delivered as VPPI Event faults and need
|
|
to be acknowledged with \apifunc{seL4\_ARM\_VCPU\_AckVPPI}{arm_vcpu_acknowledge_virtual_ppi_irq}.
|
|
|
|
In addition to the above and standard exceptions, others are delivered as VCPU Faults.
|
|
|
|
Stage 2 translation is enabled when the kernel supports virtualisation. \obj{VCPUs} will have control
|
|
over stage 1 translations and stage 2 translations will be used for the rest of the system.
|
|
As stage 2 translations use VMIDs instead of ASIDs to distinguish address spaces, VMIDs will be used
|
|
to implement seL4 ASIDs. Practically this means that there is an ASID limit of 256 for all threads,
|
|
until 16-bit VMIDs are supported. If more ASIDs are needed, ASIDs will be dynamically re-used, with
|
|
the associated cache flushing and slowdowns.
|
|
|
|
\subsection{x86}
|
|
|
|
A \obj{TCB} with a bound \obj{VCPU} has two execution modes; one is the original thread just as
|
|
if there was no bound \obj{VCPU}, and the other is the guest mode execution using the
|
|
\obj{VCPU}. Switching from regular execution mode into the guest execution mode is
|
|
done by using the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall. Executing this syscall causes the thread, whenever
|
|
it is scheduled thereafter, to execute using the higher privileged mode controlled by the \obj{VCPU}.
|
|
Should the guest execution mode generate any kind of fault, or if a message arrives
|
|
on the \obj{TCB}s bound notification, the \obj{TCB} will be switched back to regular mode
|
|
and the \apifunc{seL4\_VMEnter}{sel4_vmenter} syscall will return with a message indicating the reason for return.
|
|
|
|
\obj{VCPU} state and execution is controlled through the \apifunc{seL4\_VCPU\_ReadVMCS}{x86_vcpu_readvmcs}
|
|
and \apifunc{seL4\_VCPU\_WriteVMCS}{x86_vcpu_writevmcs} invocations.
|
|
These are very thin wrappers around the hardware \texttt{vmread} and \texttt{vmwrite} instructions and the kernel
|
|
merely does enough validation on the parameters to ensure the \obj{VCPU} is not configured
|
|
to run in such a way as to violate any kernel properties. For example, it is not possible to
|
|
disable the use of External Interrupt Exiting, as this would prevent the kernel from receiving
|
|
timer interrupts and allow the thread to monopolise CPU time.
|
|
|
|
Memory access of the guest execution mode is controlled by requiring the use of Extended
|
|
Page Tables (EPT). A series of EPT related paging structure objects (\obj{EPTPML4}, \obj{EPTPDPT}, \obj{EPTPD}, \obj{EPTPT})
|
|
exist and are manipulated in exactly the same manner as the objects for the regular virtual
|
|
address space. Once constructed a \obj{TCB} can be given an \obj{EPTPML4} as an EPT root with \apifunc{seL4\_TCB\_SetEPTRoot}{x86_set_eptroot},
|
|
which serves as the VSpace root when executing in guest mode, with the VSpace root set
|
|
with \apifunc{seL4\_TCB\_SetSpace}{tcb_setspace} or \apifunc{seL4\_TCB\_Configure}{tcb_configure}
|
|
continuing to provide translation when the TCB is executing in its normal mode.
|
|
|
|
Direct access to I/O ports can be given to the privileged execution mode through the
|
|
\apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport} invocation and allows the provided I/O port capability to be
|
|
linked to the VCPU, and a subset of its I/O port range to be made accessible to the \obj{VCPU}.
|
|
Linking means that an I/O port capability can only be used in a single \apifunc{seL4\_X86\_VCPU\_EnableIOPort}{x86_vcpu_enableioport}
|
|
invocation and a second invocation will undo the previous one. The link also means that
|
|
if the I/O port capability is deleted for any reason the access will be correspondingly removed
|
|
from the \obj{VCPU}.
|