tools: add manual for bitfield_gen.py

Add a user manual for the bitfield generator that documents syntax,
semantics and basic concepts. Should also be able to serve as main
spec for what the tool is supposed to do.

Signed-off-by: Gerwin Klein <gerwin.klein@proofcraft.systems>
This commit is contained in:
Gerwin Klein
2023-04-05 18:42:12 +10:00
parent 886fcec2b6
commit 13f0f035cf

673
tools/bitfield_gen.md Normal file
View File

@@ -0,0 +1,673 @@
<!--
Copyright 2023 Proofcraft Pty Ltd
SPDX-License-Identifier: CC-BY-SA-4.0
-->
# Bitfield Generator Manual
## Introduction
The tool `bitfield_gen.py` in the `seL4/tools/` directory generates bitfield
accessor code in C with formal specifications and proofs in Isabelle/HOL.
It takes specifications of bitfield layouts in memory such as the following
```bf_gen
block VMFault {
field address 32
field FSR 5
padding 7
field instructionFault 1
padding 15
field seL4_FaultType 4
}
```
and generates a C struct with constructor and getter/setter functions with
call-by-value and call-by-pointer semantics. It also, separately, generates a
corresponding data type definition in Isabelle/HOL with corresponding functions
in HOL for that data type, as well as proofs that the generated C code correctly
implements these functions. This includes absence of undefined behaviour in C.
The purpose of the tool is to provide bit-level access to structures in memory
with precise, deterministic and reliable layout semantics, with proofs of
correctness, and without the need for manual masking and shifting operations.
The C standard leaves C bitfields too underspecified to be fit for that purpose.
In addition to [basic blocks](#blocks) of bitfields such as above, the tool
supports [tagged unions](#tagged-unions) and various layout options for these
blocks.
## Bit Field Generator by Example
This section walks through the features of the bitfield specification languages
by example. The [full syntax](#syntax-reference) is described in the next
section for reference.
### Blocks
The basic concepts of the specification languages are blocks which contain a
list of fields. The example used in the introduction declares a block with
name `VMFault`, which contains four fields and two padding entries:
```bf_gen
block VMFault {
field address 32
field FSR 5
padding 7
field instructionFault 1
padding 15
field seL4_FaultType 4
}
```
This will generate a type `VMFault_t` in C. Assuming that the word size of the
underlying architecture is 32 bits, this `VMFault_t` type will have a size of
two machine words. The sizes of the content and padding fields in a block must
add up to a multiple of the machine word size.
The bits in the block are laid out in memory in the order they are mentioned in
the specification. That is, the left-most 32 bits (i.e. bits 63 to 32) of
`VMFault` will correspond to the `address` field, the next 5 bits to `FSR`, the
next 7 bits are inaccessible padding which is set to 0 on creation of the
bitfield, then one bit for `instructionFault`, 15 bits of padding, and finally 4
bits (bits 0 to 3 of the entire type) for `seL4_FaultType`.
The following diagram corresponds to the block specification above.
```bf_dia
│------------------------ address ----------------------------│
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
│ │ ..
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
63 32
instruction fault seL4_FaultType
│-- FSR --│ │-| │-------|
┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
.. │ │ │ │ │ │
└─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
31 27 19 3 0
```
For creating new values of the type `VMFault`, the generator will provide a
function `VMFault` that takes an argument of the word size of the machine for
the value of each field -- for the example above, this is `uint32_t`
```c
VMFault_t VMFault(uint32_t address, uint32_t FSR, uint32_t instructionFault,
uint32_t seL4_FaultType);
```
Values that do not fit in the bit size of the target field will be cut off
accordingly, i.e., as if using `val & mask(size)`.
To read the value of each field, the generator will provide the following
functions:
```c
uint32_t VMFault_get_address(VMFault_t vm_fault);
uint32_t VMFault_get_FSR(VMFault_t vm_fault);
uint32_t VMFault_get_instructionFault(VMFault_t vm_fault);
uint32_t VMFault_get_seL4_FaultType(VMFault_t vm_fault);
```
For setting one of the fields in an existing value of type `VMFault_t`:
```c
VMFault_t VMFault_set_address(VMFault_t vm_fault, uint32_t address);
VMFault_t VMFault_set_FSR(VMFault_t vm_fault, uint32_t FSR);
VMFault_t VMFault_set_instructionFault(VMFault_t vm_fault, uint32_t instructionFault);
VMFault_t VMFault_set_seL4_FaultType(VMFault_t vm_fault, uint32_t seL4_FaultType);
```
All of the above (constructor, getter, setter) are pure functions without side
effects or dependency on global state. They do contain `assert` statements that
are in effect in debug configurations. The generator declares all of them as
`static inline CONST` to enable corresponding compiler optimisations.
Often we expect such bitfields to be accessed via pointers, e.g. as part of a
buffer or a larger struct in memory or for instance a page table. Therefore the
generator also produces getter/setter functions with pointer access:
```c
uint32_t VMFault_ptr_get_address(VMFault_t *vm_fault);
uint32_t VMFault_ptr_get_FSR(VMFault_t *vm_fault);
uint32_t VMFault_ptr_get_instructionFault(VMFault_t *vm_fault);
uint32_t VMFault_ptr_get_seL4_FaultType(VMFault_t *vm_fault);
```
These are declared `PURE`, but not `CONST`, since they depend on global state.
The pointer set functions are:
```c
void VMFault_set_address(VMFault_t *vm_fault, uint32_t address);
void VMFault_set_FSR(VMFault_t *vm_fault, uint32_t FSR);
void VMFault_set_instructionFault(VMFault_t *vm_fault, uint32_t instructionFault);
void VMFault_set_seL4_FaultType(VMFault_t *vm_fault, uint32_t seL4_FaultType);
```
These are neither `PURE` nor `CONST`, but still declared as `static inline`.
> Note that the generator has a command line option `--prune` to skip functions
that are not used, and the seL4 build system makes use of that option. That
means, not necessarily all of the combinations above will appear in the
generated files for seL4, only the ones that are actually used in the rest of
the code.
### Field High
The usual access patterns for fields is that if we provide e.g. a value of 5 to
a setter function, we expect a corresponding 8-bit field to be set to the bit
pattern `0000 0101`.
In general, on a 32-bit architecture, we are providing a 32-bit value to store
in an n-bit field. One way of looking at this is that we know the top 32-n bits
of the value are always 0.
Sometimes we instead know that the *bottom* n bits of the value we provide are
0. This can happen for instance when we want to store an address that is always
aligned to a page boundary, as it would in a page table entry. This means we are
interested in the high bits of the address. The `field_high` tag allows us to
store only the significant high bits of such addresses.
For instance the base pointer stored in a page table cap in seL4 for 32-bit
RISC-V is such a case:
```bf_gen
block page_table_cap {
field capPTMappedASID 9
field_high capPTBasePtr 20
padding 3
padding 7
field capPTIsMapped 1
field_high capPTMappedAddress 20
field capType 4
}
```
The generated access methods of the `field_high capPTBasePtr` have the same type
as the ones for `field`, but only the top 20 bits of the provided value will be
stored in the setter methods as opposed to the bottom 20 bits for normal fields.
The getter methods will automatically correctly return a value that has the top
20 bits set to the stored value and the bottom 12 bits set to 0. That is, getter
and setter remain inverses of each other as long as the values provided to the
setter have the bottom 12 bits set to 0.
On most 64-bit architectures, the top bits of pointers are also determined,
because not all 64 bits are available (e.g. only 48 of them on x64). The
generator assumes `field_high` values to be used for pointers and to need to store
only the significant bits of the pointer, along with potential sign extension.
See the [Architecture Parameters][] section below for details.
For instance, if the platform is declared to have base 64 and 48 significant
bits, and we have a pointer that we know is aligned to 12 bits, we need to store
only 48-12 = 36 bits to recreate the correct original pointer value. That means
if we declare
```bf_gen
block store_more {
field some_info 16
field_high ptr 36
field other_info 12
}
```
we can store a 12-aligned pointer with `sm = store_more_set_ptr(sm, p)` and
extract it again with `p = store_more_get_ptr(sm)` without losing information.
This means, we can store 28 bits of additional information in the space that
would be taken up by a single pointer. For this it does not matter where in the
`store_more` block the `ptr` field is declared, just that it is a `field_high`
that stores 36 bits (on a 64 bit base platform with 48 significant bits).
[Architecture Parameters]: #architecture-parameters-and-canonical-pointers
### Visible Field Order
Sometimes it is useful for the constructor to take its arguments in a particular
order that is different from the field layout order in memory. It is possible to
specify such an order, enumerating the constructor field order in parentheses
after the block name:
```bf_gen
block page_table_cap(capPTBasePtr, capPTMappedASID) {
field capPTMappedASID 9
field_high capPTBasePtr 20
padding 3
}
```
All fields must be enumerated, including the tag field in tagged unions
described in the [next section](#tagged-unions).
Only the constructor is affected, the memory layout as well as getters and
setters remain the same.
### Tagged Unions
Multiple blocks that share a type field can be aggregated as a type in C into a
tagged union. For example to express an algebraic data type such as the
following
```isabelle
datatype cap = NullCap | ObjectCap (ref : "20 word")
```
in C, one might specify the following:
```bf_gen
block null_cap {
padding 28
cap_type 4
}
block object_cap {
field ref 20
padding 8
cap_type 4
}
tagged_union cap cap_type {
tag null_cap 0
tag object_cap 1
}
```
> Note that both of the block declarations for `null_cap` and `object_cap` contain
a field with the name `cap_type` at the same bit position with the same size,
and that all blocks have the same size. We will later relax the size of the tag
field, but the tool does require all blocks in a tagged union to be of the same
size and the tag field to always be at the same position.
The fragment `tagged_union cap cap_type` means that we want the generator to
produce a C type `cap_t` that is a tagged union, and that we want to use
`cap_type` as the tag field that distinguishes which variant of the union we are
operating on.
Listing `tag null_cap 0` means that we want the block `null_cap` to be part of
the tagged union and that the value `0` in field `cap_type` will indicate that
the rest of the fields are `null_cap` fields. Similarly `tag object_cap 1`
means that a value of `1` in the tag field makes the fields of the block
`object_cap` available.
This mapping must be unique, i.e., it is an error for a block to be mentioned in
more than one tagged union.
If a block is mentioned in a tagged union, its memory layout stays the same as
before, but the code that is generated for it changes types to represent the
tagged union instead. All getters and setters for the block will be prefixed
with the name of the union, and instead of the block type, they will take and
return the union type instead (e.g. `cap_t` instead of `object_cap_t` or
`null_cap_t`). The tag field will be omitted from the constructor, because it is
determined by the constructor name, and no setter methods for the tag field will
be generated. Instead there will be a generic getter function for the tag field
which does not need to know which variant of the type it is presented with. It
returns an enum whose names correspond to the block names and whose values
correspond to the associated tag values.
For instance for the specification above, we will get:
```c
// variant tags:
enum cap_tag {
cap_null_cap = 0,
cap_object_cap = 1
};
typedef enum cap_tag cap_tag_t;
// discrimator:
cap_tag_t cap_get_cap_type(cap_t cap);
// constructors:
cap_t cap_null_cap_new(void);
cap_t cap_object_cap_new(uint64_t ref);
// getters and setters for object_cap variant:
uint64_t cap_object_cap_get_ref(cap_t cap);
cap_t cap_object_cap_set_ref(cap_t cap, uint64_t ref);
uint64_t cap_object_cap_ptr_get_ref(cap_t *cap);
void cap_object_cap_ptr_set_ref(cap_t *cap, uint64_t ref);
```
The getter names are constructed as `<union>_<block>_get_<field>`. Setters and
pointer versions have names with `set`, `ptr_get`, and `ptr_set` respectively.
All functions for the tagged union `cap` will start with `cap_`.
> It is a runtime error to use a getter or setter function on the wrong block
variant. That means it is the caller's responsibility to know or check which
variant a value of `cap_t` represents before using e.g.
`cap_object_cap_get_ref()` on it. The tool generates `assert` statements in the
getters and setters. Common idioms are to check the tag in a `switch` or `if`
statement and work on the contents accordingly.
### Advanced Tag Fields
#### Tag Slices
Occasionally the tag value for a tagged union cannot be encoded in a single
field of a block, for instance when the hardware prescribes using e.g. bits 0
and 5..6 for the type of a structure and bits 1..4 for something else.
An example would be the following
```bf_gen
block null {
padding 25
field typ_h 2
padding 4
field typ_l 1
}
block content {
padding 25
field typ_h 2
field some 4
field typ_l 1
}
tagged_union example ex_type(typ_h, typ_l) {
null (0,0)
content (2,0)
}
```
Instead of a single tag field, the `example` union has two tag fields. We also
say that the tag `ex_type` has two *slices* `typ_h` and `typ_l`. The same
restrictions as for normal tags apply: they must fit the tag values and all tag
slice fields must be mentioned in all blocks of the union.
The fragment `ex_type(typ_h, typ_l)` means that we want the discriminator
function to be called `example_get_ex_type` and that the tag is made up of the
two fields `(typ_h, typ_l)`. Any positive number of fields is allowed. The
generated enumeration will still be called `example_tag` as it would be for a
single tag, and it will still contain a name for each alternative bound to a
value unique for that variant. (The values consist of the concatenation of the
bit pattern of the tag slice values).
Similarly to a single tag field, the tool will not generate getters and setters
for any of the tag fields mentioned in the union specification.
The values in the `null (0,0)` and `content (2,0)` are positional, i.e., the
example assigns the bit pattern `0b10` in field `typ_h` and `0` in field `typ_l`
to mean the `content` block and `0` in both to mean the `null` block.
#### Masks
The restriction that all tag fields have to have the same size can be relaxed
slightly. One can declare multiple possible tag sizes together with a bit mask
each that determines which of these sizes applies.
Say for instance that we have more than 16 different `cap` types, and that some
of the `cap` alternatives only have 4 bits of storage left to store the cap tag,
but others have further bits available, e.g. 8. We can declare a bit mask that
determines if the tag size is 4 or 8 and use the size 4 tags in the blocks that
have fewer bits left.
The following example achieves this:
```bf_gen
block null_cap {
padding 24
cap_type 8
}
block object_cap {
field ref 28
cap_type 4
}
tagged_union cap cap_type {
mask 4 0x0e
mask 8 0x0e
tag null_cap 15
tag object_cap 1
}
```
The block `null_cap` has an 8-bit tag field and the block `object_cap` has a
4-bit tag field of the same name. Note that both still must start at the same
bit positions.
The declarations `mask 4 0x0e` and `mask 8 0x0e` are read from the smallest to
the largest mask size. `mask 4 0x0e` declares that the default tag size is 4
bits wide, but if the bits `0b1110` (0x0e) are set, i.e. if `tag_value & 0x0e ==
0x0e`, then the next larger mask will be used. This means values 0 to 13 are
using 4 bit tags, 14 and 15 are using 8 bit tags, and all 8 bit tag values must
have at least the bits 1, 2, and 3 set.
Since there is no larger mask size after 8 in the example, the mask value there
is not important. However, the generator expects the mask of each larger level
to have at least all bits set that the masks of all smaller sizes have set, so
we chose the same value `0x0e` in the example. If we had another level, e.g. 16
bits wide afterwards, we might have chosen `0b11110` (0x1e) for the mask value
at 8 bit.
The sub mask inclusion means that the generator can check mask sizes
consecutively. I.e. if `tag_value & 0xe != 0xe`, we always know it is a 4 bit
tag, no matter how many other tag sizes are declared.
The tag mask value declared for each alternative in the tagged union must fit
the mask for the size of the tag field in the block. Remember that with `mask 4
0x0e` the `0x0e` bit pattern means that size 4 does *not* apply, and instead
the next larger size applies. So in the example, because `object_cap` has a
4-bit tag field, `object_cap` must have a tag value where the bits `0x0e` are
not all set (e.g. 1 is a legal value). Conversely, `null_cap` has an 8-bit tag,
and so at least the bits `0x0e` must all be set in the tag value (e.g. 14 or 15
works, as does 255, but not 1).
Masks and tag slices are mutually exclusive, i.e. only one of these can be used
for any particular tagged union.
### Raw Value Access
The C representation of blocks, whether in a tagged union or not, is a struct
containing an array of the architecture word type (`uint32` or `uint64`) of the
length computed from the number and width of fields in the block. Given that the
memory layout is fully specified it is possible to operate on these raw values
directly without breaking the semantics of the generated functions.
This is usually neither recommended nor necessary, and for code undergoing
formal verification strongly discouraged, because it introduces additional
verification overhead.
However, one special case is occasionally convenient: when initialising an area
of memory with 0, and reinterpreting that memory as a bitfield structure, we are
guaranteed to get a value where all fields are 0. Similarly, returning an all-0
struct from a function instead of calling the constructor with only 0 parameters
may be acceptable, although one should check the generated assembly if the
compiler was not able to optimise the constructor away -- if it does then the
constructor form is clearer and easier to use in proofs.
The tool will not access padding fields, but is not guaranteed to preserve them
when copying. Writing/reading padding fields is not guaranteed to be stable.
### Architecture Parameters and Canonical Pointers
The tool supports architectures with 32 and 64 bit word sizes. The command
```bf_gen
base 32
```
selects a 32 bit native word size. The declaration covers all subsequent blocks
and unions in the specification. It is possible to give multiple declarations,
each covering the subsequent blocks and unions, but it rarely makes sense to
pick any other value than the word size of the architecture.
Many 64 bit architectures have the concept of canonical pointers where not all
64 bits of the available space are usable for pointers, but only for example 48
bits. In this setting a pointer value would be canonical if and only if all top
bits are set when bit 47 (counting from 0) is set. This is equivalent to sign
extension, i.e. treating bit 47 as the sign bit and extending that sign bit up
to bit 63.
A potential use for this is for kernel pointers to always have bit 47 set (and
all bits above) and for user pointers to always have bit 47 unset (and all bits
above).
The bitfield generator allows specifying this canonical bit and whether the top
bits for pointers should be sign extended or not like this:
```bf_gen
base 64(48,1)
```
This means: word size is 64, pointers are 48 bits `[0..47]`, and bit 48 and
above should be set to 1 if bit 47 is set.
This only affects `field_high` fields which are assumed to be pointers.
```bf_gen
base 64(48,0)
```
means word size is 64, pointers are at most 48 bits, and no sign extension. This
is currently equivalent to just `base 64`.
It may make sense to use the `base` command multiple times for the sign
extension case. For instance, if the intention is to usually store pointers on
x64, one would declare `base 64(48,1)`, but if there is a specific block that
wants to not store a pointer, but some other value as `field_high`, one could
declare `base 64` just before and `base 64(48,1)` again just after that block.
## Syntax Reference
### Lexical Structure
The generator distinguishes identifiers, integer literals, comments, white
space, keywords, and operators. Keywords and operators are shown inline in the
grammar in the next section. The rest are defined below.
A multi-character ident must start with a letter or underscore, followed by
letters, underscores or numbers. A one-character ident must consist of one
letter. Letters are Latin letters `[a-z]` and `[A-Z]`, digits are `[0-9]`.
```bnf
IDENT ::= [A-Za-z_][a-zA-Z0-9_]+ | [A-Za-z]
```
In addition to the regular expression above, identifiers that consist only of
underscores are invalid. Valid examples are `a`, `a_20`, `_a_20_b`. Invalid
examples are `0`, `_`, `0a` or `0_a`.
Integer literals can be decimal, octal, hex, or binary, followed by an optional
`l` or `L` specifier.
```bnf
INTLIT ::= ( [1-9][0-9]*
| 0[oO]?[0-7]+
| 0[xX][0-9a-fA-F]+
| 0[bB][01]+
| 0 ) [lL]?'
```
Valid examples are `15`, `017`, `0o17`, `0x0F`, `0xF`, `0XF`, `0b1111`, `0`,
`0l`, and `15L`. Invalid examples are `10A0`, `0xFG0`, `0b20`, `090`.
Comments start with `--` or with `#` and go until the end of the line. There
are no block comments.
```bf_gen
-- this is a valid comment
## as is this
```
Whitespace are spaces and tab characters. They delimit other tokens. Consecutive
whitespace is ignored.
### Syntactic Structure
```ebnf
spec :== ( base | block | tagged_union )*
base ::= "base" ("32"|"64") canonical_spec?
canonical_spec ::= "(" INTLIT "," ("0"|"1") ")"
block ::= "block" IDENT visible_order_spec_opt? "{" fields "}"
visible_order_spec_opt ::= "(" visible_order_spec? ")"
visible_order_spec ::= IDENT ("," IDENT)*
fields ::= ( "field" IDENT INTLIT | "field_high" IDENT INTLIT | "padding" INTLIT )*
tagged_union ::= "tagged_union" IDENT IDENT tag_slices? "{" masks tags "}"
tag_slices ::= "(" IDENT ("," IDENT)* ")"
masks ::= ( "mask" INTLIT INTLIT )*
tags ::= ( "tag" IDENT tag_value )*
tag_value ::= INTLIT | "(" INTLIT ( "," INTLIT )* ")"
```
Forward references to names that are declared later in the file are allowed.
### Restrictions
The following restrictions are not represented by the grammar in the previous
section, but will be checked by the generator.
- the canonical bit must be smaller than the `base` size
- field sizes must not be larger than the `base` size
- the field sizes in a block must add up to multiples of the `base` size
- all identifiers mentioned as tags in a union must be blocks
- all blocks in a tagged union must have the same size
- all blocks in a tagged union must mention the tag field(s)
- all blocks in a tagged union must have the tag field(s) at the same bit
position
- all blocks in a tagged union must have tag fields of one of the sizes declared
in the union, or if no mask size is declared, the tag fields must all be of
the same size.
- each block can be used in at most one tagged union
- visible_order_spec must enumerate all fields if used
- tag values must fit into the size of the tag fields
- tag values must not be larger than an `int` of the underlying platform,
because they will be used in an `enum` in C.
## Correctness
Since the tool generates both the implementation and the specification, the
question arises what kind of correctness property we gain. Even if the generated
proofs are checked for correctness by Isabelle/HOL, they could still be
expressing properties that are not useful or specify behaviour that we do not
want the code to have. What we do know after Isabelle/HOL has checked the proofs
is that the code implements the generated specification and that the generated C
code is free from undefined behaviour.
To gain confidence that the generated specification is meaningful and itself has
the right properties it needs human inspection or it needs to be used in further
proofs. In the seL4 verification both are the case. In particular, the generated
specification together with the generated implementation proofs are the
interface that the C verification uses when it encounters bitfield functions. If
the generated specifications are trivial, not useful, wrong, or do not express
the behaviour of the generated C (e.g. if the generated correctness proof was
trivial), these proofs could not succeed, because the specification would then
not provide the needed semantics of the code (e.g. that loading a value after
storing it does not lose information). If the specification provides strong
enough properties to make the rest of the C verification succeed, we know that
the specification is strong enough for our needs, even though it might still be
missing additional facts or properties that the generated C code might have.
This is sufficient for strong functional correctness.
## Limitations
The tool does not currently support proofs for all combinations of pointer
access functions for variants, but it does support all those that are used in
seL4. If proof support for new combinations is needed, the tool could be
extended relatively easily.
<!--
TODO:
- semantics (Isabelle?)
- pruning
- command line options
- preprocessing used in seL4
- file extension .bf
- where can I find the generated code
- where can I find the bitfield definitions in seL4
-->