Files
seL4/tools/bitfield_gen.md
Gerwin Klein e9c89bcac0 bfgen: add field_ptr(align) name size
Add the new block command

    field_ptr(align) name size

as an abbreviation for

    padding size - canonical + align
    field_high name canonical - align

The (align) part is optional. If left out, align is 0.

This is useful for removing #ifdefs for pointer storage that are trying
to achieve constant size over different canonical bit representations.

Signed-off-by: Gerwin Klein <gerwin.klein@proofcraft.systems>
2026-01-09 09:00:43 +11:00

36 KiB

Bitfield Generator Manual

Introduction

The tool bitfield_gen.py in the seL4/tools/ directory generates bitfield accessor code in C with formal specifications and proofs in Isabelle/HOL.

It takes specifications of bitfield layouts in memory such as the following

block VMFault {
    field     address           32

    field     FSR               5
    padding                     7
    field     instructionFault  1
    padding                     15
    field     seL4_FaultType    4
}

and generates a C struct with constructor and getter/setter functions with call-by-value and call-by-pointer semantics. It also, separately, generates a corresponding data type definition in Isabelle/HOL with corresponding functions in HOL for that data type, as well as proofs that the generated C code correctly implements these functions. This includes absence of undefined behaviour in C.

The purpose of the tool is to provide bit-level access to structures in memory with precise, deterministic and reliable layout semantics, with proofs of correctness, and without the need for manual masking and shifting operations. The C standard leaves C bitfields too underspecified to be fit for that purpose.

In addition to basic blocks of bitfields such as above, the tool supports tagged unions and various layout options for these blocks.

Bit Field Generator by Example

This section walks through the features of the bitfield specification languages by example. The full syntax is described in the next section for reference.

Blocks

The basic concepts of the specification languages are blocks which contain a list of fields. The example used in the introduction declares a block with name VMFault, which contains four fields and two padding entries:

block VMFault {
    field     address           32

    field     FSR               5
    padding                     7
    field     instructionFault  1
    padding                     15
    field     seL4_FaultType    4
}

This will generate a type VMFault_t in C. Assuming that the word size of the underlying architecture is 32 bits, this VMFault_t type will have a size of two machine words. The sizes of the content and padding fields in a block must add up to a multiple of the machine word size.

The bits in the block are laid out in memory in the order they are mentioned in the specification. That is, the left-most 32 bits (i.e. bits 63 to 32) of VMFault will correspond to the address field, the next 5 bits to FSR, the next 7 bits are inaccessible padding which is set to 0 on creation of the bitfield, then one bit for instructionFault, 15 bits of padding, and finally 4 bits (bits 0 to 3 of the entire type) for seL4_FaultType.

The following diagram corresponds to the block specification above.

   │------------------------  address  ----------------------------│
   ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
   │                                                               │ ..
   └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
   63                                                             32

                     instruction fault                  seL4_FaultType
   │-- FSR --│             │-|                             │-------|
   ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
.. │         │             │ │                             │       │
   └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
    31      27              19                              3     0

For creating new values of the type VMFault, the generator will provide a function VMFault that takes an argument of the word size of the machine for the value of each field -- for the example above, this is uint32_t

VMFault_t VMFault(uint32_t address, uint32_t FSR, uint32_t instructionFault,
                  uint32_t seL4_FaultType);

Values that do not fit in the bit size of the target field will be cut off accordingly, i.e., as if using val & mask(size).

To read the value of each field, the generator will provide the following functions:

uint32_t VMFault_get_address(VMFault_t vm_fault);
uint32_t VMFault_get_FSR(VMFault_t vm_fault);
uint32_t VMFault_get_instructionFault(VMFault_t vm_fault);
uint32_t VMFault_get_seL4_FaultType(VMFault_t vm_fault);

For setting one of the fields in an existing value of type VMFault_t:

VMFault_t VMFault_set_address(VMFault_t vm_fault, uint32_t address);
VMFault_t VMFault_set_FSR(VMFault_t vm_fault, uint32_t FSR);
VMFault_t VMFault_set_instructionFault(VMFault_t vm_fault, uint32_t instructionFault);
VMFault_t VMFault_set_seL4_FaultType(VMFault_t vm_fault, uint32_t seL4_FaultType);

All of the above (constructor, getter, setter) are pure functions without side effects or dependency on global state. They do contain assert statements that are in effect in debug configurations. The generator declares all of them as static inline CONST to enable corresponding compiler optimisations.

Often we expect such bitfields to be accessed via pointers, e.g. as part of a buffer or a larger struct in memory or for instance a page table. Therefore the generator also produces getter/setter functions with pointer access:

uint32_t VMFault_ptr_get_address(VMFault_t *vm_fault);
uint32_t VMFault_ptr_get_FSR(VMFault_t *vm_fault);
uint32_t VMFault_ptr_get_instructionFault(VMFault_t *vm_fault);
uint32_t VMFault_ptr_get_seL4_FaultType(VMFault_t *vm_fault);

These are declared PURE, but not CONST, since they depend on global state.

The pointer set functions are:

void VMFault_set_address(VMFault_t *vm_fault, uint32_t address);
void VMFault_set_FSR(VMFault_t *vm_fault, uint32_t FSR);
void VMFault_set_instructionFault(VMFault_t *vm_fault, uint32_t instructionFault);
void VMFault_set_seL4_FaultType(VMFault_t *vm_fault, uint32_t seL4_FaultType);

These are neither PURE nor CONST, but still declared as static inline.

Note that the generator has a command line option --prune to skip functions that are not used, and the seL4 build system makes use of that option. That means, not necessarily all of the combinations above will appear in the generated files for seL4, only the ones that are actually used in the rest of the code.

Expressions

All positions in the specification that allow numbers also allow basic arithmetic integer expressions with +, -, *, /, and %. The commands field, padding, field_high, and field_ptr also allow the constants word_size and canonical_size. word_size stands for the number of bits in a word, and canonical_size for the number of canonical bits in a pointer, both according most recent base declaration (see also Architecture Parameters).

For example, the VMFault example from Section Blocks could have been written as follows, assuming a base 32 definition for `word_size.

block VMFault {
    field     address           word_size

    field     FSR               5
    padding                     7
    field     instructionFault  1
    padding                     word_size - (5 + 7 + 1) - 4
    field     seL4_FaultType    4
}

Field High

The usual access patterns for fields is that if we provide e.g. a value of 5 to a setter function, we expect a corresponding 8-bit field to be set to the bit pattern 0000 0101.

In general, on a 32-bit architecture, we are providing a 32-bit value to store in an n-bit field. One way of looking at this is that we know the top 32-n bits of the value are always 0.

Sometimes we instead know that the bottom n bits of the value we provide are 0. This can happen for instance when we want to store an address that is always aligned to a page boundary, as it would in a page table entry. This means we are interested in the high bits of the address. The field_high tag allows us to store only the significant high bits of such addresses.

For instance the base pointer stored in a page table cap in seL4 for 32-bit RISC-V is such a case:

block page_table_cap {
    field       capPTMappedASID     9
    field_high  capPTBasePtr        20
    padding                         3

    padding                         7
    field       capPTIsMapped       1
    field_high  capPTMappedAddress  20
    field       capType             4
}

The generated access methods of the field_high capPTBasePtr have the same type as the ones for field, but only the top 20 bits of the provided value will be stored in the setter methods as opposed to the bottom 20 bits for normal fields.

The getter methods will automatically correctly return a value that has the top 20 bits set to the stored value and the bottom 12 bits set to 0. That is, getter and setter remain inverses of each other as long as the values provided to the setter have the bottom 12 bits set to 0.

On most 64-bit architectures, the top bits of pointers are also determined, because not all 64 bits are available (e.g. only 48 of them on x64). The generator assumes field_high values to be used for pointers and to need to store only the significant bits of the pointer, along with potential sign extension. See the Architecture Parameters section below for details.

For instance, if the platform is declared to have base 64 and 48 significant bits, and we have a pointer that we know is aligned to 12 bits, we need to store only 48-12 = 36 bits to recreate the correct original pointer value. That means if we declare

block store_more {
    field       some_info  16
    field_high  ptr        36
    field       other_info 12
}

we can store a 12-aligned pointer with sm = store_more_set_ptr(sm, p) and extract it again with p = store_more_get_ptr(sm) without losing information. This means, we can store 28 bits of additional information in the space that would be taken up by a single pointer. For this it does not matter where in the store_more block the ptr field is declared, just that it is a field_high that stores 36 bits (on a 64 bit base platform with 48 significant bits).

Field Ptr

Often we want to store a pointer with some known alignment (i.e. not all bits need to be stored), and pad out to a specific overall field size. The field_ptr abbreviation makes this convenient.

block some_block {
    ...
    field_ptr(align) some_name size
    ...
}

stands for "a field some_name for a pointer with alignment align padded out to size". It is an abbreviation for

block some_block {
    ...
    padding size - canonical_size + align
    field_high some_name canonical_size - align
    ...
}

The (align) parameter can be left out and is equivalent to field_ptr(0). The constant canonical_size in the expression above stands for the number of significant bits (the canonical size) of a pointer according to the most recent base definition. For example the long form

base 64(48,1)

block mdb_node {
    padding 16
    field_high mdbNext 46
    field mdbRevocable 1
    field mdbFirstBadged 1

    field mdbPrev 64
}

can be written more concisely as

base 64(48,1)

block mdb_node {
    field_ptr(2) mdbNext 62
    field mdbRevocable 1
    field mdbFirstBadged 1

    field mdbPrev 64
}

The field_ptr abbreviation is especially useful for sharing the same block definition under different base declarations, e.g. for different architectures. For example, with a base 64(39,1) the field_ptr definition text remains the same, but expands to a different field_high/padding combination that achieves the same overall width:

base 64(39,1)

block mdb_node {
    padding 25
    field_high mdbNext 37
    field mdbRevocable 1
    field mdbFirstBadged 1

    field mdbPrev 64
}

can be also be written as

base 64(39,1)

block mdb_node {
    field_ptr(2) mdbNext 62
    field mdbRevocable 1
    field mdbFirstBadged 1

    field mdbPrev 64
}

This means the same block definition can produce a constant field size while correctly accommodating multiple different base configurations.

Visible Field Order

Sometimes it is useful for the constructor to take its arguments in a particular order that is different from the field layout order in memory. It is possible to specify such an order, enumerating the constructor field order in parentheses after the block name:

block page_table_cap(capPTBasePtr, capPTMappedASID) {
    field       capPTMappedASID     9
    field_high  capPTBasePtr        20
    padding                         3
}

All fields must be enumerated, including the tag field in tagged unions described in the next section.

Only the constructor is affected, the memory layout as well as getters and setters remain the same.

Tagged Unions

Multiple blocks that share a type field can be aggregated as a type in C into a tagged union. For example to express an algebraic data type such as the following

datatype cap = NullCap | ObjectCap (ref : "20 word")

in C, one might specify the following:

block null_cap {
    padding    28
    cap_type    4
}

block object_cap {
    field ref  20
    padding     8
    cap_type    4
}

tagged_union cap cap_type {
    tag null_cap     0
    tag object_cap   1
}

Note that both of the block declarations for null_cap and object_cap contain a field with the name cap_type at the same bit position with the same size, and that all blocks have the same size. We will later relax the size of the tag field, but the tool does require all blocks in a tagged union to be of the same size and the tag field to always be at the same position.

The fragment tagged_union cap cap_type means that we want the generator to produce a C type cap_t that is a tagged union, and that we want to use cap_type as the tag field that distinguishes which variant of the union we are operating on.

Listing tag null_cap 0 means that we want the block null_cap to be part of the tagged union and that the value 0 in field cap_type will indicate that the rest of the fields are null_cap fields. Similarly tag object_cap 1 means that a value of 1 in the tag field makes the fields of the block object_cap available.

This mapping must be unique, i.e., it is an error for a block to be mentioned in more than one tagged union.

If a block is mentioned in a tagged union, its memory layout stays the same as before, but the code that is generated for it changes types to represent the tagged union instead. All getters and setters for the block will be prefixed with the name of the union, and instead of the block type, they will take and return the union type instead (e.g. cap_t instead of object_cap_t or null_cap_t). The tag field will be omitted from the constructor, because it is determined by the constructor name, and no setter methods for the tag field will be generated. Instead there will be a generic getter function for the tag field which does not need to know which variant of the type it is presented with. It returns an enum whose names correspond to the block names and whose values correspond to the associated tag values.

For instance for the specification above, we will get:

// variant tags:
enum cap_tag {
    cap_null_cap = 0,
    cap_object_cap = 1
};
typedef enum cap_tag cap_tag_t;

// discrimator:
cap_tag_t cap_get_cap_type(cap_t cap);

// constructors:
cap_t cap_null_cap_new(void);
cap_t cap_object_cap_new(uint64_t ref);

// getters and setters for object_cap variant:
uint64_t cap_object_cap_get_ref(cap_t cap);
cap_t cap_object_cap_set_ref(cap_t cap, uint64_t ref);

uint64_t cap_object_cap_ptr_get_ref(cap_t *cap);
void cap_object_cap_ptr_set_ref(cap_t *cap, uint64_t ref);

The getter names are constructed as <union>_<block>_get_<field>. Setters and pointer versions have names with set, ptr_get, and ptr_set respectively. All functions for the tagged union cap will start with cap_.

It is a runtime error to use a getter or setter function on the wrong block variant. That means it is the caller's responsibility to know or check which variant a value of cap_t represents before using e.g. cap_object_cap_get_ref() on it. The tool generates assert statements in the getters and setters. Common idioms are to check the tag in a switch or if statement and work on the contents accordingly.

Advanced Tag Fields

Tag Slices

Occasionally the tag value for a tagged union cannot be encoded in a single field of a block, for instance when the hardware prescribes using e.g. bits 0 and 5..6 for the type of a structure and bits 1..4 for something else.

An example would be the following

block null {
    padding     25
    field typ_h  2
    padding      4
    field typ_l  1
}

block content {
    padding     25
    field typ_h  2
    field some   4
    field typ_l  1
}

tagged_union example ex_type(typ_h, typ_l) {
    null    (0,0)
    content (2,0)
}

Instead of a single tag field, the example union has two tag fields. We also say that the tag ex_type has two slices typ_h and typ_l. The same restrictions as for normal tags apply: they must fit the tag values and all tag slice fields must be mentioned in all blocks of the union.

The fragment ex_type(typ_h, typ_l) means that we want the discriminator function to be called example_get_ex_type and that the tag is made up of the two fields (typ_h, typ_l). Any positive number of fields is allowed. The generated enumeration will still be called example_tag as it would be for a single tag, and it will still contain a name for each alternative bound to a value unique for that variant. (The values consist of the concatenation of the bit pattern of the tag slice values).

Similarly to a single tag field, the tool will not generate getters and setters for any of the tag fields mentioned in the union specification.

The values in the null (0,0) and content (2,0) are positional, i.e., the example assigns the bit pattern 0b10 in field typ_h and 0 in field typ_l to mean the content block and 0 in both to mean the null block.

Masks

The restriction that all tag fields have to have the same size can be relaxed slightly. One can declare multiple possible tag sizes together with a bit mask each that determines which of these sizes applies.

Say for instance that we have more than 16 different cap types, and that some of the cap alternatives only have 4 bits of storage left to store the cap tag, but others have further bits available, e.g. 8. We can declare a bit mask that determines if the tag size is 4 or 8 and use the size 4 tags in the blocks that have fewer bits left.

The following example achieves this:

block null_cap {
    padding    24
    cap_type    8
}

block object_cap {
    field ref  28
    cap_type    4
}

tagged_union cap cap_type {
    mask 4 0x0e
    mask 8 0x0e
    tag null_cap    15
    tag object_cap   1
}

The block null_cap has an 8-bit tag field and the block object_cap has a 4-bit tag field of the same name. Note that both still must start at the same bit positions.

The declarations mask 4 0x0e and mask 8 0x0e are read from the smallest to the largest mask size. mask 4 0x0e declares that the default tag size is 4 bits wide, but if the bits 0b1110 (0x0e) are set, i.e. if tag_value & 0x0e == 0x0e, then the next larger mask will be used. This means values 0 to 13 are using 4 bit tags, 14 and 15 are using 8 bit tags, and all 8 bit tag values must have at least the bits 1, 2, and 3 set.

Since there is no larger mask size after 8 in the example, the mask value there is not important. However, the generator expects the mask of each larger level to have at least all bits set that the masks of all smaller sizes have set, so we chose the same value 0x0e in the example. If we had another level, e.g. 16 bits wide afterwards, we might have chosen 0b11110 (0x1e) for the mask value at 8 bit.

The sub mask inclusion means that the generator can check mask sizes consecutively. I.e. if tag_value & 0xe != 0xe, we always know it is a 4 bit tag, no matter how many other tag sizes are declared.

The tag mask value declared for each alternative in the tagged union must fit the mask for the size of the tag field in the block. Remember that with mask 4 0x0e the 0x0e bit pattern means that size 4 does not apply, and instead the next larger size applies. So in the example, because object_cap has a 4-bit tag field, object_cap must have a tag value where the bits 0x0e are not all set (e.g. 1 is a legal value). Conversely, null_cap has an 8-bit tag, and so at least the bits 0x0e must all be set in the tag value (e.g. 14 or 15 works, as does 255, but not 1).

Masks and tag slices are mutually exclusive, i.e. only one of these can be used for any particular tagged union.

Raw Value Access

The C representation of blocks, whether in a tagged union or not, is a struct containing an array of the architecture word type (uint32 or uint64) of the length computed from the number and width of fields in the block. Given that the memory layout is fully specified it is possible to operate on these raw values directly without breaking the semantics of the generated functions.

This is usually neither recommended nor necessary, and for code undergoing formal verification strongly discouraged, because it introduces additional verification overhead.

However, one special case is occasionally convenient: when initialising an area of memory with 0, and reinterpreting that memory as a bitfield structure, we are guaranteed to get a value where all fields are 0. Similarly, returning an all-0 struct from a function instead of calling the constructor with only 0 parameters may be acceptable, although one should check the generated assembly if the compiler was not able to optimise the constructor away -- if it does then the constructor form is clearer and easier to use in proofs.

The tool will not access padding fields, but is not guaranteed to preserve them when copying. Writing/reading padding fields is not guaranteed to be stable.

Architecture Parameters and Canonical Pointers

The tool supports architectures with 32 and 64 bit word sizes. The command

base 32

selects a 32 bit native word size. The declaration covers all subsequent blocks and unions in the specification. It is possible to give multiple declarations, each covering the subsequent blocks and unions, but it rarely makes sense to pick any other value than the word size of the architecture.

Many 64 bit architectures have the concept of canonical pointers where not all 64 bits of the available space are usable for pointers, but only for example 48 bits. In this setting a pointer value would be canonical if and only if all top bits are set when bit 47 (counting from 0) is set. This is equivalent to sign extension, i.e. treating bit 47 as the sign bit and extending that sign bit up to bit 63.

A potential use for this is for kernel pointers to always have bit 47 set (and all bits above) and for user pointers to always have bit 47 unset (and all bits above).

The bitfield generator allows specifying this canonical bit and whether the top bits for pointers should be sign extended or not like this:

base 64(48,1)

This means: word size is 64, pointers are 48 bits [0..47], and bit 48 and above should be set to 1 if bit 47 is set.

This only affects field_high fields which are assumed to be pointers.

base 64(48,0)

means word size is 64, pointers are at most 48 bits, and no sign extension. This is currently equivalent to just base 64.

It may make sense to use the base command multiple times for the sign extension case. For instance, if the intention is to usually store pointers on x64, one would declare base 64(48,1), but if there is a specific block that wants to not store a pointer, but some other value as field_high, one could declare base 64 just before and base 64(48,1) again just after that block.

The base declaration sets the values of the constants word_size and canonical_size that can be used in size expressions. For example, after the declaration base 64(48,1), word_size will be 64 and canonical_size 48. After base 32, word_size and canonical_size will both be 32.

Syntax Reference

Lexical Structure

The generator distinguishes identifiers, integer literals, comments, white space, keywords, and operators. Keywords and operators are shown inline in the grammar in the next section. The rest are defined below.

A multi-character ident must start with a letter or underscore, followed by letters, underscores or numbers. A one-character ident must consist of one letter. Letters are Latin letters [a-z] and [A-Z], digits are [0-9].

IDENT ::= [A-Za-z_][a-zA-Z0-9_]+ | [A-Za-z]

In addition to the regular expression above, identifiers that consist only of underscores are invalid. Valid examples are a, a_20, _a_20_b. Invalid examples are 0, _, 0a or 0_a.

Integer literals can be decimal, octal, hex, or binary, followed by an optional l or L specifier.

INTLIT ::= (  [1-9][0-9]*
            | 0[oO]?[0-7]+
            | 0[xX][0-9a-fA-F]+
            | 0[bB][01]+
            | 0 ) [lL]?'

Valid examples are 15, 017, 0o17, 0x0F, 0xF, 0XF, 0b1111, 0, 0l, and 15L. Invalid examples are 10A0, 0xFG0, 0b20, 090.

Comments start with -- or with # and go until the end of the line. There are no block comments.

-- this is a valid comment
## as is this

Whitespace are spaces and tab characters. They delimit other tokens. Consecutive whitespace is ignored.

Syntactic Structure

spec :== ( base | block | tagged_union )*

base ::= "base" ("32"|"64") canonical_spec?
canonical_spec ::= "(" expr "," ("0"|"1") ")"

block ::= "block" IDENT visible_order_spec_opt? "{" fields "}"
visible_order_spec_opt ::= "(" visible_order_spec? ")"
visible_order_spec ::= IDENT ("," IDENT)*
fields ::= ( "field" IDENT expr |
             "field_high" IDENT expr |
             "field_ptr" ("(" expr ")")? IDENT expr |
             "padding" expr )*

tagged_union ::= "tagged_union" IDENT IDENT tag_slices? "{" masks tags "}"

masks ::= ( "mask" expr expr )*
tags ::= ( "tag" IDENT tag_value )*
tag_slices ::= "(" IDENT ("," IDENT)* ")"
tag_value ::= expr | "(" expr ( "," expr )* ")"

expr ::= expr ("+" | "-" | "*" | "/" | "%") expr
       | "-" expr | "(" expr ")" | INTLIT
       | "word_size" | "canonical_size"

Forward references to names that are declared later in the file are allowed. Multiplicative operators "*", "/", "%" bind stronger than "+" and "-". Binary operators are left-associative. Unary minus binds stronger than all of the other operators. The constants "word_size" and "canonical_size" are only available in "field", "field_high", "field_ptr", and "padding" expressions and refer to the most recent corresponding "base" declaration.

Restrictions

The following restrictions are not represented by the grammar in the previous section, but will be checked by the generator.

  • the canonical bit must be smaller than the base size
  • field sizes must not be larger than the base size
  • all field sizes must be positive
  • all padding sizes must be non-negative
  • the field sizes in a block must add up to multiples of the base size
  • all identifiers mentioned as tags in a union must be blocks
  • all blocks in a tagged union must have the same size
  • all blocks in a tagged union must mention the tag field(s)
  • all blocks in a tagged union must have the tag field(s) at the same bit position
  • all blocks in a tagged union must have tag fields of one of the sizes declared in the union, or if no mask size is declared, the tag fields must all be of the same size.
  • each block can be used in at most one tagged union
  • visible_order_spec must enumerate all fields if used
  • tag values must fit into the size of the tag fields
  • tag values must not be larger than an int of the underlying platform, because they will be used in an enum in C.

Correctness

Since the tool generates both the implementation and the specification, the question arises what kind of correctness property we gain. Even if the generated proofs are checked for correctness by Isabelle/HOL, they could still be expressing properties that are not useful or specify behaviour that we do not want the code to have. What we do know after Isabelle/HOL has checked the proofs is that the code implements the generated specification and that the generated C code is free from undefined behaviour.

To gain confidence that the generated specification is meaningful and itself has the right properties it needs human inspection or it needs to be used in further proofs. In the seL4 verification both are the case. In particular, the generated specification together with the generated implementation proofs are the interface that the C verification uses when it encounters bitfield functions. If the generated specifications are trivial, not useful, wrong, or do not express the behaviour of the generated C (e.g. if the generated correctness proof was trivial), these proofs could not succeed, because the specification would then not provide the needed semantics of the code (e.g. that loading a value after storing it does not lose information). If the specification provides strong enough properties to make the rest of the C verification succeed, we know that the specification is strong enough for our needs, even though it might still be missing additional facts or properties that the generated C code might have. This is sufficient for strong functional correctness.

Limitations

The tool does not currently support proofs for all combinations of pointer access functions for variants, but it does support all those that are used in seL4. If proof support for new combinations is needed, the tool could be extended relatively easily.

Command Line

Usage

The tool can be invoked from the command line as follows

bitfield_gen.py [options] <input.bf> <outfile>

where <input.bf> is a bitfield specification file as described in this manual. If no file is given, the tool reads from stdin, and if no output file is given, the tool prints to stdout.

By default, the output is generated C code. See options below for how to additionally produce Isabelle/HOL theory files for specifications and proofs.

Options

-h, --help:
Show help message and exit.

--environment={seL4,libseL4}:
Whether to produce code for the seL4 or libseL4 environment. This influences which header files are included, what statement to use for assert, what type names uint32 etc have, and how a function is declared inline. The values are determined by the INCLUDES, ASSERTS, INLINE, and TYPES dictionaries in the generator source.

--hol_defs:
Produce Isabelle/HOL theory files with specifications and definitions.

--hol_proofs:
Produce Isabelle/HOL theory files with proofs, assuming specification and definition files exist. Requires the --umm_types option.

--sorry_lemmas:
When producing proofs, only state lemmas, but do not prove them. This requires explicitly setting quick_and_dirty in Isabelle and is useful for reducing processing time during other proof development.

--prune=<file>:
Append <file> to the list of files with pruning information, which contain the names of bitfield functions that are used in the client code base. The option can be used multiple times. If this option is used, the generator will only produce the functions that are listed in one of these files. This is useful to reduce code size, processing time, and especially proof processing time drastically. It does make it harder to discover which functions are generated by the tool, so if that is important, consider leaving this option off until proofs are needed.

--toplevel=<type>:
Append Isabelle/HOL <type> name to the list of top-level heap types. The proofs will generate appropriate frame conditions for all of these top-level heap types. The option can be used multiple times.

--umm_types=<file>:
Path to the file umm_types.txt which contains dependency information between the top-level heap types (e.g. which structs contain which other structs or type fields). The file can be generated using spec/cspec/mk_umm_types.py in the l4v repository. (UMM = unified memory model).

--cspec-dir=<dir>:
Location of the cspec directory containing the theory file KernelState_C. This is specific to the seL4 proof setup.

--thy-output-path=<path>:
Where to put the generated theory files.

--skip_modifies:
Do not emit "modifies" proofs. These proofs state and show which global state is modified by each generated function. They are weaker versions of the full specifications and may not be needed if the full specifications are always used instead.

--showclasses:
Print the parsed classes and fields.

--debug:
Print debug information during code and proof generation.

--from_file=<FROM_FILE>:
Original source file before preprocessing. Use this to produce a Generated from FROM_FILE comment in the output.

seL4 build

In the seL4 build system, .bf files are usually first passed through the C preprocessor to process #include and #ifdef statements, as well as perform macro expansion, which are currently not part of the bitfield specification language.

The source files containing C preprocessor macros and includes are called .bf. There are the following main files in seL4:

  • include/structures_32.bf and include/structures_64.bf contain generic kernel objects and capabilities for 32 and 64 bit systems
  • include/arch/**/object/structures.bf contain architecture dependent object and capability definitions.
  • some plat/** directories contain hardware.bf files for hardware structures.
  • libseL4 contains some bitfield definitions that are shared with the kernel in libsel4/sel4_arch_include/*/sel4/sel4_arch/types.bf, libsel4/arch_include/*/sel4/arch/shared_types.bf, and libsel4/mode_include/*32*/sel4/shared_types.bf.

The generated C code contains a comment which .bf include hierarchy the definitions are coming from.

The generated code can be found in a build directory under the generated directory, where e.g. structures_gen.h will have been generated from structures.bf.pbf, which in turn is the preprocessed source of structures.bf.

Note that the seL4 build uses the --prune option, which means that the generator will only produce functions that are mentioned in the rest of the seL4 sources. To generate a function that is not yet mentioned, simply use a supported function name pattern in the source (e.g. <block>_get_<field>) and the generator will supply it. To discover which functions can be generated, either use the name template descriptions in this manual or consider temporarily removing the --prune option in the cmake build to look at the generated sources. This should only be done when no proofs are required, because it will drastically increase proof processing time.