Unfortunately, waiting to evict shrubs until mdir compaction does not
work because we only have a single pcache. When we evict a bshrub we
need a pcache for writing the new btree root, but if we do this during
mdir compaction, our pcache is already busy handling the mdir
compaction. We can't do a separate pass for bshrub eviction, since this
would require tracking an unbounded number of new btree roots.
In the previous shrub design, we meticulously tracked the compacted
shrub estimate in RAM, determining exactly how the estimate would change
as a part of shrub carve operations.
This worked, but was fragile. It was easy for the shrub estimate to
diverge from the actual value, and it required quite a bit of extra code
to maintain. Since the use cases for bshrubs are growing a bit, I didn't
want to return to this design.
So here's a new approach based on emulating btree compacts/splits inside
the shrubs:
1. When a bshrub is fetched, scan the bshrub and calculate a compaction
estimate. Store this.
2. On every commit, find an upper bound of the new data being progged, and
   keep track of estimate + progged. We can get this relatively cheaply
   from the commit attr lists. What we can't get is the amount deleted,
   which is why estimate + progged can only ever overestimate.
3. When estimate + progged exceeds shrub_size, scan the bshrub again and
recalculate the estimate.
4. If the recalculated estimate still exceeds shrub_size/2, evict the
   bshrub, converting it into a btree.
As you may note, this is very close to how our btree compacts/splits
work, but emulated. In particular, evictions/splits occur at
(shrub_size/block_size)/2 in order to avoid runaway costs when the
bshrub/btree gets close to full.
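
To make the heuristic concrete, here's a rough sketch of the commit-time
bookkeeping in C. This is not the actual littlefs implementation; the types
and helpers (bshrub_t, attr_t, bshrub_rescan, bshrub_evict) are hypothetical
stand-ins:

```c
#include <stddef.h>
#include <stdint.h>

// hypothetical stand-ins for littlefs internals
typedef struct attr {
    uint32_t size;      // size of the data this attr progs
} attr_t;

typedef struct bshrub {
    uint32_t estimate;  // compaction estimate from the last full scan
    uint32_t progged;   // upper bound on data progged since that scan
} bshrub_t;

uint32_t bshrub_rescan(const bshrub_t *b);  // full scan -> new estimate
int bshrub_evict(bshrub_t *b);              // convert bshrub -> btree

// called on every commit to the bshrub
int bshrub_commit(bshrub_t *b, const attr_t *attrs, size_t attr_count,
        uint32_t shrub_size) {
    // 2. accumulate an upper bound of new data being progged, cheap to
    // get from the attr list, but blind to deletions
    for (size_t i = 0; i < attr_count; i++) {
        b->progged += attrs[i].size;
    }

    // 3. estimate + progged exceeded shrub_size? rescan the bshrub and
    // recalculate the estimate, fixing any divergence
    if (b->estimate + b->progged > shrub_size) {
        b->estimate = bshrub_rescan(b);
        b->progged = 0;

        // 4. still above shrub_size/2 after recalculating? evict the
        // bshrub into a btree; the /2, like btree splits, avoids
        // runaway compaction costs as the bshrub approaches full
        if (b->estimate > shrub_size/2) {
            return bshrub_evict(b);
        }
    }

    return 0;
}
```

The key property is that the per-commit work is just summing the attr list;
the expensive full rescan only happens when estimate + progged crosses
shrub_size, which is what keeps the amortized cost low.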
Benefits:
- This eviction heuristic is very robust. Calculating the amount progged
from the attr list is relatively cheap and easy, and any divergence
should be fixed when we recalculate the estimate.
- The runtime cost is relatively small, amortized O(log n), which is the
  existing runtime of committing to rbyds.
Downsides:
- Just like btree splits, evictions force our bshrubs to be ~1/2 full on
  average. Combined with the 2x cost of mdir pairs, the 2x cost of mdirs
  being ~1/2 full on average, and the need for both a synced and an
  unsynced copy of file bshrubs, this brings our file bshrub overhead up
  to ~16x, which is getting quite high...
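  Roughly, those factors multiply: 2x (evicted bshrubs ~1/2 full) * 2x
  (mdir pairs) * 2x (mdirs ~1/2 full) * 2x (synced + unsynced copies)
  = ~16x.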
Anyways, bshrubs now work, and the new file topology is passing testing.
An unfortunate surprise is the jump in stack cost. This seems to come from
moving the lfsr_btree_flush logic into the hot-path that includes bshrub
commit + mdir commit + all the mtree logic. Previously, the separation of
btree/shrub commits meant that the more complex block/btree/crystal logic
was on a separate path from the mdir commit logic:
                 code           stack          lfsr_file_t
before bshrubs:  31840          2072           120
after bshrubs:   30756 (-3.5%)  2448 (+15.4%)  104 (-15.4%)
I _think_ the reality is not actually as bad as measured: most of these
flush/carve/commit functions calculate some work and then commit it in
separate steps. In theory GCC's shrink-wrapping optimizations should
limit the stack to only what we need as we finish each calculation, but
our current stack measurement scripts just add up whole frames, so any
per-call stack optimizations get missed...