Much like the erased-state checksums in our rbyds (ecksums), these
block-level erased-state checksums (becksums) allow us to detect failed
progs to erased parts of a block and are key to achieving efficient
incremental write performance with large blocks and frequent power
cycles/open-close cycles.
These are also key to achieving _reasonable_ write performance for
simple writes (linear, non-overwriting), since littlefs now relies
solely on becksums to efficiently append to blocks.
Though I suppose the previous block staging logic used with the CTZ
skip-list could be brought back to make becksums optional and avoid
btree lookups during simple writes (we do a _lot_ of btree
lookups)... I'll leave this open as a future optimization...
Unlike in-rbyd ecksums, becksums need to be stored out-of-band so our
data blocks only contain raw data. Since they are optional, an
additional tag in the file's btree makes sense.
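
As a rough sketch of how a becksum gates an append (the names and
signatures here are made up, only the shape matters):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    // placeholder checksum standing in for whatever checksum the
    // filesystem actually uses on-disk
    static uint32_t fs_cksum(uint32_t cksum, const void *buffer, size_t size) {
        const uint8_t *b = buffer;
        for (size_t i = 0; i < size; i++) {
            cksum = (cksum << 5) + cksum + b[i];
        }
        return cksum;
    }

    // before appending, check that the erased tail of the block still
    // matches the becksum stored out-of-band in the file's btree; if it
    // doesn't, a failed prog may have disturbed the erased state and we
    // need to fall back to allocating a new block
    static bool block_can_append(const uint8_t *erased_tail, size_t tail_size,
            uint32_t becksum) {
        return fs_cksum(0, erased_tail, tail_size) == becksum;
    }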
Becksums are relatively simple, but they bring some challenges:
1. Adding becksums to file btrees is the first case we have for multiple
struct tags per btree id.
This isn't too complicated a problem, but requires some new internal
btree APIs.
Looking forward, which I probably shouldn't be doing this often,
multiple struct tags will also be useful for parity and content ids
as a part of data redundancy and data deduplication, though I think
it's uncontroversial to consider these both heavier-weight features...
2. Becksums only work if unfilled blocks are aligned to the prog_size.
This is the whole point of crystal_size -- to provide temporary
storage for unaligned writes -- but actually aligning the block
during writes turns out to be a bit tricky without a bunch of
unnecessary btree lookups (we already do too many btree lookups!).
The current implementation here discards the pcache to force
alignment, taking advantage of the requirement that
cache_size >= prog_size, but this is corrupting our block checksums.
Code cost:
code stack
before: 31248 2792
after: 32060 (+2.5%) 2864 (+2.5%)
Also lfsr_ftree_flush needs work. I'm usually open to gotos in C when
they improve internal logic, but even for me, the multiple goto jumps
from every left-neighbor lookup into the block writing loop are a bit
much...
The lookups are reduced mainly by not looking up left neighbors after
the first of many block/fragment/crystal writes.
The right neighbors should already be avoiding redundant lookups since
redundant lookups imply a full fragment/block on the not-last write.
Additionally, once we fail our crystallization check, we can assume all
future fragment writes will fail, so we only need to do that lookup
once.
There's probably still more to optimize, but the way these heuristics
interact is tricky...
This combines the previous lfsr_data_truncate/lfsr_data_fruncate
behavior into a single flexible function, and makes truncate/fruncate
small aliases (drop in the future?).
The combined behavior lets us adopt lfsr_data_slice in more places.
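
With guessed signatures and a simplified lfsr_data_t, the aliases end up
looking something like this:

    #include <stdint.h>

    typedef uint32_t lfs_size_t;

    // a simplified stand-in for lfsr_data_t, which in reality can also
    // reference on-disk data
    typedef struct lfsr_data {
        const uint8_t *buffer;
        lfs_size_t size;
    } lfsr_data_t;

    // the single flexible function: keep only [off, off+size) of the data
    static lfsr_data_t lfsr_data_slice(lfsr_data_t data,
            lfs_size_t off, lfs_size_t size) {
        if (off > data.size) {
            off = data.size;
        }
        if (size > data.size - off) {
            size = data.size - off;
        }
        return (lfsr_data_t){.buffer = data.buffer + off, .size = size};
    }

    // truncate keeps the front, fruncate keeps the back
    static lfsr_data_t lfsr_data_truncate(lfsr_data_t data, lfs_size_t size) {
        return lfsr_data_slice(data, 0, size);
    }

    static lfsr_data_t lfsr_data_fruncate(lfsr_data_t data, lfs_size_t size) {
        return lfsr_data_slice(data,
                (size < data.size) ? data.size - size : 0, size);
    }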
Looking forward, bptr checksums provide an easy mechanism to validate
data residing in blocks. This extends the merkle-tree-like nature of the
filesystem all the way down to the data level, and is common in other
COW filesystems.
Two interesting things to note:
1. We don't actually check data-level checksums yet, but we do calculate
data-level checksums unconditionally.
Writing checksums is easy, but validating checksums is a bit more
tricky. This is made a bit harder for littlefs, since we can't hold
an entire block of data in RAM, so we have to choose between separate
bus transactions for checksum + data reads, or extremely expensive
overreads on every read.
Note this already exists at the metadata level: the separate bus
transactions for rbyd fetch + rbyd lookup mean we _are_ susceptible
to a very small window where bit errors can get through.
But anyways, writing checksums is easy, and has basically no cost
since we are already processing the data for our write. So we might
as well write the data-level checksums at all times, even if we
aren't validating at the data level.
2. To make bptr checksums work cheaply we need an additional cksize
field to indicate how much data is checksummed.
This field seems redundant when we already have the bptr's data size,
but if we didn't have this field, we would be forced to recalculate
the checksum every time a block is sliced. This would be
unreasonable.
The immutable cksize field does mean we may be checksumming more data
than we need to when validating, but we should be avoiding small
block slices anyways for storage cost reasons.
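
For illustration, the bptr now carries something like the following
(the exact field names are guesses):

    #include <stdint.h>

    typedef uint32_t lfs_block_t;
    typedef uint32_t lfs_size_t;

    typedef struct lfsr_bptr {
        lfs_block_t block;   // block the data lives in
        lfs_size_t off;      // offset of this slice in the block
        lfs_size_t size;     // size of this slice, changes when sliced
        lfs_size_t cksize;   // how much of the block the cksum covers,
                             // immutable so slicing never needs to
                             // recalculate the checksum
        uint32_t cksum;      // checksum over the first cksize bytes
    } lfsr_bptr_t;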
This does add some stack cost because our bptr struct is larger now:
code stack
before: 31200 2768
after: 31272 (+0.2%) 2800 (+1.1%)
This avoids redundant lookups when holes are involved.
And we don't really leverage data holes as an abstraction well. We use
data holes in two places, but they do two different things, so they may
as well be specialized operations.
code stack
before: 31092 2752
after: 31200 (+0.3%) 2768 (+0.6%)
Instead of writing every possible config that has the potential to be
useful in the future, stick to just writing the configs that we know are
useful, and error if we see any configs we don't understand.
This prevents unnecessary config bloat, while still allowing configs to
be introduced in a backwards compatible way in the future.
Currently unknown configs are treated as a mount error, but in theory
you could still try to read the filesystem, just with potentially
corrupted data. Maybe this could be behind some sort of "FORCE" mount
flag. littlefs must never write to the filesystem if it finds unknown
configs.
---
This also creates a curious case for the hole in our tag encoding
previously taken up by the OCOMPATFLAGS config. We can query for any
config > SIZELIMIT with lookupnext, but the OCOMPATFLAGS flag would need
an extra lookup which just isn't worth it.
Instead I'm just adding OCOMPATFLAGS back in. To support OCOMPATFLAGS
littlefs has to do literally nothing, so this is really more of a
documentation change. And who knows, maybe OCOMPATFLAGS will have some
weird use case in the future...
Also:
- Renamed GSTATE -> GDELTA for gdelta tags. GSTATE tags added as
separate in-device flags. The GSTATE tags were already serving
this dual purpose.
- Renamed BSHRUB* -> SHRUB when the tag is not necessarily operating
on a file bshrub.
- Renamed TRUNK -> BSHRUB
The tag encoding space now has a couple funky holes:
- 0x0005 - Hole for aligning config tags.
I guess this could be used for OCOMPATFLAGS in the future?
- 0x0203 - Hole so that ORPHAN can be a 1-bit difference from REG. This
could be after BOOKMARK, but having a bit to differentiate littlefs
specific file types (BOOKMARK, ORPHAN) from normal file types (REG,
DIR) is nice.
I guess this could be used for SYMLINK if we ever want symlinks in the
future?
- 0x0314-0x0318 - Hole so that the mdir related tags (MROOT, MDIR,
MTREE) are nicely aligned.
This is probably a good place for file-related tags to go in the
future (BECKSUM, CID, COMPR), but we only have two slots, so will
probably run out pretty quickly.
- 0x3028 - Hole so that all btree related tags (BTREE, BRANCH, MTREE)
share a common lower bit-pattern.
I guess this could be used for MSHRUB if we ever want mshrubs in the
future?
I'm just not seeing a use case for optional compat flags (ocompat), so
dropping for now. It seems their *nix equivalent, feature_compat, is
used to inform fsck of things, but this doesn't really make sense in
littlefs since there is no fsck. Or from a different perspective,
littlefs is always running fsck.
Ocompat flags can always be added later (since they do nothing).
Unfortunately this really ruins the alignment of the tag encoding. For
whatever reason config limits tend to come in pairs. For now the best
solution is to just leave tag 0x0006 unused. I guess you can consider it
reserved for hypothetical ocompat flags in the future.
---
This adds an rcompat flag for the grm, since in theory a filesystem
doesn't need to support grms if it never renames files (or creates
directories?). But if a filesystem doesn't support grms and a grm gets
written into the filesystem, this can lead to corruption.
I think every piece of gstate will end up with its own compat flag for
this reason.
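
As a sketch of the idea (the flag values and names here are made up),
mount just rejects any rcompat bits it doesn't understand:

    #include <stdint.h>

    enum {
        LFS_ERR_INVAL = -22,
    };

    // rcompat flags we know how to handle; an implementation that never
    // renames files could in theory omit the grm flag
    #define LFSR_RCOMPAT_GRM        0x00000001
    #define LFSR_RCOMPAT_SUPPORTED  LFSR_RCOMPAT_GRM

    static int lfsr_mount_checkrcompat(uint32_t rcompatflags) {
        // an unknown rcompat bit means we don't fully understand the
        // filesystem and risk corrupting it, so refuse to mount
        if (rcompatflags & ~(uint32_t)LFSR_RCOMPAT_SUPPORTED) {
            return LFS_ERR_INVAL;
        }
        return 0;
    }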
---
Also renamed r/w/oflags -> r/w/ocompatflags to make their purpose
clearer.
---
The code impact of adding the grm rcompat flag is minimal, and will
probably be less for additional rcompat flags:
code stack
before: 31528 2752
after: 31584 (+0.2%) 2752 (+0.0%)
Also renamed lfsr_rbyd_compact -> lfsr_rbyd_appendcompaction to make
room for a more high-level lfsr_rbyd_compact and emphasize that this is
an append operation.
lfsr_rbyd_appendshrub was a common pattern emerging in mdir commit
related functions for moving shrubs around. It's just a useful function
to have.
lfsr_rbyd_compact, on the other hand, is only useful for evicting
bshrubs. Most other rbyd compaction code involves weird corner cases and
doesn't seem to generalize well (or at least I can't see it). But adding
lfsr_rbyd_compact is nice for consistency with the mdir commit/compact
functions. And it can be used by lfsr_bshrub_commit at least...
Code changes minimal:
code stack
before: 31596 2752
after: 31528 (-0.2%) 2752 (+0.0%)
Now that error recovery is well defined (at least in theory), and
high-level mdir/file functions create on-stack copies, the on-stack
copies for the low-level rbyd/btree/bshrub commit functions are
redundant and not useful.
Dropping the redundant on-stack copies in low-level functions saves a
bit of stack usage.
Additionally, we can adopt lfsr_rbyd_commit in more places where it was
previously avoided, dropping even more redundant on-stack copies.
code stack
before: 31676 2776
after: 31596 (-0.3%) 2752 (-0.9%)
Since we need access to the pending attr-list in lfsr_mdir_compact__
now, we might as well just do the pending commit. Worst case, a call to
lfsr_mdir_compact__ can provide a NULL attr-list for the previous
behavior (though we always follow up compaction with a commit).
The only downside is that this behavior is now a bit different from
lfsr_rbyd_compact. Need to revisit lfsr_rbyd_compact and see if it
should do the same.
This had a tiny improvement to code size, which I think is just the cost
of two function calls:
code stack
before: 31716 2776
after: 31676 (-0.1%) 2776 (+0.0%)
Turns out we don't need lfsr_ftree_(unbuffered)readnext. Even if we did,
the unbuffered behavior can be obtained by passing a NULL buffer to
lfsr_ftree_readnext.
Code changes minimal:
code stack
before: 31760 2776
after: 31716 (-0.1%) 2776 (+0.0%)
lfsr_ftree_t acts as a sort of proto-file type, holding enough
information for file reads/writes if the relevant mdir is known.
This lets low-level file write operations operate on a copy of the
proto-file without needing to copy the relevant mdir, file stuff, etc.
To make this work, lfsr_mdir_commit also needs to stage any bshrubs in
the attr-list, since these may not be in our opened file list, but this
is a good thing to handle implicitly anyways. We should only ever have
one untracked bshrub being operated on (multithreaded support would be a
whole other can of worms).
Unfortunately the extra machinery in lfsr_mdir_commit, and the fact
that passing two pointers around instead of one adds quite a bit of
code, mean this comes with a code cost. But the stack savings, and the
fact that there's no longer a risk of stack pointers in our opened file
list, make this probably worth it:
code stack
before: 31584 2824
after: 31760 (+0.6%) 2776 (-1.7%)
Not yet tested, thus the "in theory". Testing this is going to be a bit
tricky. Fortunately on-stack copies are a pretty resilient way to
recover from errors.
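
The pattern itself is simple enough, something like this with
placeholder types:

    // the on-stack-copy recovery pattern, with placeholder types; the
    // real mdir/file commit functions follow the same shape
    struct state {
        int weight;
        // ...
    };

    static int try_commit(struct state *s, int (*commit)(struct state *)) {
        // work on a copy so the caller's state is untouched if the
        // commit fails partway through
        struct state s_ = *s;
        int err = commit(&s_);
        if (err) {
            // the original still describes the last good on-disk state
            return err;
        }
        // only adopt the new state once everything succeeded
        *s = s_;
        return 0;
    }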
This comes with an unfortunate, but necessary, code/stack increase:
code stack
before: 31280 2736
after: 31584 (+1.0%) 2824 (+3.2%)
The main optimization here is to try to minimize the number of
individual btree/bshrub commits. We can't always perform
lfsr_btree_carve in a single commit unfortunately, due to unbounded
crystal fragments and needing to commit to different leaf rbyds, but we
can combine attrs into single commits more than we were previously.
This is especially important for bshrubs, where commits can trigger full
mdir compactions.
Additionally, the method we use for breaking up small blocks into
fragments has been changed to never "lose" the underlying block pointer
until the fragmentation is complete. This prevents the block pointer
misallocation bug found earlier.
In theory this won't be necessary once file writes create on-stack
bshrub copies for error recovery, but it's at least nice to prove that
this is possible in case we ever want to not maintain on-stack bshrub
copies (code savings?).
code stack
before: 31512 2648
after: 31280 (-0.7%) 2736 (+3.3%)
This is a nuanced optimization that is relatively unique to littlefs's
use case.
Because we're in a RAM constrained environment, it's not unreasonable
for whatever temporary buffer is used to write to a file to exceed the
write buffer allocated for the file. Since the temporary buffer is
temporary, it may be orders of magnitude larger than the file's write
buffer. If this happens, breaking up the write into write-buffer-sized
chunks so we can copy the data through the file's write buffer just
wastes IO.
In theory, littlefs should work just fine with a zero-sized file write
buffer, though this is not yet tested.
One interesting subtlety with the current implementation: we still
flush the file's write buffer even when we bypass it. In theory you can
avoid
flushing, but this risks strange write orders that could make low-level
write heuristics (such as the crystallization threshold) behave really
poorly.
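
With made-up names, the write path now looks something like this:

    #include <stddef.h>

    typedef struct fs fs_t;
    typedef struct file {
        size_t buffer_size;  // size of the file's own write buffer
        // ...
    } file_t;

    // stand-ins for the real buffered/unbuffered write paths
    int file_flush(fs_t *fs, file_t *file);
    int ftree_write(fs_t *fs, file_t *file, const void *buffer, size_t size);
    int buffered_write(fs_t *fs, file_t *file, const void *buffer, size_t size);

    static int file_write(fs_t *fs, file_t *file,
            const void *buffer, size_t size) {
        if (size >= file->buffer_size) {
            // the caller's buffer is at least as large as our write
            // buffer: flush what we have first (keeps write ordering
            // sane for the low-level heuristics), then write directly
            // from the caller's buffer, skipping the redundant copy
            int err = file_flush(fs, file);
            if (err) {
                return err;
            }
            return ftree_write(fs, file, buffer, size);
        }

        // small writes still go through the file's write buffer
        return buffered_write(fs, file, buffer, size);
    }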
---
Unfortunately the tests are now failing due to an interesting but
unrelated bug. It turns out bypassing our file buffer allows in-btree
block pointers to interact with heavily fragmented btrees for the first
time in our tests. This leads to an incorrect allocation of a block that
is in the previous copy of a file's btree during lfsr_btree_carve.
This isn't an issue for btree inner nodes. Btree commits happen
atomically, with the new btree being allocated on the stack and
protected by allocator checkpoints until completion.
But this is an issue for the block pointers, because lfsr_btree_carve is
not atomic and results in intermediary states where the block pointers
are lost.
What's a bit funny is that this commit is actually part of some
preparation to introduce temporary file copies during lfsr_file_write
for error recovery. This would mean the previous state of our file would
remain viewable by the block allocator, fixing this bug.
I need to think a bit more on if this is the correct solution (debugging
these multi-layer bugs turns my brain into mush), but I think it is, in
which case this bug was fixed before it was even discovered.
Fortunately these operations are heavily tested in test_dirs. The only
difference with files is the possibility for shrubs to need to be
copied.
Bugs fixed:
- It's counterintuitive, but lfsr_rbyd_appendcompactattr _can_ error
with LFS_ERR_RANGE when we are copying a shrub. This can happen if the
underlying mdir needs compaction itself.
- It's possible for null-trunk bshrubs to appear in our filesystem
traversal. Null-trunk bshrubs don't usually appear in any stable
state, but they are created by lfsr_bshrub_alloc and lfsr_btree_commit
to represent new, yet-uncommitted shrubs.
This gets a bit tricky because we also use null-trunks to indicate if
lfsr_btree_traversal has traversed the root. We can't rely on
bid >= weight for this because zero-weight btrees are allowed.
The solution here, though maybe temporary (famous last words), is to
treat null-trunk btrees as not having a root. Which isn't really true,
but null-trunk btree roots only exist between allocator checkpoints,
so they are allowed to be unreachable.
We really need more asserts that this is the case though... At least
added an assert that we never commit/read null trunks on disk.
POSIX is notoriously full of subtle and confusing nuances. Not through
any fault of POSIX, but as a result of trying to describe a complex
system with simple and easy to use operations.
Corner cases fixed here:
- rename("dir", "file") => ENOTDIR
This is the main surprise to me, and a mistake on my part. I thought
EISDIR would be appropriate for any renames with mismatched types,
since both involve a directory. It would be simpler code-wise, and
avoid ambiguity around whether "file" is not a dir, or some other file
exists in the file's path. But I guess ENOTDIR makes more sense if you
think of the destination as the target being operated on.
- remove("/") => EINVAL
- rename("/", "x") => EINVAL
- rename("x", "/") => ENOTEMPTY
- open("/") => EISDIR
It's a bit difficult to look up what error codes around root operations
should be, since they mostly end up as EPERM on modern systems, but
this doesn't really make sense for littlefs.
The solution chosen here is to prefer directory-related errors (EISDIR,
ENOTEMPTY) when possible, and fall back to EINVAL when the only issue
is that the target is the root directory.
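
Written as test-style expectations, assuming the usual lfs_* API and
that "dir", "file", and "x" already exist:

    #include <assert.h>
    #include "lfs.h"

    static void check_root_errors(lfs_t *lfs) {
        assert(lfs_rename(lfs, "dir", "file") == LFS_ERR_NOTDIR);
        assert(lfs_remove(lfs, "/") == LFS_ERR_INVAL);
        assert(lfs_rename(lfs, "/", "x") == LFS_ERR_INVAL);
        assert(lfs_rename(lfs, "x", "/") == LFS_ERR_NOTEMPTY);

        lfs_file_t file;
        assert(lfs_file_open(lfs, &file, "/", LFS_O_RDONLY) == LFS_ERR_ISDIR);
    }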
Also I tweaked lfsr_mtree_pathlookup a bit so mid=0 indicates the target
is the root and mid=-1 indicates the target can't be created (because of
a missing directory). I think using mid=0 for the latter is a leftover
from when mid=-1 was a bit of a mess...
The previous encoding was a bit problematic with our linear and log
heuristics, which can grow thousands of powerlosses deep. You know you
have a problem when you're copying a test id that spans a dozen lines.
It also meant we were spending O(n^2) time just encoding powerloss ids:
before: 942.68s
after: 921.94s (-2.2%)
This new encoding takes advantage of the unused characters in our leb16
encoding, with an 'x' prefix indicating linear-heuristic powerlosses
and a 'y' prefix indicating log-heuristic powerlosses ('w' is used for
negative leb16s).
Before:
- explicit: 42q2q2
- linear: 123456789abcdefg1h1i1j1k1l1m1n1o1p1q1r1s1t1u1v1
- log: 1248g1g2g4g8gg1gg2gg4gg8
After:
- explicit: 42q2q2
- linear: xg2
- log: yc
This turned out to not be all that useful.
Tests already take quite a bit to run, which is a good thing! We have a
lot of tests! 942.68s or ~15 minutes of tests at the time of writing to
be exact. But simply multiplying the number of tests by some number of
geometries is heavy handed and not a great use of testing time.
Instead, tests where different geometries are relevant can parameterize
READ_SIZE/PROG_SIZE/BLOCK_SIZE at the suite level where needed. The
geometry system was just another define parameterization layer anyways.
Testing different geometries can still be done in CI by overriding the
relevant defines anyways, and it _might_ be interesting there.
Since we were only registering our inotify reader after the previous
operation completed, it was easy to miss modifications that happened
faster than our scripts. With our scripts being in Python, this happened
quite often and made it hard to trust the current state of scripts run
with --keep-open, sort of defeating the purpose of --keep-open...
I think previously this race condition wasn't avoided because of the
potential to loop indefinitely if --keep-open referenced a file that the
script itself modified, but it's up to the user to avoid this if it is
an issue.
---
Also while fixing this, I noticed our use of the inotify_simple library
was leaking file descriptors everywhere! I just wasn't closing any
inotify objects at all. A bit concerning since scripts with --keep-open
can be quite long lived...
It turns out permanent root bookmark creates some rather interesting
constraints on our mtree:
1. We can never delete all mids, since at least one mid needs to exist
to represent the root's bookmark.
2. We can never revert to an inlined mdir after uninlining, since our
root bookmark always exists to stop this. This is an unfortunate
downside as it would be nice to be able to reinline mdirs, but not
the end of the world.
This restricts what operations are possible, and transitively, what we
can test.
This commit drops the removal of root bookmarks in test_mtree, which was
a workaround to keep tests from an early implementation running. This was
preventing some minor optimizations. This required dropping some tests,
but these tests tested operations that aren't really possible in
practice.
Dropping the removal of root bookmarks allowed for a minor optimization
in lfsr_mdir_drop, and may lead to more in the future (or maybe just
stricter asserts):
code stack
before: 31280 2648
after: 31208 (-0.2%) 2648 (+0.0%)
The logic is simpler with lfsr_mdir_drop being an implicit side-effect
of lfsr_mdir_commit, and right now a simpler system is preferred over a
complex one.
I think it will be more useful to focus on reducing the stack overhead
of lfsr_mdir_drop in general.
Still, it may be worth reverting this revert in the future.
This directly trades code cost for stack cost, by moving the stack
overhead of lfsr_mdir_drop out of the hot-path. Currently, the stack
hot-path is the mdir commit needed to flush file bshrubs, and this can
never drop an mdir:
code stack
before: 31280 2648
after: 31312 (+0.1%) 2520 (-4.8%)
This ended up being much less of a simplification than I hoped it would.
It's still easier/more efficient to revert to a relocation in most cases
when dropping in an mdir split, and the small gain from simplifying how
drops/commits interact is overshadowed by the code duplication necessary
to separate lfsr_mdir_drop out from lfsr_mdir_commit:
code stack
before: 30952 2528
after: 31280 (+1.1%) 2648 (+4.7%)
Still, this does at least simplify the logical corner cases (we don't
need to abort commits when droppable anymore), and lfsr_mdir_drop is
ultimately necessary for supporting lazy file creation.
Also, having a fix-orphans step during mount gives other littlefs
implementations the option to create orphaned mdirs without compat
issues. So this ends up being the more flexible approach.
It _might_ be worth having both eager mdir drops and an explicit
lfsr_mdir_drop for lazy file creation in the future, but I doubt this
will end up worth the code duplication...
---
Oh right, I forgot to actually describe this change.
This trades eager mdir drops:
1. Drop mdirs from the mtree immediately as soon as their weight goes
to zero.
For lazy mdir drops:
1. Drop mdirs from the mtree in a second commit.
2. Scan and drop orphaned mdirs on the first write after mount.
This sounds very similar to the previous "deorphan" scan, which risked
an extreme performance cost during mount, but it should be noted this
orphan scan only needs to touch every mdir once. This makes it no worse
than the overhead of actually mounting the filesystem.
We can also keep an eye out for orphaned mdirs when we mount, so no
extra scan is needed unless there was an unlucky powerloss.
Eager mdir dropping sounds simpler, but thanks to deferred commits
introduces some subtle complexity around aborting commits that would
drop an mdir to zero. Remember commits are viewable on-disk as soon as a
commit completes.
In _theory_, lazy mdir drops simplify the logic around committing to
mdirs.
Though the real kicker is that lazy mdir drops are required for lazy file
creation.
The current idea for lazy file creation involves tracking mid-less
opened-but-not-yet-created files. These files can have bshrubs, so they
need space on an mdir somewhere. But they aren't actually created yet,
so they don't have an mid.
This is fine (though it's probably going to be tricky) as long as we
allocate an mid on file sync, but there is always a risk of losing power
with mdirs that contain only RAM-backed files. Fortunately, no-mids
means no orphaned files, but it does mean orphaned mdirs with no synced
contents.
Long story short, lazy mdir drops are currently a necessary evil, and a
logical simplification, that unfortunately comes with some cost.
This saves a bit of stack space in theory, but after the redund block
restructure this amounts to a single word.
But lfsr_mdir_commit__ isn't on the hot path anyways, so this doesn't
even matter...
code stack
before: 30948 2528
after: 30952 (+0.0%) 2528 (+0.0%)
The idea here is to revert moving redund blocks into lfsr_rbyd_t, and
instead just keep a redundant copy of the rbyd blocks in the redund
blocks in lfsr_mdir_t.
Surprisingly, extra overhead in lfsr_mdir_t ended up with worse stack
usage than extra overhead in lfsr_rbyd_t. I guess we end up allocating
more mdirs than rbyds, which makes a bit of sense given how complicated
lfsr_mdir_commit is:
code stack structs
redund union: 30976 2496 1072
redund in rbyd: 30948 (-0.1%) 2528 (+1.3%) 1100 (+2.6%)
redund in mdir: 31000 (+0.1%) 2536 (+1.6%) 1092 (+1.8%)
The mdir option does seem to improve struct overhead, but this hasn't
been a reliable measurement since it doesn't take into account how many
of each struct is allocated.
Given that the mdir option is inferior in both code and stack cost, and
requires more care to keep the rbyd/redund blocks in sync, I think I'm
going to revert this for now but keep the commit in the commit history
since it's an interesting comparison.
This simplifies dependent structs with redundancy, mainly lfsr_mdir_t,
at a significant RAM cost:
code stack structs
before: 30976 2496 1072
after: 30948 (-0.1%) 2528 (+1.3%) 1100 (+2.6%)
Which, to be honest, is not as bad as I thought it would be. Though it
is still pretty bad for no new features.
The motivation for this change:
1. The organization of the previous lfsr_mdir_t struct was a bit hacky
and relied on exact padding so the redund block array and rbyd block
lined up at the right offset.
2. The previous organization prevented theoretical "read-only rbyd
structs" that could omit write-related fields, e.g. eoff and cksum.
This idea is currently unused.
3. The current mdir=level-1, btree/data=level-0 redund design makes this
RAM tradeoff pretty bad, but in theory higher btree redund levels
would need the extra redund blocks in the rbyd struct anyways.
Still, the RAM impact to the current default configuration means this
should probably be reverted...
- Renamed mdir->u.m to mdir->u.mdir.
- Prefer mdir->u.rbyd.* where possible.
- Changed file/dir mdirs to be stored directly, requiring a cast to
lfsr_openedmdir_t to enroll in the opened mdir list.
Reverted to one set of signed lfsr_mid_rid/bid functions, and tried to
make their usage more consistent.
We have two ways to compare mdirs now, lfsr_mdir_cmp (compares block
addresses) and lfsr_mdir_bid (compares mids), and it's not very clear
when to use which one. lfsr_mdir_cmp is a bit more robust in weird mid
cases (mainly inlined mdirs when mroot mid=-1), so currently preferring
that.
Also did some bit twiddling to preserve mid=-1 => bid=-1 and rid=-1,
which saves a bit of code:
code stack
before: 31056 2488
after: 30972 (-0.3%) 2496 (+0.3%)
There's only one mtree in a given filesystem. With the recent
lfsr_mdir_commit restructure, it makes more sense for the mtree to be
implicit.
code stack
before: 31096 2480
after: 31016 (-0.3%) 2480 (+0.0%)
Originally, the intention of this rework was to make it possible to
shrub the mtree, i.e. allow an mshrub, i.e. allow the root rbyd of the
mtree to be inlined in the mroot.
This would allow small mtrees (2, 3, etc mdirs) to save the block that
would otherwise be needed for the mtree's root.
But as the mshrub was progressing, minor problems kept unfolding, and
ultimately I've decided to shelve the idea of mshrubs for now. They add
quite a bit of complexity for relatively little gain:
- bshrubs are just complicated to update. They require a call to
lfsr_mdir_commit to update the inlined-root, which is a bit of a
problem when your mshrub needs to be updated inside lfsr_mdir_commit,
and your system disallows recursion...
Recursion _can_ be avoided by separate bshrub commit variants that go
through either lfsr_mdir_commit or lfsr_mdir_commit_, but this
complicates things and requires some code duplication, weakening the
value of reusing the bshrub data-structure.
- It's not always possible to compact the mshrub's backing mroot when
we need to modify the mshrub.
If an mroot becomes full and needs to split, for example, we need to
allocate the new mdirs, update the (new) mshrub, and then commit
everything into the mroot when we compact. But the "update the (new)
mshrub" step can't be done until after we compact, because the mroot
is by definition full.
This _can_ also be worked around, by building an attr list containing
all of the mshrub changes, and committing the mshrub/mroot changes in
the same transaction, but this complicates things and increases the
stack cost for the current hot-path.
- Every shrub needs a configurable shrub size, and the mshrub is no
exception. This adds another config option and complicates shared
shrub eviction code.
- The value for mshrubs is not actually that great.
Unlike file bshrubs, there's only one mshrub in the filesystem, and
I'm not sure there's a situation where a filesystem has >1 mdirs and
the exact number of allocated blocks is critical.
And this complexity is reflected in code cost and robustness, not to
mention developer time. I think for littlefs this is just not worth
doing. At least not now.
We can always introduce mshrubs in a backwards compatible manner if
needed.
---
But this rework did lead to better code organization around mdir commits
and how they update the mtree/mroot, so I'm keeping those changes.
In general lfsr_mdir_commit has been broken up into mtree/mroot specific
functions that _do_ propagate in-device changes. Any commit to the mroot
changes the on-disk state of the filesystem anyways, so the mroot commit
_must_ be the last thing lfsr_mdir_commit does.
This leads to some duplicated updates, but that's not really a problem.
Here's the new call graph inside lfsr_mdir_commit:
lfsr_mdir_commit
.---------' | | | '-----------------.
v | | '-----------------. |
lfsr_mtree_commit | '--------. | |
'---------. | | | |
v v | | |
lfsr_mroot_commit | | |
| '--------. | | |
| v v | |
| lfsr_mdir_commit_ | |
| .--------' '--------. | |
| | .-----------------|-' |
v v v v v
lfsr_mdir_commit__ lfsr_mdir_compact__
This rework didn't really impact code/stack that much. It added a bit of
code, but saved a bit of RAM. The real value is that the narrower-scoped
functions contain more focused logic:
code stack
before: 30780 2504
after: 31096 (+1.0%) 2480 (-1.0%)
This is just a useful type to have to make the code a bit more
readable.
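
Something along these lines (the exact definition in lfs.c may differ):

    #include <stdint.h>

    typedef uint32_t lfs_block_t;

    // an mdir is backed by a pair of blocks, so a "metadata pointer" is
    // really just that pair
    typedef struct lfsr_mptr {
        lfs_block_t blocks[2];
    } lfsr_mptr_t;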
This doesn't affect the code that much, except we are making more
on-stack copies of mptrs since the mdir doesn't technically contain
a mutable mptr. Maybe this should change?
code stack
before: 30768 2496
after: 30776 (+0.0%) 2504 (+0.3%)
- There was a lingering strict pcache assert in lfs_bd_erase. Very
unlikely to hit, but it is possible and shouldn't be an assert now
that pcache can be left in an arbitrary state. That being said, it
was asserting on an actual bug in this case.
- Our btree traversal was not traversing the roots of zero-weight
btrees. Zero-weight btrees can happen as an intermediary step during
btree/bshrub carving. If the stars align with the block allocator and
intermediary carving states this can cause incorrect block
allocations.
- Staged updates to bsprouts/bshrubs need to be played out before
updates to opened mdirs in lfsr_mdir_commit. This is just because
lfsr_file_isbsprout/isbshrub depend on mdir.block, and updating the
mdirs first corrupts this.
Maybe a different organization of this code would be useful; it is
already full of TODOs.
It turned out that by implicitly handling root allocation in
lfsr_btree_commit_, we were never allowing lfsr_bshrub_commit to
intercept new roots as new bshrubs. Fixing this required moving the
root allocation logic up into lfsr_btree_commit.
This resulted in quite a bit of small bug fixing because it turns out if
you can never create non-inlined bshrubs you never test non-inlined
bshrubs:
- Our previous rbyd.weight == btree.weight check for whether we've
reached the root no longer works; changed to an explicit check that the
blocks match. Fortunately, now that new roots set trunk=0, new roots
are no longer a problematic case.
- We need to only evict when we've calculated an accurate estimate; the
previous code had a bug where eviction occurred early based only on the
progged-since-last-estimate.
- We need to manually set bshrub.block=mdir.block on new bshrubs,
otherwise the lfsr_bshrub_isbshrub check fails in mdir commit staging.
Also updated btree/bshrub following code in the dbg scripts, which
mostly meant making them accept both BRANCH and SHRUBBRANCH tags as
btree/bshrub branches. Conveniently very little code needs to change
to extend btree read operations to support bshrubs.
Unfortunately, waiting to evict shrubs until mdir compaction does not
work because we only have a single pcache. When we evict a bshrub we
need a pcache for writing the new btree root, but if we do this during
mdir compaction, our pcache is already busy handling the mdir
compaction. We can't do a separate pass for bshrub eviction, since this
would require tracking an unbounded number of new btree roots.
In the previous shrub design, we meticulously tracked the compacted
shrub estimate in RAM, determining exactly how the estimate would change
as a part of shrub carve operations.
This worked, but was fragile. It was easy for the shrub estimate to
diverge from the actual value, and required quite a bit of extra code to
maintain. Since the use cases for bshrubs are growing a bit, I didn't
want to return to this design.
So here's a new approach based on emulating btree compacts/splits inside
the shrubs:
1. When a bshrub is fetched, scan the bshrub and calculate a compaction
estimate. Store this.
2. On every commit, find the upper bound of new data being progged, and
keep track of estimate + progged. We can at least get this relatively
easily from commit attr lists. We can't get the amount deleted, which
is the problem.
3. When estimate + progged exceeds shrub_size, scan the bshrub again and
recalculate the estimate.
4. If estimate exceeds the shrub_size/2, evict the bshrub, converting it
into a btree.
As you may note, this is very close to how our btree compacts/splits
work, but emulated. In particular, evictions/splits occur at half the
shrub_size/block_size (respectively) in order to avoid runaway costs
when the bshrub/btree gets close to full.
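
With made-up names, the per-commit check looks something like this:

    #include <stddef.h>

    typedef struct shrub {
        size_t estimate;  // compaction estimate from the last full scan
        size_t progged;   // upper bound of bytes progged since that scan
    } shrub_t;

    // stand-ins for the full rescan and for converting a too-big shrub
    // into a proper btree
    size_t shrub_estimate(shrub_t *shrub);
    int shrub_evict(shrub_t *shrub);

    static int shrub_commit_check(shrub_t *shrub,
            size_t commit_size, size_t shrub_size) {
        // 2. track an upper bound of new data progged by this commit
        shrub->progged += commit_size;

        // 3. once estimate + progged exceeds shrub_size, rescan for an
        // accurate estimate (this also fixes any divergence)
        if (shrub->estimate + shrub->progged > shrub_size) {
            shrub->estimate = shrub_estimate(shrub);
            shrub->progged = 0;

            // 4. if the accurate estimate still exceeds shrub_size/2,
            // evict the shrub into a btree, mirroring how btree splits
            // kick in at ~1/2 full
            if (shrub->estimate > shrub_size/2) {
                return shrub_evict(shrub);
            }
        }

        return 0;
    }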
Benefits:
- This eviction heuristic is very robust. Calculating the amount progged
from the attr list is relatively cheap and easy, and any divergence
should be fixed when we recalculate the estimate.
- The runtime cost is relatively small, amortized O(log n) which is
the existing runtime to commit to rbyds.
Downsides:
- Just like btree splits, evictions force our bshrub to be ~1/2 full on
average. This combined with the 2x cost for mdir pairs, the 2x cost
for mdirs being ~1/2 full on average, and the need for both a synced
and unsynced copy of file bshrubs brings our file bshrub's overhead up
to ~16x, which is getting quite high...
Anyways, bshrubs now work, and the new file topology is passing testing.
An unfortunate surprise is the jump in stack cost. This seems to come from
moving the lfsr_btree_flush logic into the hot-path that includes bshrub
commit + mdir commit + all the mtree logic. Previously the separation of
btree/shrub commits meant that the more complex block/btree/crystal logic
was on a separate path from the mdir commit logic:
code stack lfsr_file_t
before bshrubs: 31840 2072 120
after bshrubs: 30756 (-3.5%) 2448 (+15.4%) 104 (-15.4%)
I _think_ the reality is not actually as bad as measured; most of these
flush/carve/commit functions calculate some work and then commit it in
separate steps. In theory GCC's shrink-wrapping optimizations should
limit the stack to only what we need as we finish different
calculations, but our current stack measurement scripts just add
together the whole frames, so any per-call stack optimizations get
missed...
Note this is intentionally different from how lfsr_rbyd_fetch behaves
in lfs.c. We only call lfsr_rbyd_fetch when we need validated checksums,
otherwise we just don't fetch.
The dbg scripts, on the other hand, always go through fetch, but it is
useful to be able to inspect the state of incomplete trunks when
debugging.
This used to be how the dbg scripts behaved, but they broke because of
some recent script work.
As a part of the general redesign of files, all files, not just small
files, can inline some data directly in the metadata log. Originally,
this was a single piece of inlined data or an inlined tree (shrub) that
effectively acted as an overlay over the block/btree data.
This is now changed so that when we have a block/btree, the root of the
btree is inlined. In effect making a full btree a sort of extended
shrub.
I'm currently calling this a "geoxylic btree", since that seems to be a
somewhat related botanical term. Geoxylic btrees have, at least on
paper, a number of benefits:
- There is a single lookup path instead of two, which simplifies code a
bit and decreases lookup costs.
- One data structure instead of two also means lfsr_file_t requires
less RAM, since all of the on-disk variants can go into one big union.
Though I'm not sure this is very significant vs stack/buffer costs.
- The write path is much simpler and has less duplication (it was
difficult to deduplicate the shrub/btree code because of how the
shrub goes through the mdir).
In this redesign, lfsr_btree_commit_ leaves root attrs uncommitted,
allowing lfsr_bshrub_commit to finish the job via lfsr_mdir_commit.
- We don't need to maintain a shrub estimate, we just lazily evict trees
during mdir compaction. This has a side-effect of allowing shrubs to
temporarily grow larger than shrub_size before eviction.
NOTE THIS (fundamentally?) DOESN'T WORK
- There is no awkwardly high overhead for small btrees. The btree root
for two-block files should be able to comfortably fit in the shrub
portion of the btree, for example.
- It may be possible to also make the mtree geoxylic, which should
reduce storage overhead of small mtrees and make better use of the
mroot.
All of this being said, things aren't working yet. Shrub eviction during
compaction runs into a problem with a single pcache -- how do we write
the new btrees without dropping the compaction pcache? We can't evict
btrees in a separate pass because their number is unbounded...
Also limited block_size/block_count updates to only happen when the
configured value is None. This matches dbgbmap.py.
Basically just a cleanup of some bugs after the rework related to
matching dbgbmap.py. Unfortunately these scripts have too much surface
area and no tests...