Much like the erased-state checksums in our rbyds (ecksums), these
block-level erased-state checksums (becksums) allow us to detect failed
progs to erased parts of a block and are key to achieving efficient
incremental write performance with large blocks and frequent power
cycles/open-close cycles.
These are also key to achieving _reasonable_ write performance for
simple writes (linear, non-overwriting), since littlefs now relies
solely on becksums to efficiently append to blocks.
Though I suppose the previous block staging logic used with the CTZ
skip-list could be brought back to make becksums optional and avoid
btree lookups during simple writes (we do a _lot_ of btree
lookups)... I'll leave this open as a future optimization...
Unlike in-rbyd ecksums, becksums need to be stored out-of-band so our
data blocks only contain raw data. Since they are optional, an
additional tag in the file's btree makes sense.
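As a rough illustration of the idea, and purely as an assumption about
the encoding (the actual tag layout in lfs.c may differ), such a becksum
attr only needs to carry how much of the erased region it covers plus the
checksum itself:

    // a rough sketch of an out-of-band becksum attr, names/layout are
    // hypothetical and may not match lfs.c
    #include "lfs.h"  // for lfs_size_t

    typedef struct lfsr_becksum {
        lfs_size_t esize;  // how many erased bytes the checksum covers
        uint32_t cksum;    // checksum of the erased, not-yet-progged bytes
    } lfsr_becksum_t;

On append, the erased region can be re-checksummed and compared against
the stored becksum before progging, catching progs that failed partway
without re-reading the whole block.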
Becksums are relatively simple, but they bring some challenges:
1. Adding becksums to file btrees is the first case we have for multiple
struct tags per btree id.
This isn't too complicated a problem, but requires some new internal
btree APIs.
Looking forward, which I probably shouldn't be doing this often,
multiple struct tags will also be useful for parity and content ids
as a part of data redundancy and data deduplication, though I think
it's uncontroversial to consider both of these heavier-weight features...
2. Becksums only work if unfilled blocks are aligned to the prog_size.
This is the whole point of crystal_size -- to provide temporary
storage for unaligned writes -- but actually aligning the block
during writes turns out to be a bit tricky without a bunch of
unnecessary btree lookups (we already do too many btree lookups!).
The current implementation here discards the pcache to force
alignment, taking advantage of the requirement that
cache_size >= prog_size, but this is corrupting our block checksums.
Code cost:
          code           stack
before:   31248          2792
after:    32060 (+2.5%)  2864 (+2.5%)
Also lfsr_ftree_flush needs work. I'm usually open to gotos in C when
they improve internal logic, but even for me, the multiple goto jumps
from every left-neighbor lookup into the block writing loop are a bit
much...
Looking forward, bptr checksums provide an easy mechanism to validate
data residing in blocks. This extends the merkle-tree-like nature of the
filesystem all the way down to the data level, and is common in other
COW filesystems.
Two interesting things to note:
1. We don't actually check data-level checksums yet, but we do calculate
data-level checksums unconditionally.
Writing checksums is easy, but validating checksums is a bit more
tricky. This is made a bit harder for littlefs, since we can't hold
an entire block of data in RAM, so we have to choose between separate
bus transactions for checksum + data reads, or extremely expensive
overreads on every read.
Note this already exists at the metadata level: the separate bus
transactions for rbyd fetch + rbyd lookup mean we _are_ susceptible
to a very small window where bit errors can get through.
But anyways, writing checksums is easy, and has basically no cost
since we are already processing the data for our write. So we might
as well write the data-level checksums at all times, even if we
aren't validating at the data level.
2. To make bptr checksums work cheaply we need an additional cksize
field to indicate how much data is checksummed.
This field seems redundant when we already have the bptr's data size,
but if we didn't have this field, we would be forced to recalculate
the checksum every time a block is sliced. This would be
unreasonable.
The immutable cksize field does mean we may be checksumming more data
than we need to when validating, but we should be avoiding small
block slices anyways for storage cost reasons.
This does add some stack cost because our bptr struct is larger now:
          code           stack
before:   31200          2768
after:    31272 (+0.2%)  2800 (+1.1%)
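To make the cksize tradeoff concrete, here's a rough sketch of what a
bptr with the extra field might look like; the actual lfsr_bptr_t layout
in lfs.c may differ:

    // hypothetical bptr layout, for illustration only
    typedef struct lfsr_bptr {
        // lfsr_data_t: the block + offset + size slice type in lfs.c
        lfsr_data_t data;
        lfs_size_t cksize;  // how much of the block cksum covers, immutable
        uint32_t cksum;     // checksum over the first cksize bytes
    } lfsr_bptr_t;

Slicing only shrinks data's offset/size; cksize/cksum stay untouched, at
the cost of sometimes checksumming more than the slice strictly needs
when validating.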
Instead of writing every possible config that has the potential to be
useful in the future, stick to just writing the configs that we know are
useful, and error if we see any configs we don't understand.
This prevents unnecessary config bloat, while still allowing configs to
be introduced in a backwards compatible way in the future.
Currently, unknown configs are treated as a mount error, but in theory
you could still try to read the filesystem, just with potentially
corrupted data. Maybe this could be behind some sort of "FORCE" mount
flag. littlefs must never write to the filesystem if it finds unknown
configs.
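A minimal sketch of this rule, assuming a lookupnext-style iteration over
the config range (the helpers, tag names, and signatures here are
placeholders, not the actual lfs.c API):

    // scan the mroot's config tags, erroring on anything unknown
    static int lfsr_mount_checkconfigs(lfs_t *lfs, lfsr_rbyd_t *mroot) {
        lfsr_tag_t tag = LFSR_TAG_CONFIG;  // hypothetical start of range
        while (true) {
            int err = lfsr_rbyd_lookupnext(lfs, mroot, tag+1, &tag);
            if (err == LFS_ERR_NOENT) {
                return 0;
            }
            if (err) {
                return err;
            }
            if (!lfsr_tag_isconfig(tag)) {
                // past the config range, all configs were known
                return 0;
            }
            if (!lfsr_tag_isknownconfig(tag)) {
                // unknown config => mount error, we must never write to
                // a filesystem we only partially understand
                return LFS_ERR_INVAL;
            }
        }
    }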
---
This also creates a curious case for the hole in our tag encoding
previously taken up by the OCOMPATFLAGS config. We can query for any
config > SIZELIMIT with lookupnext, but the OCOMPATFLAGS flag would need
an extra lookup which just isn't worth it.
Instead I'm just adding OCOMPATFLAGS back in. To support OCOMPATFLAGS
littlefs has to do literally nothing, so this is really more of a
documentation change. And who knows, maybe OCOMPATFLAGS will have some
weird use case in the future...
Also:
- Renamed GSTATE -> GDELTA for gdelta tags. GSTATE tags added as
separate in-device flags. The GSTATE tags were already serving
this dual purpose.
- Renamed BSHRUB* -> SHRUB when the tag is not necessarily operating
on a file bshrub.
- Renamed TRUNK -> BSHRUB
The tag encoding space now has a couple funky holes:
- 0x0005 - Hole for aligning config tags.
I guess this could be used for OCOMPATFLAGS in the future?
- 0x0203 - Hole so that ORPHAN can be a 1-bit difference from REG. This
could be after BOOKMARK, but having a bit to differentiate littlefs
specific file types (BOOKMARK, ORPHAN) from normal file types (REG,
DIR) is nice.
I guess this could be used for SYMLINK if we ever want symlinks in the
future?
- 0x0314-0x0318 - Hole so that the mdir related tags (MROOT, MDIR,
MTREE) are nicely aligned.
This is probably a good place for file-related tags to go in the
future (BECKSUM, CID, COMPR), but we only have two slots, so we'll
probably run out pretty quickly.
- 0x3028 - Hole so that all btree related tags (BTREE, BRANCH, MTREE)
share a common lower bit-pattern.
I guess this could be used for MSHRUB if we ever want mshrubs in the
future?
I'm just not seeing a use case for optional compat flags (ocompat), so
dropping for now. It seems their *nix equivalent, feature_compat, is
used to inform fsck of things, but this doesn't really make sense in
littlefs since there is no fsck. Or from a different perspective,
littlefs is always running fsck.
Ocompat flags can always be added later (since they do nothing).
Unfortunately this really ruins the alignment of the tag encoding. For
whatever reason config limits tend to come in pairs. For now the best
solution is to just leave tag 0x0006 unused. I guess you can consider it
reserved for hypothetical ocompat flags in the future.
---
This adds an rcompat flag for the grm, since in theory a filesystem
doesn't need to support grms if it never renames files (or creates
directories?). But if a filesystem doesn't support grms and a grm gets
written into the filesystem, this can lead to corruption.
I think every piece of gstate will end up with its own compat flag for
this reason.
---
Also renamed r/w/oflags -> r/w/ocompatflags to make their purpose
clearer.
---
The code impact of adding the grm rcompat flag is minimal, and will
probably be less for additional rcompat flags:
          code           stack
before:   31528          2752
after:    31584 (+0.2%)  2752 (+0.0%)
This turned out to not be all that useful.
Tests already take quite a while to run, which is a good thing! We have a
lot of tests! 942.68s or ~15 minutes of tests at the time of writing, to
be exact. But simply multiplying the number of tests by some number of
geometries is heavy-handed and not a great use of testing time.
Instead, tests where different geometries are relevant can parameterize
READ_SIZE/PROG_SIZE/BLOCK_SIZE at the suite level where needed. The
geometry system was just another define parameterization layer anyways.
Testing different geometries can still be done in CI by overriding the
relevant defines anyways, and it _might_ be interesting there.
Since we were only registering our inotify reader after the previous
operation completed, it was easy to miss modifications that happened
faster than our scripts. Since our scripts are in Python, this happened
quite often and made it hard to trust the current state of scripts
with --keep-open, sort of defeating the purpose of --keep-open...
I think previously this race condition wasn't avoided because of the
potential to loop indefinitely if --keep-open referenced a file that the
script itself modified, but it's up to the user to avoid this if it is
an issue.
---
Also while fixing this, I noticed our use of the inotify_simple library
was leaking file descriptors everywhere! I just wasn't closing any
inotify objects at all. A bit concerning since scripts with --keep-open
can be quite long lived...
It turned out that by implicitly handling root allocation in
lfsr_btree_commit_, we were never allowing lfsr_bshrub_commit to
intercept new roots as new bshrubs. Fixing this required moving the
root allocation logic up into lfsr_btree_commit.
This resulted in quite a bit of small bug fixing because it turns out if
you can never create non-inlined bshrubs you never test non-inlined
bshrubs:
- Our previous rbyd.weight == btree.weight check for whether we've reached
the root no longer works; changed to an explicit check that the blocks
match. Fortunately, now that new roots set trunk=0, new roots are no
longer a problematic case.
- We need to only evict when we calculate an accurate estimate; the
previous code had a bug where eviction occurred early based only on the
progged-since-last-estimate.
- We need to manually set bshrub.block=mdir.block on new bshrubs,
otherwise the lfsr_bshrub_isbshrub check fails in mdir commit staging.
Also updated btree/bshrub following code in the dbg scripts, which
mostly meant making them accept both BRANCH and SHRUBBRANCH tags as
btree/bshrub branches. Conveniently very little code needs to change
to extend btree read operations to support bshrubs.
Unfortunately, waiting to evict shrubs until mdir compaction does not
work because we only have a single pcache. When we evict a bshrub we
need a pcache for writing the new btree root, but if we do this during
mdir compaction, our pcache is already busy handling the mdir
compaction. We can't do a separate pass for bshrub eviction, since this
would require tracking an unbounded number of new btree roots.
In the previous shrub design, we meticulously tracked the compacted
shrub estimate in RAM, determining exactly how the estimate would change
as a part of shrub carve operations.
This worked, but was fragile. It was easy for the shrub estimate to
diverge from the actual value, and it required quite a bit of extra code
to maintain. Since the use cases for bshrubs are growing a bit, I didn't
want to return to this design.
So here's a new approach based on emulating btree compacts/splits inside
the shrubs:
1. When a bshrub is fetched, scan the bshrub and calculate a compaction
estimate. Store this.
2. On every commit, find the upper bound of new data being progged, and
keep track of estimate + progged. We can at least get this relatively
easily from commit attr lists. We can't get the amount deleted, which
is the problem.
3. When estimate + progged exceeds shrub_size, scan the bshrub again and
recalculate the estimate.
4. If the estimate exceeds shrub_size/2, evict the bshrub, converting it
into a btree.
As you may note, this is very close to how our btree compacts/splits
work, but emulated. In particular, evictions/splits occur at
(shrub_size/block_size)/2 in order to avoid runaway costs when the
bshrub/btree gets close to full.
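A rough sketch of steps 2-4, where the field and helper names (estimate,
progged, lfsr_bshrub_estimate, cfg->shrub_size) are hypothetical
stand-ins rather than whatever lfs.c actually uses:

    #include <stdbool.h>

    // returns true if the bshrub should be evicted into a real btree,
    // or a negative error code
    static int lfsr_bshrub_shouldevict(lfs_t *lfs,
            lfsr_bshrub_t *bshrub, lfs_size_t progged) {
        // 2. track an upper bound of data progged since the last estimate
        bshrub->progged += progged;
        // 3. only rescan the bshrub once estimate + progged exceeds
        // shrub_size
        if (bshrub->estimate + bshrub->progged > lfs->cfg->shrub_size) {
            lfs_ssize_t estimate = lfsr_bshrub_estimate(lfs, bshrub);
            if (estimate < 0) {
                return estimate;
            }
            bshrub->estimate = estimate;
            bshrub->progged = 0;
            // 4. evict once the real estimate exceeds shrub_size/2,
            // mirroring how btree compacts/splits trigger at ~1/2 full
            if (bshrub->estimate > lfs->cfg->shrub_size/2) {
                return true;
            }
        }
        return false;
    }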
Benefits:
- This eviction heuristic is very robust. Calculating the amount progged
from the attr list is relatively cheap and easy, and any divergence
should be fixed when we recalculate the estimate.
- The runtime cost is relatively small, amortized O(log n) which is
the existing runtime to commit to rbyds.
Downsides:
- Just like btree splits, evictions force our bshrub to be ~1/2 full on
average. This, combined with the 2x cost for mdir pairs, the 2x cost
for mdirs being ~1/2 full on average, and the need for both a synced
and unsynced copy of file bshrubs, brings our file bshrub's overhead up
to ~16x, which is getting quite high...
Anyways, bshrubs now work, and the new file topology is passing testing.
An unfortunate surprise is the jump in stack cost. This seems to come from
moving the lfsr_btree_flush logic into the hot path that includes bshrub
commit + mdir commit + all the mtree logic. Previously the separation of
btree/shrub commits meant that the more complex block/btree/crystal logic
was on a separate path from the mdir commit logic:
                 code           stack           lfsr_file_t
before bshrubs:  31840          2072            120
after bshrubs:   30756 (-3.5%)  2448 (+15.4%)   104 (-15.4%)
I _think_ the reality is not actually as bad as measured; most of these
flush/carve/commit functions calculate some work and then commit it in
separate steps. In theory GCC's shrinkwrapping optimizations should
limit the stack to only what we need as we finish different
calculations, but our current stack measurement scripts just add
together the whole frames, so any per-call stack optimizations get
missed...
Note this is intentionally different from how lfsr_rbyd_fetch behaves
in lfs.c. We only call lfsr_rbyd_fetch when we need validated checksums,
otherwise we just don't fetch.
The dbg scripts, on the other hand, always go through fetch, but it is
useful to be able to inspect the state of incomplete trunks when
debugging.
This used to be how the dbg scripts behaved, but they broke because of
some recent script work.
Also limited block_size/block_count updates to only happen when the
configured value is None. This matches dbgbmap.py.
Basically just a cleanup of some bugs after the rework related to
matching dbgbmap.py. Unfortunately these scripts have too much surface
area and no tests...
- Not as easy to read as --ggplot; the light shades are maybe poorly
suited for plots vs other, larger block elements on GitHub. I don't
know, I'm not really a graphic designer.
- GitHub may be a moving target in the future.
- GitHub is already a moving target because it has like 9 different
optional color schemes (which is good!), so most of the time the
colors won't match anyways.
- The neutral gray of --ggplot works just as well outside of GitHub.
Worst case, --github was just a preset color palette, so it could in
theory be emulated with --foreground + --background + --font-color.
The -k/--keep-going option has been more or less useless before this
since it would completely flood the screen/logs when a bug triggers
multiple test failures, which is common.
Some things to note:
- RAM management is tricky with -k/--keep-going: if we try to save logs
and filter after running everything we quickly fill up memory.
- Failing test cases are a much slower path than successes, since we need
to kill and restart the underlying test_runner, as its state can't be
trusted anymore. This is a-ok since you usually hope for many more
successes than failures. Unfortunately it can make -k/--keep-going
quite slow.
---
ALSO -- warning this is a tangent rant-into-the-void -- I have
discovered that Ubuntu has a "helpful" subsystem named Apport that tries
to record/log/report any process crash in the system. It is "disabled" by
default, but the way it's disabled requires LAUNCHING A PYTHON
INTERPRETER to check a flag on every segfault/assert failure.
This is what it does when it's "disabled"!
This subsystem is fundamentally incompatible with any program that
intentionally crashes subprocesses, such as our test runner. The sheer
amount of python interpreters being launched quickly eats through all
available RAM and starts OOM killing half the processes on the system.
If anyone else runs into this, a shallow bit of googling suggests the
best solution is to just disable Apport. It is not a developer friendly
subsystem:
$ sudo systemctl disable apport.service
Removing Apport brings RAM usage back down to a constant level, even
with absurd numbers of test failures. And here I thought I had a memory
leak somewhere.
Previously, any labeling was _technically_ possible, but tricky to get
right and usually required repeated renderings.
It evolved out of the way colors/formats were provided: a cycled
order-significant list that gets zipped with the datasets. This works
ok for somewhat arbitrary formatting, such as colors/formats, but falls
apart for labels, where it turns out to be somewhat important what
exactly you are labeling.
The new scheme makes the label's relationship explicit, at the cost of
being a bit more verbose:
$ ./scripts/plotmpl.py bench.csv -obench.svg \
    -Linorder=0,4096,avg,bench_readed \
    -Lreversed=1,4096,avg,bench_readed \
    -Lrandom=2,4096,avg,bench_readed
This could also be adopted in the CSV manipulation scripts (code.py,
stack.py, summary.py, etc), but I don't think it would actually see that
much use. You can always awk the output to change names and it would add
more complexity to a set of scripts that are probably already way
over-designed.
This makes more sense when using benchmarks with sparse sampling rates.
Otherwise the rate of sampling also scales the resulting measurements
incorrectly.
If the previous behavior is required (if you want to ignore buffer sizes
when amortizing read/writes for example), the -n/--size field can always
be omitted.
Note there's a bit of subtlety here: field _types_ are still inferred,
but the intention of the fields, i.e. whether the field contains data vs
a row name/other properties, must be unambiguous in the scripts.
There is still a _tiny_ bit of inference. For most scripts only one
of --by or --fields is strictly needed, since this makes the purpose of
the other fields unambiguous.
The reason for this change is so the scripts are a bit more reliable,
but also because this simplifies the data parsing/inference a bit.
Oh, and this also changes field inference to use the csv.DictReader's
fieldnames field instead of only inspecting the returned dicts. This
should also save a bit of O(n) overhead when parsing CSV files.
1. Being able to inspect results before benchmarks complete was useful
to track their status. It also allows some analysis even if a
benchmark fails.
2. Moving these scripts out of bench.py allows them to be a bit more
flexible, at the cost of CSV parsing/structuring overhead.
3. Writing benchmark measurements immediately avoids RAM buildup as we
store intermediate measurements for each bench permutation. This may
increase the IO bottleneck, but we end up writing the same number of
lines, so not sure...
I realize avg.py has quite a bit of overlap with summary.py, but I don't
want to entangle them further. summary.py is already trying to do too
much as is...
The whitespace sensitivity of field args was starting to be a problem,
mostly for advanced plotmpl.py usage (which tbf might be appropriately
described as "super hacky" in how it uses CLI parameters):
./scripts/plotmpl.py \
    -Dcase=" \
        bench_rbyd_attr_append, \
        bench_rbyd_attr_remove, \
        bench_rbyd_attr_fetch, \
        ..."
This may present problems when parsing CSV files with whitespace, in
theory, maybe. But given the scope of these scripts for littlefs...
just don't do that. Thanks.
With the quantity of data being output by bench.py now, filtering ASAP
while parsing CSV files is a valuable optimization. And thanks to how
CSV files are structured, we can even avoid ever loading the full
contents into RAM.
This does end up with us filtering for defines redundantly in a few
places, but this is well worth the saved overhead from early filtering.
Also tried to clean up the plot.py/plotmpl.py's data folding path,
though that may have been wasted effort.
This is mainly to allow bench_runner to at least compile after moving
benches out of tree.
Also cleaned up lingering runner/suite munging leftover from the change
to an optional -R/--runner parameter.
This is based on how bench.py/bench_runners have actually been used in
practice. The main changes have been to make the output of bench.py more
readily consumable by plot.py/plotmpl.py without needing a bunch of
hacky intermediary scripts.
Now instead of a single per-bench BENCH_START/BENCH_STOP, benches can
have multiple named BENCH_START/BENCH_STOP invocations to measure
multiple things in one run:
BENCH_START("fetch", i, STEP);
lfsr_rbyd_fetch(&lfs, &rbyd_, rbyd.block, CFG->block_size) => 0;
BENCH_STOP("fetch");
Benches can also now report explicit results, for non-io measurements:
BENCH_RESULT("usage", i, STEP, rbyd.eoff);
The extra iter/size parameters to BENCH_START/BENCH_RESULT also allow
some extra information to be calculated post-bench. This information gets
tagged with an extra bench_agg field to help organize results in
plot.py/plotmpl.py:
- bench_meas=<meas>+amor, bench_agg=raw - amortized results
- bench_meas=<meas>+div, bench_agg=raw - per-byte results
- bench_meas=<meas>+avg, bench_agg=avg - average over BENCH_SEED
- bench_meas=<meas>+min, bench_agg=min - minimum over BENCH_SEED
- bench_meas=<meas>+max, bench_agg=max - maximum over BENCH_SEED
---
Also removed all bench.tomls for now. This may seem counterproductive in
a commit to improve benchmarking, but I'm not sure there's actual value
to keeping bench cases committed in tree.
These were always quick to fall out of date (at the time of this commit
most of the low-level bench.tomls, rbyd, btree, etc, no longer
compiled), and most benchmarks were one-off collections of scripts/data
with results too large/cumbersome to commit and keep updated in tree.
I think the better way to approach benchmarking is a separate repo
(multiple repos?) with all related scripts/state/code and results
committed into a hopefully reproducible snapshot. Keeping the
bench.tomls in that repo makes more sense in this model.
There may be some value to having benchmarks in CI in the future, but
for that to make sense they would need to actually fail on performance
regression. How to do that isn't so clear. Anyways we can always address
this in the future rather than now.
Before:
littlefs v2.0 0x{0,1}.232, rev 99, weight 9.256, bd 4096x256
{00a3,00a4}: 0.1 file0000 reg 32768, trunk 0xa3.a8 32768, btree 0x1a.846 32704
             0.2 file0001 reg 32768, trunk 0xa3.16c 32768, btree 0xa2.be1 32704
After:
littlefs v2.0 0x{0,1}.232, rev 99, weight 9.256, bd 4096x256
{00a3,00a4}: 0.1 file0000 reg 32768, trunk 0xa3.a8, btree 0x1a.846
             0.2 file0001 reg 32768, trunk 0xa3.16c, btree 0xa2.be1
Most files will have both a shrub and a btree, which makes the previous
output problematically noisy.
Unfortunately, this does lose some information: the size of the
shrub/tree, both of which may be less than the full file. But 1. this
is _technically_ redundant since you only need the block/trunk to fetch an
rbyd (though the weight is useful), and 2. The weight can still be
viewed with -s -i.
dbgbmap.py parses littlefs's mtree/btrees and displays the status of
every block in use:
$ ./scripts/dbgbmap.py disk -B4096x256 -Z -H8 -W64
bd 4096x256, 7.8% mdir, 10.2% btree, 78.1% data
mmddbbddddddmmddddmmdd--bbbbddddddddddddddbbdddd--ddddddmmdddddd
mmddddbbddbbddddddddddddddddbbddddbbddddddmmddbbdddddddddddddddd
bbdddddddddddd--ddddddddddddddddbbddddmmmmddddddddddddmmmmdddddd
ddddddddddbbdddddddddd--ddddddddddddddmmddddddddddddddddddddmmdd
ddddddbbddddddddbb--ddddddddddddddddddddbb--mmmmddbbdddddddddddd
ddddddddddddddddddddbbddbbdddddddddddddddddddddddddddddddddddddd
dddddddddd--ddddbbddddddddmmbbdd--ddddddddddddddbbmmddddbbdddddd
ddmmddddddddddmmddddddddmmddddbbbbdddddddd--ddbbddddddmmdd--ddbb
(ok, it looks a bit better with colors)
dbgbmap.py matches the layout and has the same options as tracebd.py,
allowing the combination of both to provide valuable insight into what
exactly littlefs is doing.
This required a bit of tweaking of tracebd.py to get right, mostly
around conflicting order-based arguments. This also reworks the internal
Bmap class to be more resilient to out-of-window ops, and adds an
optional informative header.
In the hack where we wait for multiple updates to fill out a full
braille/dots line we store the current pixels in a temporary array.
Unfortunately, in some cases, this is the array we modify with
updates...
A copy fixes this.
- Tried to do the rescaling a bit better with truncating divisions, so
there shouldn't be weird cross-pixel updates when things aren't well
aligned.
- Adopted optional -B<block_size>x<block_count> flag for explicitly
specifying the block-device geometry in a way that is compatible with
other scripts. Should adopt this more places.
- Adopted optional <block>.<off> argument for start of range. This
should match dbgblock.py.
- Adopted '-' for noop/zero-wear.
- Renamed a few internal things.
- Dropped subscript chars for wear, this didn't really add anything and
can be accomplished by specifying the --wear-chars explicitly.
Also changed dbgblock.py to match, this mostly affects the --off/-n/--size
flags. For example, these are all the same:
./scripts/dbgblock.py disk -B4096 --off=10 --size=5
./scripts/dbgblock.py disk -B4096 --off=10 -n5
./scripts/dbgblock.py disk -B4096 --off=10,15
./scripts/dbgblock.py disk -B4096 -n10,15
./scripts/dbgblock.py disk -B4096 0.10 -n5
Also also adopted block-device geometry argument across scripts, where
the -B flag can optionally be a full <block_size>x<block_count> geometry:
./scripts/tracebd.py disk -B4096x256
Though this is mostly unused outside of tracebd.py right now. It will be
useful for anything that formats littlefs (littlefs-fuse?) and allowing
the format everywhere is a bit of a nice convenience.
The biggest change here is the breaking up of the FLAGS config into
RFLAGS/WFLAGS/OFLAGS. This is directly inspired by, and honestly not
much more than a renaming of, the compat/ro_compat/incompat flags found
in Linux/Unix/POSIX filesystems.
I think these were first introduced in ext2? But I need to do a bit more
research on that.
RFLAGS/WFLAGS/OFLAGS provide a much more flexible, and extensible,
feature flag mechanism than the previous minor version bumps.
The (re)naming of these flags is intended to make their requirements
more clear. In order to perform the relevant operation, you must
understand every flag set in the relevant flag field:
- RFLAGS / incompat flags - All flags must be understood to read the
filesystem; if not understood, the only possible behavior is to fail.
- WFLAGS / ro-compat flags - All flags must be understood to write to the
filesystem; if not understood, the filesystem may be mounted read-only.
- OFLAGS / compat flags - Optional flags; if not understood, the relevant
flag must be cleared before the filesystem can be written to, but other
than that these flags can mostly be ignored.
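As a rough sketch of how a mount might treat the three classes (the
masks, helper name, and read-only plumbing here are illustrative
assumptions, not the actual lfs.c API):

    #include <stdbool.h>
    #include <stdint.h>
    #include "lfs.h"  // for LFS_ERR_INVAL

    // illustrative masks of the flags this implementation understands
    #define LFSR_RFLAGS_KNOWN 0x00000001
    #define LFSR_WFLAGS_KNOWN 0x00000001
    #define LFSR_OFLAGS_KNOWN 0x00000001

    static int mount_checkflags(uint32_t rflags, uint32_t wflags,
            uint32_t oflags, bool *rdonly, uint32_t *oflags_to_clear) {
        // rflags: every set bit must be understood, or we can't even read
        if (rflags & ~LFSR_RFLAGS_KNOWN) {
            return LFS_ERR_INVAL;
        }
        // wflags: unknown bits force a read-only mount
        *rdonly = (wflags & ~LFSR_WFLAGS_KNOWN) != 0;
        // oflags: unknown bits can be ignored, but need to be cleared on
        // the first write since we can no longer keep them valid
        *oflags_to_clear = oflags & ~LFSR_OFLAGS_KNOWN;
        return 0;
    }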
Some hypothetical littlefs examples:
- RFLAGS / incompat flags - Transparent compression
Is this the same as a major disk-version break? Yes kinda? An
implementation that doesn't understand compression can't read the
filesystem.
On the other hand, it's useful to have a filesystem that can read both
compressed and uncompressed variants.
- WFLAGS / ro-compat flags - Closed block-map
The idea behind a closed block-map (currently planned), is that
littlefs maintains in global space a complete mapping of all blocks in
use by the filesystem.
For such a mapping to remain consistent means that if you write to the
filesystem you must understand the closed block-map. Or in other
words, if you don't understand the closed block-map you must not write
to the filesystem.
Reading, on the other hand, can ignore many such write-related
auxiliary features, so the filesystem can still be read from.
- OFLAGS / compat flags - Global checksums
Global checksums (currently planned) are extra checksums attached to
each mdir that when combined self-validate the filesystem.
But if you don't understand global checksums, you can still read and
write the filesystem without them. The only catch is that when you write
to the filesystem, you may end up invalidating the global checksum.
Clearing the global checksum bit in the OFLAGS is a cheap way to
signal that the global checksum is no longer valid, allowing you to
still write to the filesystem without this optional feature.
Other tweaks to note:
- Renamed BLOCKLIMIT/DISKLIMIT -> BLOCKSIZE/BLOCKCOUNT
Note these are still the _actual_ block_size/block_count minus 1. The
subtle difference here was the original reason for the name change,
but after working with it for a bit, I just don't think new, otherwise
unused, names are worth it.
The minus 1 stays, however, since it avoids overflow issues at the
extreme boundaries of powers of 2 (see the sketch after this list).
- Introduces STAGLIMIT/SATTRLIMIT, sys-attribute parallels to
UTAGLIMIT/UATTRLIMIT.
These may be useful if only uattrs are supported, or vice-versa.
- Dropped UATTRLIMIT/SATTRLIMIT to 255 bytes.
This feels extreme, but matches NAMELIMIT. These _should_ be small,
and limiting the uattr/sattr size to a single byte leads to really
nice packing of the utag+uattrsize in a single integer.
This can always be expanded in the future if this limit proves to be a
problem.
- Renamed MLEAFLIMIT -> MDIRLIMIT and (re?)introduced MTREELIMIT.
These may be useful for limiting the mtree when needed, though the
exact use case isn't clear quite yet.
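To sketch the overflow point behind the BLOCKSIZE/BLOCKCOUNT minus 1
mentioned above (the field widths here are just an assumption for
illustration): a 32-bit field can't hold a block_count of exactly 2^32,
but it can hold 2^32 - 1:

    #include <stdint.h>

    // encode: store block_count - 1, so block_count = 2^32 still fits
    static inline uint32_t blockcount_encode(uint64_t block_count) {
        return (uint32_t)(block_count - 1);
    }

    // decode: widen before adding the 1 back
    static inline uint64_t blockcount_decode(uint32_t field) {
        return (uint64_t)field + 1;
    }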
It's probably better to have separate names for a tag category and any
specific tag, but I can't think of a better name for this tag, and I
hadn't noticed that I was already ignoring the C prefix for CCKSUM tags
in many places.
NAME/CKSUM now mean both the specific tag and tag category, which is a
bit of a hack since both happen to be the 0th-subtype of their
categories.
I may be overthinking things, but I'm guessing of all the possible tag
modes we may want to add in the future, we will most likely want to add
something that looks vaguely tag-like. Like the shrub tags, for example.
It's beneficial, ordering-wise, for these hypothetical future tags to
come before the cksum tags.
Current tag modes:
0x0ttt v--- tttt -ttt tttt normal tags
0x1ttt v--1 tttt -ttt tttt shrub tags
0x3tpp v-11 tttt ---- ---p cksum tags
0x4kkk v1dc kkkk -kkk kkkk alt tags
Yes, erases are the more costly operation that we should highlight. But,
aside from broken code, you can never prog more than you erase.
This makes it more useful to prioritize progs over erases, so erases
without an overlaying prog show up as a relatively unique blue,
indicating regions of memory that have been erased but not progged.
Too many erased-but-not-progged regions indicate a potentially wasteful
algorithm.
Note this is already showing better code reuse, which is a good sign,
though maybe that's just the benefit of reimplementing similar logic
multiple times.
Now both reading and carving end up in the same lfsr_btree_readnext and
lfsr_btree_buildcarve functions for both btrees and shrubs. Both btrees
and shrubs are fundamentally rbyds, so we can share a lot of
functionality as long as we redirect to the correct commit function at
the last minute. This surprising opportunity for deduplication was
noticed while putting together the dbg scripts.
Planned logic (not actual function names):
lfsr_file_readnext -> lfsr_shrub_readnext
         |                     |
         |                     v
         '---------> lfsr_btree_readnext
lfsr_file_flushbuffer -> lfsr_shrub_carve ------------.
          .---------------------'                     |
          v                                           v
lfsr_file_flushshrub -> lfsr_btree_carve -> lfsr_btree_buildcarve
Though the btree part of the above statement is only a hypothetical at
the moment. Not even the shrubs can survive compaction now.
The reason is the new SLICE tag, which needs low-level support in rbyd
compact. SLICE introduces indirect references to data located in the same
rbyd, which removes any copying cost associated with coalescing.
Previously, a large coalesce_size risked O(n^2) runtime when
incrementally appending small amounts of data, but with SLICEs we can
defer coalescing to compaction time, where the copy is effectively free.
This compaction-time-coalescing is also hypothetical, which is why our
tests are failing. But the theory is promising.
I was originally against this idea because of how it crosses abstraction
layers, requiring some very low-level code that absolutely can not be
omitted in a simpler littlefs driver. But after working on the actual
file writing code for a while I've become convinced the tradeoff is
worth it.
Note coalesce_size will likely still need to be configurable. Data in
fragmenting/sparse btrees is still susceptible to coalescing, and it's
not clear the impacts of internal fragmentation when data sizes approach
the hard block_size/2 limit.
My current thinking is that these are conceptually different types, with
BTREE tags representing the entire btree, and BRANCH tags representing
only the inner btree nodes. We already have multiple btree tags anyways:
btrees attached to files, the mtree, and in the future maybe a bmaptree.
Having separate tags also makes it possible to store a btree in a btree,
though I don't think we'll ever use this functionality.
This also removes the redundant weight field from branches. The
redundant weight field is only a minor cost relative to storage, but it
also takes up a bit of RAM when encoding. Though measurements show this
isn't really significant.
New encodings:
btree encoding:           branch encoding:
.---+- -+- -+- -+- -.     .---+- -+- -+- -+- -.
|       weight      |     |       blocks      |
+---+- -+- -+- -+- -+     '                   '
|       blocks      |     '                   '
'                   '     +---+- -+- -+- -+- -+
'                   '     |       trunk       |
+---+- -+- -+- -+- -+     +---+- -+- -+- -+- -'
|       trunk       |     |     cksum     |
+---+- -+- -+- -+- -'     '---+---+---+---'
|     cksum     |
'---+---+---+---'
Code/RAM changes:
          code           stack
before:   30836          2088
after:    30944 (+0.4%)  2080 (-0.4%)
Also reordered other on-disk structs with weight/size, so such structs
always have weight/size as the first field. This may enable some
optimizations around decoding the weight/size without needing to know
the specific type in some cases.
---
This change shouldn't have affected functionality, but it revealed a bug
in a dtree test, where a did gets caught in an mdir split and the split
name makes the did unreachable.
Marking this as a TODO for now. The fix is going to be a bit involved
(fundamental changes to the opened-mdir list), and similar work is
already planned to make removed files work.
This is a pretty big rewrite, but is necessary to avoid "dagging".
"Dagging" (I just made this term up) is when you transform a pure tree
into a directed acyclic graph (DAG). Normally DAGs are perfectly fine in
a copy-on-write system, but in littlefs's case, it creates havoc for
future block allocator plans, and its interaction with parity blocks
raises some uncomfortable questions.
How does dagging happen?
Consider an innocent little btree with a single block:
    .-----.
    |btree|
    |     |
    '-----'
       |
       v
    .-----.
    |abcde|
    |     |
    '-----'
Say we wanted to write a small amount of data in the middle of our
block. Since the data is so small, the previous scheme would simply
inline the data, carving the left and right sibling (in this case the
same block) to make space:
    .-----.
    |btree|
    |     |
    '-----'
   .'  v  '.
   |  c'   |
   '.     .'
    v     v
    .-----.
    |ab de|
    |     |
    '-----'
Oh no! A DAG!
With the potential for multiple pointers to reference the same block in
our btree, some invariants break down:
- Blocks no longer have a single reference
- If you remove a reference you can no longer assume the block is free
- Knowing when a block is free requires scanning the whole btree
- This split operation effectively creates two blocks, does that mean
we need to rewrite parity blocks?
---
To avoid this whole situation, this commit adopts a new crystallization
algorithm.
Instead of allowing crystallization data to be arbitrarily fragmented,
we eagerly coalesce any data under our crystallization threshold, and if
we can't coalesce, we compact everything into a block.
Much like a Knuth heap, simply checking both siblings to coalesce has
the effect that any data will always coalesce up to the maximum size
where possible. And when checking for siblings, we can easily find the
block alignment.
This also has the effect of always rewriting blocks if we are writing a
small amount of data into a block. Unfortunately I think this is just
necessary in order to avoid dagging.
At the very least crystallization is still useful for files not quite
block aligned at the edges, and sparse files. This also avoids concerns
of random writes inflating a file via sparse crystallization.
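A very rough sketch of the coalescing rule, where everything, including
using crystal_size as the coalescing bound, is an assumption for
illustration rather than the actual lfs.c implementation:

    #include <stddef.h>

    enum crystal_plan {
        PLAN_COALESCE_LEFT,   // rewrite left sibling + new data together
        PLAN_COALESCE_RIGHT,  // rewrite new data + right sibling together
        PLAN_BLOCK,           // can't coalesce, compact into a block
    };

    static enum crystal_plan plan_crystal_write(
            size_t crystal_size,
            size_t size,        // size of the incoming write
            size_t left_size,   // size of the left sibling, 0 if none
            size_t right_size)  // size of the right sibling, 0 if none
    {
        // eagerly coalesce anything under the crystallization threshold,
        // checking both siblings so fragments grow toward the max size
        if (size < crystal_size) {
            if (left_size && left_size + size <= crystal_size) {
                return PLAN_COALESCE_LEFT;
            }
            if (right_size && size + right_size <= crystal_size) {
                return PLAN_COALESCE_RIGHT;
            }
        }
        // otherwise compact everything into its own block, never leaving
        // multiple btree references pointing into a shared block (dagging)
        return PLAN_BLOCK;
    }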
Now when you mount littlefs, the debug print shows a bit more info:
lfs.c:7881:debug: Mounted littlefs v2.0 0x{0,1}.c63 w43.256, bd 4096x256
To disassemble this a bit:
littlefs v2.0 0x{0,1}.c63 w43.256, bd 4096x256
          ^ ^   '-+-'  ^   ^   ^        ^   ^
          '-|-----|----|---|---|--------|---|-- major version
            '-----|----|---|---|--------|---|-- minor version
                  '----|---|---|--------|---|-- mroot blocks
                       |   |   |        |   |   (1st is active)
                       '---|---|--------|---|-- mroot trunk
                           '---|--------|---|-- mtree weight
                               '--------|---|-- mleaf weight
                                        '---|-- block size
                                            '-- block count
dbglfs.py also shows the block device geometry now, as read from the
mroot:
$ ./scripts/dbglfs.py disk -B4096
littlefs v2.0 0x{0,1}.c63, rev 1, weight 43.256, bd 4096x256
...
This may be over-optimizing for testing, but the reason the mount debug
is only one line is to avoid slowing down/cluttering test output. Both
powerloss testing and remounts completely fill the output with mount
prints that aren't actually all that useful.
Also switching to preferring parens in debug info, mainly for mismatched
things.
Mainly aligning things; it was easy for the previous repr to become a
visual mess.
This also represents the config more like how we represent other tags,
since they've changed from a monolithic config block to separate
attributes.
This is a compromise between padding the tag repr correctly and parsing
speed.
If we don't have to traverse an rbyd (for, say, tree printing), we don't
want to since parsing rbyds can get quite slow when things get big
(remember this is a filesystem!). This makes tag padding a bit of a hard
sell.
Previously this was hardcoded to 22 characters, but with the new file
struct printing it quickly became apparent that this would be a
problematic limit:
12288-15711 block w3424 0x1a.0 3424 67 64 79 70 61 69 6e 71 gdypainq
It's interesting to note that this has only become an issue for large
trees, where the weight/size in the tag can be arbitrarily large.
Fortunately we already have the weight of the rbyd after fetch, so we
can use a heuristic similar to the id padding:
tag padding = 21 + nlog10(max(weight,1)+1)
---
Also dropped extra information with the -x/--device flag. It hasn't
really been useful and was implemented inconsistently. Maybe -x/--device
should just be dropped completely...
You can now pass -s/--structs to dbglfs.py to show any file data
structures:
$ ./scripts/dbglfs.py disk -B4096 -f -s -t
littlefs v2.0 0x{0,1}.9cf, rev 3, weight 0.256
{0000,0001}: -1.1 hello reg 128, trunk 0x0.993 128
0000.0993:     .-> 0-15    shrubinlined w16 16 6b 75 72 65 65 67 73 63 kureegsc
             .-+-> 16-31   shrubinlined w16 16 6b 65 6a 79 68 78 6f 77 kejyhxow
             | .-> 32-47   shrubinlined w16 16 65 6f 66 75 76 61 6a 73 eofuvajs
           .-+-+-> 48-63   shrubinlined w16 16 6e 74 73 66 67 61 74 6a ntsfgatj
           |   .-> 64-79   shrubinlined w16 16 70 63 76 79 6c 6e 72 66 pcvylnrf
           | .-+-> 80-95   shrubinlined w16 16 70 69 73 64 76 70 6c 6f pisdvplo
           | | .-> 96-111  shrubinlined w16 16 74 73 65 69 76 7a 69 6c tseivzil
           +-+-+-> 112-127 shrubinlined w16 16 7a 79 70 61 77 72 79 79 zypawryy
This supports the same -b/-t/-i options found in dbgbtree.py, with the
one exception being -z/--struct-depth which is lowercase to avoid
conflict with the -Z/--depth used to indicate the filesystem tree depth.
I think this is a surprisingly reasonable way to show the inner
structure of files without clobbering the user's console with file
contents.
Don't worry, if clobbering is desired, -T/--no-truncate still dumps all
of the file content.
Though it's still up to the user to manually apply the sprout/shrub
overlay. That step is still complex enough to not implement in this
tool yet.
I ended up changing the name of lfsr_mtree_traversal_t -> lfsr_traversal_t,
since this behaves more like a filesystem-wide traversal than an mtree
traversal (it returns several typed objects, not mdirs like the other
mtree functions, for one).
As a part of this changeset, lfsr_btraversal_t (was lfsr_btree_traversal_t)
and lfsr_traversal_t no longer return untyped lfsr_data_ts, but instead
return specialized lfsr_{b,t}info_t structs. We weren't even using
lfsr_data_t for its original purpose in lfsr_traversal_t.
Also changed lfsr_traversal_next -> lfsr_traversal_read; you may notice
at this point the changes are intended to make lfsr_traversal_t look
more like lfsr_dir_t for consistency.
---
Internally lfsr_traversal_t now uses a full state machine with its own
enum due to the complexity of traversing the filesystem incrementally.
Because creating diagrams is fun, here's the current full state machine,
though note it will need to be extended for any
parity-trees/free-trees/etc:
mrootanchor
     |
     v
mrootchain
 .-' |
 |   v
 |  mtree ---> openedblock
 '-. | ^         | ^
   v v |         v |
  mdirblock   openedbtree
     | ^
     v |
  mdirbtree
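For reference, a rough sketch of the states the diagram implies as a C
enum; the names and per-state descriptions here are my interpretation
and may not match lfs.c:

    // hypothetical traversal states, mirroring the diagram above
    enum lfsr_traversal_state {
        LFSR_TSTATE_MROOTANCHOR,  // start at the fixed mroot anchor blocks
        LFSR_TSTATE_MROOTCHAIN,   // follow the chain of mroots
        LFSR_TSTATE_MTREE,        // walk the mtree's inner btree nodes
        LFSR_TSTATE_MDIRBLOCK,    // report each mdir's blocks
        LFSR_TSTATE_MDIRBTREE,    // walk any btrees attached to mdir files
        LFSR_TSTATE_OPENEDBLOCK,  // report blocks of opened, unsynced files
        LFSR_TSTATE_OPENEDBTREE,  // walk btrees of opened, unsynced files
    };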
I'm not sure I'm happy with the current implementation, and eventually
it will need to be able to handle in-place repairs to the blocks it
sees, so this whole thing may need a rewrite.
But in the meantime, this passes the new clobber tests in test_alloc, so
it should be enough to prove the file implementation works (which is
definitely not fully tested yet; some bugs had to be fixed for the new
tests in test_alloc to pass).
---
Speaking of test_alloc.
The inherent cyclic dependency between files/dirs/alloc makes it a bit
hard to know what order to test these bits of functionality in.
Originally I was testing alloc first, because it seems you need to be
confident in your block allocator before you can start testing
higher-level data structures.
But I've gone ahead and reversed this order, testing alloc after
files/dirs. This is because of an interesting observation that if alloc
is broken, you can always increase the test device's size to some absurd
number (-DDISK_SIZE=16777216, for example) to kick the can down the
road.
Testing in this order allows alloc to use more high-level APIs and
focus on corner cases where the allocator's behavior requires subtlety
to be correct (e.g. ENOSPC).
Still needs testing, though the byte-level fuzz tests were already causing
blocks to crystallize. I noticed this because of test failures which are
fixed now.
Note the block allocator currently doesn't understand file btrees. To
get the current tests passing requires -DDISK_SIZE=16777216 or greater.
It's probably also worth noting there's a lot that's not implemented
yet! Data checksums and write validation for one. Also ecksums. And we
should probably have some sort of special handling for linear writes so
linear writes (the most common) don't end up with a bunch of extra
crystallizing writes.
Also the fact that btrees can become DAGs now is an oversight and a bit
concerning. Will that work with a closed allocator? Block parity?
So now instead of needing:
./scripts/test.py ./runners/test_runner test_dtree
You can just do:
./scripts/test.py test_dtree
Or with an explicit path:
./scripts/test.py -R./runners/test_runner test_dtree
This makes it easier to run the script manually. And, while there may be
some hiccups with the implicit relative path, I think in general this will
make the test/bench scripts easier to use.
There was already an implicit runner path, though only if the test suite
was completely omitted. I'm not sure that would ever have actually
been useful...
---
Also increased the permutation field size in --list-*, since I noticed it
was overflowing.
Previously our lower/upper bounds were initialized to -1..weight. This
made a lot of the math unintuitive and confusing, and it's not really
necessary to support -1 rids (-1 rids arise naturally in order-statistic
trees that can have weight=0).
The tweak here is to use lower/upper bounds initialized to 0..weight,
which makes the math behave as expected. -1 rids naturally arise from
rid = upper-1.
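A minimal sketch of the new convention; the real lookup narrows these
bounds by following the rbyd's alt pointers, which is elided here:

    #include <stdint.h>

    // purely illustrative, not the actual rbyd lookup
    static inline int32_t rid_sketch(uint32_t weight) {
        uint32_t lower = 0;       // previously initialized to -1
        uint32_t upper = weight;  // unchanged
        // ... binary search narrows lower/upper toward the target rid ...
        (void)lower;
        // rids fall out of the upper bound, so weight=0 naturally yields
        // rid = -1 without any special casing
        return (int32_t)upper - 1;
    }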
- Added shrub tags to tagrepr
- Modified dbgrbyd.py to use last non-shrub trunk by default
- Tweaked dbgrbyd's log mode to find maximum seen weight for id padding