This matches how diff percentages are rendered, and simplifies the
internal table rendering by making Frac less of a special case. It also
allows for other kinds of notes in the future.
One concern is how all the notes are shoved to the side, which may make
it a bit harder to find related percentages. If this becomes annoying we
should probably look into interspersing all notes (including diff
percentages) between the relevant columns.
Before:
function                     lines           branches
lfsr_rbyd_appendattr         230/231  99.6%  172/192  89.6%
lfsr_rbyd_p_recolor          33/34    97.1%  11/12    91.7%
lfs_alloc                    40/42    95.2%  21/24    87.5%
lfsr_rbyd_appendcompaction   54/57    94.7%  39/42    92.9%
...
After:
function                     lines    branches
lfsr_rbyd_appendattr         230/231  172/192  (99.6%, 89.6%)
lfsr_rbyd_p_recolor          33/34    11/12    (97.1%, 91.7%)
lfs_alloc                    40/42    21/24    (95.2%, 87.5%)
lfsr_rbyd_appendcompaction   54/57    39/42    (94.7%, 92.9%)
...
Previously, with -d/--diff, we would only show non-zero percentages. But
this was ambiguous/confusing when dealing with multiple results
(stack.py, summary.py, etc).
To help with this, I've switched to showing all percentages unless all
percentages are zero (no change). This matches the -d/--diff row-hiding
logic, so by default all rows should show all percentages.
Note -p/--percent did not change, as it already showed all percentages
all of the time.
Noticed weird slowness when summarizing test results by suite vs case.
Turns out the way we accumulate results by overloading Python's __add__
quickly leads to O(n^2) behavior as we repeatedly concatenate
increasingly large lists.
Instead of doing anything sane, I've added a second, immutable length to
each list such that we can opportunistically reuse/mutate/append lists
in __add__. The end result should be O(n) most of the time.
Observe:
           lines   bytes
test.csv:  537749  64551874  62MiB

./scripts/summary.py test.csv -ftest_time -S
          before      after
-bcase:   0m51.772s   0m9.302s  (-82.0%)
-bsuite:  10m29.067s  0m9.357s  (-98.5%)
The main idea here is that diverse tests are better than many similar
tests.
Sure, if we throw fuzz tests at the system all day we'll eventually find
more bugs, but if a developer is in the loop that time is going to be
better spent writing specific tests targeting the fragile parts of the
system.
And don't worry, we can still throw fuzz tests at the system all day by
specifying explicit seeds with -DSEED=blah.
Changes:
- Limited dir-related powerloss fuzz testing to N <= 16.
These tests were the biggest culprit of excessive test runtime,
requiring O(n^2) redundant operations to recover from powerlosses
(they just replay the full sequence on powerloss).
- As a tradeoff, bumped most fuzz tests to a minimum of 20 seeds.
The big exception being the test_fwrite tests, which are heavily
parameterized and already take the most time to run. Each parameter
combination also multiplies the effective number of seeds, so
increasing the number of base seeds will probably have diminishing
returns.
- Limited test_fwrite_reversed to SIZE <= 4*1024*CHUNK.
Writing a file backwards is just about the worst way you could write a
file, since all buffering/coalescing expect writes to eventually make
forward progress. On the flip side, because it's uncommon, writing a
file backwards is also a great way to find bugs. But at some point a
compromise needs to be made.
Impacted test runtimes:
case                                 otime   ntime    dtime
test_btree_push_fuzz                   0.3     0.5     +0.2  (+60.2%)
test_btree_push_sparse_fuzz            0.4     3.3     +2.9  (+720.4%)
test_btree_update_fuzz                 0.4     0.9     +0.6  (+141.6%)
test_btree_update_sparse_fuzz          0.5     4.5     +4.1  (+857.4%)
test_btree_pop_fuzz                    0.6     2.3     +1.7  (+314.7%)
test_btree_pop_sparse_fuzz             1.2     5.7     +4.4  (+356.2%)
test_btree_split_fuzz                  0.5     1.4     +0.8  (+150.2%)
test_btree_split_sparse_fuzz           0.4     5.6     +5.1  (+1163.2%)
test_btree_find_fuzz                   0.5     0.7     +0.2  (+50.7%)
test_btree_find_sparse_fuzz            1.0     3.0     +2.0  (+189.8%)
test_btree_traversal_fuzz              0.6     2.3     +1.6  (+260.4%)
test_dirs_mkdir_many                   3.3     2.1     -1.3  (-37.8%)
test_dirs_mkdir_many_backwards         3.5     2.1     -1.4  (-39.9%)
test_dirs_mkdir_fuzz                 115.3   106.4     -8.9  (-7.7%)
test_dirs_rm_many                    283.9    76.8   -207.0  (-72.9%)
test_dirs_rm_many_backwards          216.1    80.6   -135.5  (-62.7%)
test_dirs_rm_fuzz                    647.0    68.5   -578.5  (-89.4%)
test_dirs_mv_many                     14.2    15.4     +1.1  (+7.9%)
test_dirs_mv_many_backwards           16.5    14.5     -2.1  (-12.5%)
test_dirs_mv_fuzz                   1932.5   156.7  -1775.8  (-91.9%)
test_dirs_general_fuzz               561.9    74.5   -487.4  (-86.7%)
test_dread_recursive_rm              336.6    46.2   -290.4  (-86.3%)
test_dread_recursive_mv               55.5    44.6    -11.0  (-19.8%)
test_fsync_rrrr_fuzz                   0.4     0.3     -0.1  (-18.4%)
test_fsync_wrrr_fuzz                   8.0    12.4     +4.5  (+56.0%)
test_fsync_wwww_fuzz                  13.2    33.4    +20.2  (+152.6%)
test_fsync_wwrr_fuzz                   5.4    50.9    +45.5  (+841.6%)
test_fsync_rwrw_fuzz                   2.4     8.4     +6.0  (+253.9%)
test_fsync_rwrw_sparse_fuzz            3.2     7.5     +4.2  (+129.9%)
test_fsync_rwtfrwtf_sparse_fuzz        6.1     8.5     +2.4  (+39.3%)
test_fsync_drrr_fuzz                  11.8     9.2     -2.6  (-21.8%)
test_fsync_wddd_fuzz                   9.3    11.9     +2.6  (+28.0%)
test_fsync_rwdrwd_fuzz                 1.6    33.1    +31.5  (+1963.4%)
test_fsync_rwdrwd_sparse_fuzz          0.3     1.8     +1.4  (+418.8%)
test_fsync_rwtfdrwtfd_sparse_fuzz      0.3     1.1     +0.8  (+260.2%)
test_fwrite_reversed                 728.5   345.2   -383.3  (-52.6%)
TOTAL                               7587.5  3792.3  -3795.2  (-50.0%)
Now that we have ~20 minutes of tests, it's good to know _why_ the tests
take ~20 minutes, and if this time is being spent well.
This adds the field test_time to test.py's -o/--output, which reports
the runtime of each test in seconds. This can be organized by suite,
case, etc, with our existing csv scripts.
Note I've limited the precision to only milliseconds (%.6f).
Realistically, this is plenty of precision, and with the number of
tests we have, extra digits can really add up!
                           lines            bytes
test.csv before:           525593           58432541  56MiB
test.csv full precision:   525593  (+0.0%)  69817693  67MiB  (+19.5%)
test.csv milli precision:  525593  (+0.0%)  63162935  60MiB  (+8.1%)
It still takes a bit of time to process this (50.3s), but now we can see
the biggest culprits of our ~20 minute test time:
$ ./scripts/summary.py test.csv -bcase -ftest_time -S
case                               test_time
...
test_fwrite_hole_compaction             74.4
test_fwrite_incr                       109.7
test_dirs_mkdir_fuzz                   115.3
test_fwrite_overwrite_compaction       132.4
test_rbyd_fuzz_append_removes          134.0
test_rbyd_fuzz_mixed                   136.3
test_rbyd_fuzz_sparse                  137.4
test_fwrite_w_seek                     144.1
test_rbyd_fuzz_create_deletes          144.8
test_dirs_rm_many_backwards            208.4
test_dirs_rm_many                      273.8
test_fwrite_fuzz_unaligned             283.2
test_dread_recursive_rm                316.7
test_fwrite_fuzz_aligned               551.0
test_dirs_general_fuzz                 552.8
test_dirs_rm_fuzz                      632.7
test_fwrite_reversed                   719.0
test_dirs_mv_fuzz                     1984.8
TOTAL                                 7471.3
Note this machine has 6 cores, 12 hthreads, 7471.3/60/6 => 20.8m, which
is why I don't run these tests single threaded.
It's really frustrating that it's impossible to create an uninitialized
expression with the scope of a compound-literal...
(I'm going to ignore that this is technically possible with alloca.)
The lack of uninitialized compound-literals forces each of our attribute
lists to make a decision: 1. use an implicit buffer and pay for
zero-initialization, or 2. use an explicit buffer, adding code noise
and risking out-of-date buffer sizes.
As a compromise, this commit adopts explicit buffers in most of the
low/mid-level layers, where the code is already pretty noisy but also
heavily scrutinized and iterated over to reduce code/stack costs. This
leaves the high-level layers with the hopefully safer and more readable
implicit buffers.
You can see this zero-initialization has a surprisingly high code cost
for what is otherwise a noop:
code stack
before: 33828 2632
after: 33656 (-0.5%) 2632 (+0.0%)
Now it is fit for purpose and can replace the explicit tag comparison +
assert in lfsr_rbyd_appendattr. Previously we had to check if delta==0,
but now we can just assert, since delta!=0 is invalid for noops.
Unfortunately this added a couple bytes of code. The disassembly for
lfsr_rbyd_appendattr is all shuffled up, so I guess this is just
compiler noise. At least it's better than an explicit delta check:
code stack
before: 33820 2632
check delta: 33832 (+0.0%) 2632 (+0.0%)
assert delta: 33828 (+0.0%) 2632 (+0.0%)
Turns out temporary struct-literals have a slightly better code/stack
footprint than array-literals. I guess because nuances around arrays in
C can cause problems for optimization passes?
This makes forcing lvalues for macro consistency much more appealing:
code stack
sometimes rvalues: 33780 2640
array lvalues (before): 33868 (+0.3%) 2640 (+0.0%)
struct lvalues (after): 33820 (+0.1%) 2632 (-0.3%)
I think what may be going on with the unexpected stack cost related to
struct passing is something to do with scoping and how it interacts
with function inlining + shrink wrapping.
Compound-literals have a scope limited by the current statement, and
while temporary structs _should_ have a scope limited to the current
expression, maybe this scope is getting messed up due to function
inlining?
Still smells like a compiler bug, but if this is true, wrapping the
struct-generating function calls with compound-literals should be more
robust at preventing unexpected stack increases in the future.
As a plus, this makes all LFSR_DATA/ATTR macros lvalues, which is nice
for consistency.
---
Unfortunately, it does seem like GCC 11 is not able to elide moving
compound-literals all that well. Repeatedly nesting trivial
compound-literals results in a measurable increase in code cost, even
though it should theoretically be a noop with optimizations.
This results in an unfortunate code size increase:
code stack
before: 33780 2640
after: 33868 (+0.3%) 2640 (+0.0%)
But at some point you have to give up trying to work around
insufficiencies in the compiler. I'll take 100 bytes of code over 100
bytes of stack any day.
This turned into a sort of system-wide refactor based on what we've
learned about what we can do with lfsr_attr_t.
The big changes:
- Reverted LFSR_ATTR to mainly take lfsr_data_t again, keeping
lfsr_data_t as the default data representation in the codebase.
LFSR_ATTR_CAT_ still provides concatenation mechanics, and LFSR_ATTR_
provides a way to edit in-flight lfsr_attr_ts.
- Dropped lfsr_cat_t, replaced with explicit const void* + uint16_t,
tried to limit to low-level operations and prefer passing around
lfsr_attr_t and lfsr_data_t at a high level.
Note this cat + cat_count pair is quite similar to the common attrs +
attr_count and buffer + size arguments.
- Adopted lfsr_attr_t more in mid-level functions, lfsr_rbyd_appendattr,
lfsr_rbyd_appendcompactattr, lfsr_file_carve, etc. This is a bit more
ergonomic, allows for use of LFSR_ATTR* macros, and in theory might
even save a bit of stack.
Unfortunately this seems to have resulted in a net hit to code cost,
though I still think it's worth it for the internal ergonomics:
code stack
before: 33652 2624
after: 33780 (+0.4%) 2640 (+0.4%)
Investigating further suggests this may just be the result of compiler
noise and changes to argument placement. lfsr_attr_t does touch a lot of
code...
It's interesting to note the adoption of lfsr_attr_t in
lfsr_rbyd_appendattr* and friends prevents their transformation into
.isra functions, though this doesn't seem to impact code cost too much:
function (5 added, 5 removed)        osize  nsize  dsize
lfsr_cat_size                            -     48    +48  (+100.0%)
lfsr_file_carve                          -   1600  +1600  (+100.0%)
lfsr_rbyd_appendattr                     -   2120  +2120  (+100.0%)
lfsr_rbyd_appendattr_                    -    244   +244  (+100.0%)
lfsr_rbyd_appendcompactattr              -     68    +68  (+100.0%)
lfsr_rbyd_appendcompactrbyd            144    152     +8  (+5.6%)
lfsr_file_truncate                     298    314    +16  (+5.4%)
lfsr_mdir_commit__                    1056   1112    +56  (+5.3%)
lfsr_mdir_compact__                    502    526    +24  (+4.8%)
lfsr_rbyd_appendattrs                  132    138     +6  (+4.5%)
lfsr_file_fruncate                     386    402    +16  (+4.1%)
lfsr_data_frombtree                     84     86     +2  (+2.4%)
lfsr_rbyd_appendcksum                  512    520     +8  (+1.6%)
lfsr_file_opencfg                      572    580     +8  (+1.4%)
lfsr_rename                            608    616     +8  (+1.3%)
lfsr_mkdir                             500    504     +4  (+0.8%)
lfsr_bd_prog                           278    280     +2  (+0.7%)
lfsr_mdir_commit                      2364   2360     -4  (-0.2%)
lfsr_bshrub_commit                     716    712     -4  (-0.6%)
lfsr_file_sync                         526    514    -12  (-2.3%)
lfsr_file_flush_                      1868   1820    -48  (-2.6%)
lfsr_remove                            456    436    -20  (-4.4%)
lfsr_fs_fixgrm                         168    160     -8  (-4.8%)
lfsr_cat_size.isra.0                    42      -    -42  (-100.0%)
lfsr_file_carve.isra.0                1596      -  -1596  (-100.0%)
lfsr_rbyd_appendattr.isra.0           2088      -  -2088  (-100.0%)
lfsr_rbyd_appendattr_.isra.0           232      -   -232  (-100.0%)
lfsr_rbyd_appendcompactattr.isra.0      56      -    -56  (-100.0%)
TOTAL                                33652  33780   +128  (+0.4%)
If we have control of the struct, such as in lfsr_data_t and lfsr_cat_t,
moving the common tag outside of the union avoids naming ambiguities.
Counter-example: This doesn't work for lfsr_bshrub_t, since the contents
of that union are also used as separate types elsewhere. Fortunately the
common initial sequence union rules kick in here.
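For illustration, a self-contained sketch of the two patterns (generic
names, not the real lfsr_* definitions):

    #include <stdint.h>

    // when we control the struct, the shared tag can live outside the
    // union, so there is exactly one name for it
    typedef struct {
        uint16_t tag;
        union {
            struct {uint16_t size; const void *buffer;} buf;
            struct {uint16_t count; const void *datas;} cat;
        } u;
    } owned_t;

    // when the union's members are standalone types used elsewhere, the
    // tag has to stay inside each member, and C's common-initial-sequence
    // rule (C99 6.5.2.3) lets us inspect it through any member, even one
    // that wasn't the last written
    typedef struct {uint16_t tag; uint32_t weight;} btree_like_t;
    typedef struct {uint16_t tag; uint32_t estimate;} shrub_like_t;
    typedef union {
        btree_like_t btree;
        shrub_like_t shrub;
    } bshrub_like_t;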
No code changes, which is good:
code stack
before: 33652 2624
after: 33652 (+0.0%) 2624 (+0.0%)
Now that lfsr_attr_t is "small", or at least the same size as
lfsr_data_t, it makes sense to change the helper functions to take
lfsr_attr_t by value for consistency. These should all be inlined
anyways.
It's interesting to note there _are_ appendattr/progattr functions, but
these don't take lfsr_attr_t directly since we usually do some
last-minute modification to the attr's weight/tag.
Code cost is mostly unchanged, actually shaves off a few bytes, which is
a good sign:
code stack
before: 33664 2624
after: 33652 (-0.0%) 2624 (+0.0%)
Compound-literals weren't the culprit after all! It was... RVO
interactions with inlined function arguments?
To be honest I still don't quite understand what's going on, but I
present to you this madness:
code stack
before: 33664 2624
after: 33664 (+0.0%) 2624 (+0.0%)
I think I'm understanding a bit more how RVO interacts with inline
functions. And by that I mean I'm learning that the way RVO interacts
with inline functions is unfortunately very cursed...
Just take a look at this diff. This change should be a noop. But somehow
it saves 200 bytes of RAM:
code stack
before: 33684 2824
after: 33664 (-0.1%) 2624 (-7.1%)
I think what's happening is passing the result of lfsr_data_from* into
lfsr_data_cat is somehow preventing RVO, because the parameter would
need to be copied into the right argument slot? (argument registers?)
But we really don't need a copy, because lfsr_data_cat should end up
inlined. By inserting a compound literal, we force RVO, and all of these
unnecessary copies get cleaned up after lfsr_data_cat is inlined.
Keep in mind, in a perfect world, lfsr_data_cat should be a noop.
But I could be wrong about all of this. It's not really clear what the
compiler is doing, and I haven't dived that far into the disassembly...
The name is not super important, but note lfsr_data_cat matches
lfsr_attr_cat, which is a nice bit of consistency.
The main change here is the adoption of correct field assignments
instead of a hacky cast forcing lfsr_data_t -> lfsr_cat_t. Tests were
passing even with optimizations, but I was concerned about the longevity
of this approach.
As a plus, we can actually assert on size fitting into a uint16_t thanks
to the inline function.
Unfortunately, this creates a surprising stack penalty:
code stack
before: 33756 2624
after: 33684 (-0.2%) 2824 (+7.6%)
I've also played around with instead reverting lfsr_data_from* ->
lfsr_cat_from*, and providing the inverse lfsr_cat_data, but nothing
gets us quite back to LFSR_CAT_DAT stack:
code stack
before: 33756 2624
lfsr_data_cat: 33684 (-0.2%) 2824 (+7.6%)
lfsr_cat_data: 33872 (+0.3%) 2736 (+4.3%)
This needs more investigation. Unfortunately I don't think we can revert
this, since correctness wins over code/stack costs...
Now that compound-literals have been identified as the culprit, we can
actually adopt this smaller lfsr_attr_t representation without a random
code/stack increase.
This limits lfsr_cat_t's size field to 16-bits (15-bit size + 1-bit
for concatenated datas), allowing simple small attrs (the most common)
to save a word of RAM:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---.
| tag |-----------+--->| tag |c|size |
'---+---' | +---+---+---+---+
.-|--->| delta |
lfsr_srid_t | | +---+---+---+---+
.---+---+---+---. | | .->| ptr |
| delta |-' | | '---+---+---+---'
'---+---+---+---' | |
| |
lfsr_cat_t | |
.---+---+---+---. | |
|c|size |-----------' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
The non-trivial mapping of lfsr_cat_t to lfsr_attr_t does mean a bit
more complexity on lfsr_cat_t access, but now that we figured out the
compound-literal cost it seems the compiler is able to mostly elide
these.
The end result is some nice stack savings:
code stack
before: 33812 2712
after: 33756 (-0.2%) 2624 (-3.2%)
I've been fiddling around with our LFSR_ATTR macro to try to understand
why making it an inline function costs so much, and it seems like it's
not actually the inline function, but the compound literal that is the
problem. Specifically, returning a compound literal from an inline
function results in surprisingly poor code/stack costs!
I don't really know why this happens. Compiler bug/oversight related to
lvalues/rvalues? Compound literals interfering with RVO? Unsure.
I tried a few other struct initializers just in case it was related to
constness, but it seems the problem is the compound literal:
Inlined comp-lit:
    return (lfsr_attr_t){tag, delta, cat};
Inlined const comp-lit:
    return (const lfsr_attr_t){tag, delta, cat};
Inlined no-init:
    lfsr_attr_t attr;
    attr.tag = tag;
    attr.delta = delta;
    attr.cat = cat;
    return attr;
Inlined init:
    lfsr_attr_t attr = {tag, delta, cat};
    return attr;
Code/stack sizes:
                         code            stack
macro (before):          33852           2776
inline comp-lit:         34140 (+0.9%)   2760 (-0.6%)
inline const comp-lit:   34140 (+0.9%)   2760 (-0.6%)
inline no-init (after):  33812 (-0.1%)   2712 (-2.3%)
inline init:             33812 (-0.1%)   2712 (-2.3%)
The good news is this at least offers a route forward for crammed 15-bit
attrs.
I guess we should also go reassess other uses of compound literals in the
codebase...
When we're not dealing with cats/attrs, it seems useful to still have
lfsr_data_from* functions that return our general purpose lfsr_data_t
type.
Added LFSR_CAT_DAT to cheaply convert from lfsr_data_t -> lfsr_cat_t
(at least for simple buffers), and used this to keep the LFSR_CAT_*
macros, which are useful for attr-list construction.
Unfortunately, because lfsr_data_t is 3-words vs lfsr_cat_t 2-words,
this does add both code and stack cost:
code stack
before: 33672 2744
after: 33852 (+0.5%) 2776 (+1.2%)
It's interesting to note this is _not_ because of any LFSR_CAT_* usage!
I tested this explicitly and lfsr_data_from* -> LFSR_CAT_DAT adds no
cost over the previous lfsr_cat_from* functions. A win for GCC. This
cost only comes from the direct usage of the returned lfsr_data_t types
in our grm handling and branch -> btree encoding.
Still it's an annoying cost... Maybe this should be reverted? The
nuances of lfsr_cat_t vs lfsr_data_t are a bit annoying.
So, for example, these are equivalent:
    lfsr_cat_t cat = LFSR_CAT_BPTR(bptr);

    uint8_t buf[LFSR_BPTR_DSIZE];
    lfsr_cat_t cat = LFSR_CAT_BPTR_(bptr, buf);
The first leads to more readable code, but of course sometimes you need
explicit memory allocations.
This replaces lfsr_cat_frombptr, etc, though those functions are still
available. This name change is more relevant for LFSR_CAT_DATA/DATAS,
which involve a bit more complicated macros.
I think this was becoming too over-engineered. That and if you're
fighting the compiler too much, sometimes it's best to just let the
compiler win.
There is also the minor benefit of not needing to worry about uint16_t
overflows in lfsr_cat_t. One concern is it's easy for missed lfsr_data_t
indirections to go unnoticed.
This reverts lfsr_attr_t from 3 words -> 4 words, but allows easy
mapping to the lfsr_cat_t part of the lfsr_attr_t:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---. . . .---+---+---+---.
| tag |--------------->| tag | | | tag | |
'---+---' +---+---+---+---+ +---+---+---+---+
.----->| delta | | delta |
lfsr_srid_t | +---+---+---+---+ +---+---+---+---+
.---+---+---+---. | .--->|c| size | | cat |
| delta |-' | +---+---+---+---+ | |
'---+---+---+---' | .->| ptr | | |
| | '---+---+---+---' . . '---+---+---+---'
lfsr_cat_t | |
.---+---+---+---. | |
|c| size |---' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
This doesn't impact stack as much as you would expect due to compiler
overhead around inline functions operating on multi-word structs:
                code            stack
before 15-bit:  33672           2776
best 15-bit:    33900 (+0.7%)   2736 (-1.4%)
prev 15-bit:    34486 (+2.4%)   2912 (+4.9%)
after:          33672 (+0.0%)   2744 (-1.2%)
I think the only big thing not reverted was switching lfsr_cat_t from
storing the sum of the lfsr_data_ts sizes to just storing the number of
lfsr_data_ts. This is cheaper to initialize (because inline functions
are weirdly costly!), cheaper to iterate over, and we don't actually
need lfsr_cat_size that often, unlike lfsr_data_size.
Eventually we'll probably need to revisit this to optimize 31/15-bit,
etc, littlefs configurations. But that's a problem for another day...
The idea here was to carve out explicit space for the lfsr_attr_t tag in
lfsr_cat_t, so that when we create lfsr_attr_ts we don't need multiple
references to lfsr_cat_t and can revert back to a macro:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---. . . .---+---+---+---.
| tag |-------. .----->| delta | | delta |
'---+---' | | +---+---+---+---+ +---+---+---+---+
'-|-+--->| tag |c|size | | cat |
lfsr_srid_t | | +---+---+---+---+ | |
.---+---+---+---. | | .->| ptr | | |
| delta |-' | | '---+---+---+---' . . '---+---+---+---'
'---+---+---+---' | |
| |
lfsr_cat_t | |
.---+---+---+---. | |
| (tag) |c|size |---' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
But as a part of creating the lfsr_attr_t we need to mutate the tag in
lfsr_cat_t, which still required a static inline function, and ended up
making the code/stack size even worse!
code stack
before: 33900 2736
after: 34486 (+1.7%) 2912 (+6.4%)
I was hoping to leverage the last-update rule of C99's designated
initializers, but this weird rule didn't quite work how I expected. The
compiler is free to omit earlier struct initializers, even if a later
initializer only partially initializes the struct.
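For the curious, a self-contained example of the C99 rule in question
(not littlefs code):

    #include <stdio.h>

    struct point {int x; int y;};

    static int side_effect(void) {
        puts("evaluated");  // C99 6.7.8p19: may legally never print
        return 1;
    }

    int main(void) {
        // .x is initialized twice; the later initializer wins, and the
        // compiler is allowed to skip evaluating side_effect() entirely
        struct point p = {.x = side_effect(), .y = 2, .x = 3};
        printf("%d %d\n", p.x, p.y);  // prints "3 2"
        return 0;
    }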
Another interesting observation about lfsr_cat_t: We rarely actually
need to represent the full range of data:
- lfsr_file_write:block - Block writes call lfsr_bd_prog directly, don't
need to be represented as attrs.
- lfsr_file_write:fragment - Fragments already need the full
concatenated representation because of, uh, potential concatenation.
- lfsr_file_sync:sprouts - Inlined sprouts may use the full range.
Fortunately we can union an lfsr_data_t with the btree buffer, so
no extra stack cost.
- lfsr_getuattr/sattr (planned) - User/sys attributes may use the full
range. But these will probably not be on the stack hot-path.
Most attrs that use the simple buffer lfsr_cat_t representation are used
to encode internal structs, such as leb128s, ecksum, bptr, mptr, etc.
The largest of these right now is our bptr encoding, at 21 bytes, so
these easily fit in a short or a byte.
The choice of 15 bits (reserving one bit for the cat/buf representation)
is convenient, as it allows us to fit the cat size next to our 16-bit
lfsr_tag_t for free on 32-bit aligned systems:
lfsr_attr_t
.---+---+---+---.
| tag |c|size |
+---+---+---+---+
| delta |
+---+---+---+---+
| ptr |
'---+---+---+---'
Unfortunately, it seems like C _really_ wants to fight us on this one.
Not really because of the struct packing, but because of how we want
lfsr_attr_t to interact with lfsr_cat_t:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---.
| tag |-----------+--->| tag |c|size |
'---+---' | +---+---+---+---+
.-|--->| delta |
lfsr_srid_t | | +---+---+---+---+
.---+---+---+---. | | .->| ptr |
| delta |-' | | '---+---+---+---'
'---+---+---+---' | |
| |
lfsr_cat_t | |
.---+---+---+---. | |
|c|size |-----------' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
Initializing two fields with one argument is frustratingly impossible in
C99 unless you want to duplicate the argument tree, which we
_definitely_ don't want to do because this includes the actual data
encoding steps.
The only option is to use a static inline function. You might say "oh,
but static inline costs the same as a macro". But no. Switching to a
static inline function heavily penalizes this approach (6% of stack!):
code stack
macro: 33672 2776
inline: 34512 (+2.5%) 2952 (+6.3%)
after: 33900 (+0.7%) 2736 (-1.4%)
At least this does result in net stack savings, even with the inline
function penalty.
I'm committing this temporarily mostly just to record some _very_
interesting code/stack measurements.
This explores replacing the LFSR_ATTR macro with an inline function that
does the same thing:
    #define LFSR_ATTR(_tag, _delta, _cat) \
        ((const lfsr_attr_t){_tag, _delta, _cat})
vs:
    #define LFSR_ATTR(_tag, _delta, _cat) \
        lfsr_attr(_tag, _delta, _cat)

    static inline lfsr_attr_t lfsr_attr(
            lfsr_tag_t tag, lfsr_srid_t delta, lfsr_cat_t cat) {
        return (lfsr_attr_t){tag, delta, cat};
    }
The motivation for this is to eventually support more complex
lfsr_attr_t layouts. Specifically, it would be nice if we could break up
the lfsr_cat_t into separate size/ptr fields. Unfortunately we can't
declare temporaries in macros (I wish we had statement expressions), and
we really don't want to duplicate the entire cat tree, so an inline
function seems like the only way to accomplish this...
But static inline functions have the same cost as a macro you say?
No. This assumes a perfect compiler. And it's pretty unfair to compiler
developers to expect a perfect compiler.
To be fair, this is an extremely harsh test. We use LFSR_ATTR _heavily_,
which is why it's getting this much scrutiny. lfsr_attr also both takes
in a 2-word struct, and returns a _4_-word struct, which probably makes
things messy.
Still, the results are concerning:
code stack
macro: 33672 2776
inline: 34512 (+2.5%) 2952 (+6.3%)
always_inline: 34512 (+2.5%) 2952 (+6.3%)
noinline: 33920 (+0.7%) 2888 (+4.0%)
Measured with GCC 11 -mthumb -Os. I also measured with
__attribute__((always_inline/noinline)) just to see how that affected
things.
With LFSR_CAT_DATAS for explicit arrays of datas, and the biggest use
of concatenated data being a rather explicit construction in
lfsr_file_carve, I don't think we really need LFSR_CAT_CAT.
The only non-hacky use was to define LFSR_CAT_NAME. But we know names
always use exactly 2 datas, so this might as well use LFSR_CAT_DATAS.
I am going to use this soapbox to complain a bit about compound struct
literals. Why do we need an array declaration to elevate temporary
structs to automatic storage duration? I wish you could init a compound
literal with the struct itself...
    ✗ &f()
    ✓ &(uint32_t){f()}
    ✓ (uint32_t[]){f()}

    ✗ &f()
    ✗ &(lfsr_data_t){f()} :(
    ✓ (lfsr_data_t[]){f()}
Some hacky compound array literals were needed to replace the hacky
LFSR_CAT_CATs in lfsr_file_carve for this reason, but I guess it's a
hack for a hack so...
Code unchanged:
code stack
before: 33728 2776
after: 33728 (+0.0%) 2776 (+0.0%)
This attr-list allocator stuff is becoming over-engineered; these
allocations really aren't that complex...
This may have simplified after removing becksums, but if we need that
complexity again we can cross that bridge when we get to it.
Hand-allocating, dropping buf_size/data_count tracking, and refactoring
lfsr_file_carve to pre-encode the right sibling gives us a nice bit of
code/stack savings:
code stack
before: 33796 2808
after: 33728 (-0.2%) 2776 (-1.1%)
    lfsr_data_t datas[d];    =>  union {
    lfs_size_t data_count;           lfsr_data_t data;
    uint8_t buf[b];                  uint8_t buf[b'];
    lfs_size_t buf_size;         } datas[d+b];
                                 lfs_size_t data_count;
This trades off extra bookkeeping (data_count + buf_size vs data_count)
for less-tight stack overhead.
But this also saves a significant amount of RAM in lfsr_file_carve,
where we have exclusive fragments/bptrs for our left and right siblings.
So the end stack cost/savings mostly cancel out.
The end result seems like a net benefit for code cost:
code stack
before: 33872 2816
after: 33796 (-0.2%) 2808 (-0.3%)
    lfsr_attr_t attrs[a*d*b];    =>  lfsr_attr_t attrs[a];
    lfs_size_t attr_count;           lfs_size_t attr_count;
    lfs_size_t attr_scratch;         lfsr_data_t datas[d];
                                     lfs_size_t data_count;
                                     uint8_t buf[b];
                                     lfs_size_t buf_size;
This mostly reverts the allocator scaffolding needed for the attr-list
arenas (LFS_ALIGNOF, etc). This is the main draw of this change, as it
would be nice to avoid low-level arena implementation headaches unless
they prove to be worthwhile. Which they haven't really so far...
Unfortunately this comes with another code cost, I think due to the
number of counters needed to keep track of separate attr/data/buf
allocations. At least stack showed a slight improvement:
code stack
before: 33844 2824
after: 33872 (+0.1%) 2816 (-0.3%)
The idea is for cases where we need to incrementally allocate attrs +
context, to allocate from both sides of a statically allocated attr
array. This keeps all of the attr-list state in one place, simplifying
state allocation:
.---+---+---+---.
| attr |
+---+---+---+---+
| attr ----------.
+---+---+---+---+ |
| attr --------. |
+---+---+---+---+ | |
| | | | |
| v | | |
| | | |
| ^ | | |
| | | | |
+---+---+---+---+ | |
| data | <-' |
+---+---+---+---+ |
| encoded bptr | <---'
'---+---+---+---'
This is especially useful for the non-terminating tail-recursive
lfsr_btree_commit_, which needs to pass this state through a function
call.
Unfortunately, to make this work we needed to implement more-or-less a
full arena allocator, complete with annoying alignment handling. alignof
isn't even available in C99, so we needed a few more intrinsics:
- LFS_ALIGNOF(t) - Alignment of type t
- LFS_ALIGNEDSIZEOF(t) - Necessary size to force alignment for t
- LFS_MIN(a, b) - Compile-time min
- LFS_MAX(a, b) - Compile-time max
Technically only LFS_ALIGNOF was required, but the others are nice to
have. LFS_MIN/LFS_MAX is also useful anywhere you need to calculate
complicated compile-time sizes.
At least in C11 we get alignof, so we won't need compiler extensions/
hacks for this in the future...
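For reference, a common C99 stand-in for alignof, shown only as a sketch
of the sort of thing LFS_ALIGNOF has to do (the actual littlefs
definition may differ):

    #include <stddef.h>

    // the offset of a member placed right after a single char is exactly
    // that member's alignment requirement
    #define ALIGNOF(t) offsetof(struct {char c; t x;}, x)

    // e.g. ALIGNOF(uint32_t) is typically 4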
---
Unfortunately this ended up a net negative, pushing the code/stack
cost back up to near pre-cat levels:
code stack
before cat: 33856 2824
before scratch: 33812 (-0.1%) 2800 (-0.8%)
after: 33844 (-0.0%) 2824 (+0.0%)
I think the two main culprits are 1. the extra logic needed to calculate
alignment, and 2. wasted stack due to aligning scratch space up to the
nearest lfsr_attr_t.
So now, instead of one data type trying to do everything, we have two:
1. lfsr_data_t - Readable data, either in-RAM or on-disk
2. lfsr_cat_t - Concatenated data for progging, may be either a simple
in-RAM buffer or an indirect list of lfsr_data_ts
This comes from an observation that most lfsr_attr_t datas were either
simple buffers, NULL, or required the indirect concatenated datas
anyways (concatenated file fragments). By separating lfsr_cat_t and
lfsr_data_t, maybe we can save RAM in lfsr_attr_t by not needing the
three words necessary for the less-common disk references.
Note the interesting tradeoff:
Simple in-RAM buffers/NULL decrease by 1 word (4 bytes):
lfsr_data_t lfsr_cat_t
.---+---+---+---. .---+---+---+---.
|0| size | => |0| size |
+---+---+---+---+ +---+---+---+---+
| ptr | | ptr |
+---+---+---+---+ '---+---+---+---'
| (unused) |
'---+---+---+---'
'-------.-------' '-------.-------'
12 bytes 8 bytes
While on-disk references increase by 2 words (8 bytes):
lfsr_data_t lfsr_cat_t lfsr_data_t
.---+---+---+---. .---+---+---+---. .---+---+---+---.
|1| size | => |1| size | .>|1| size |
+---+---+---+---+ +---+---+---+---+ | +---+---+---+---+
| block | | ptr -------' | block |
+---+---+---+---+ '---+---+---+---' +---+---+---+---+
| off | | off |
'---+---+---+---' '---+---+---+---'
'-------.-------' '-----------------.-----------------'
12 bytes 20 bytes
Unless the on-disk references also need concatenation, in which case
this still saves 1 word (4 bytes).
Note I'm not sure this type split is generalizable to other systems. In
littlefs we can't use recursion, so progging concatenated datas already
required two nested functions, and we happen to never need to read
concatenated data, allowing us to completely omit that functionality. In
other systems, where maybe disk-reference attrs are more common, this
tradeoff may not make sense.
Some other things to note:
- We're also losing the inlined-data representation in this change.
Unfortunately earlier lfsr_data_t measurements showed that this didn't
really contribute much. It saved RAM in name attrs but added quite a
bit of complexity to lfsr_data_t operations.
- By separating simple/cat and RAM/disk, we reduce the abused size bits
from 2-bits down to 1-bit. This doesn't really matter for our current
31/28-bit littlefs impl, but is nice in that it reenables the
theoretical 31/31-bit littlefs impl without in-RAM data-structure
changes.
There are a few temporary hacks that need to be figured out, but this is
already showing code/stack savings. Which is fascinating considering the
new lfsr_cat_* functions and increased temporary allocations:
code stack
before: 33856 2824
after: 33812 (-0.1%) 2800 (-0.8%)
This feels more correct.
*_compact() is notably inconsistent with *_commit/appendattrs(), but I
think this is more a case of attrs/attr_count being a special case.
Code changes; this seems to just be compiler noise. Arg order can affect
quite a bit:
code stack
before: 33876 2824
after: 33856 (-0.1%) 2824 (+0.0%)
Note this is slightly different than cases where we use the sign-bit for
muxing two different types, such as `int err` and `lfsr_srid_t rid`. In
those cases we'd never extract the lower bits of the int
unconditionally.
This leads to fewer casts and I think signals the intention of these
sign-bit-is-flag ints a bit better. We aren't really interpreting these
as signed, and mask out other bits in some cases (lfsr_data_t).
This leads to more code in places, I'm guessing because of C treating
signed overflow as undefined behavior... Maybe this is a good thing:
code stack
before: 33856 2824
after: 33876 (+0.1%) 2824 (+0.0%)
We've been linking in this now-unused CRC table when we don't
need to be:
code stack
before: 33976 2824
after: 33856 (-0.4%) 2824 (+0.0%)
The only catch was its use in lfs_emubd to provide optional checksums
when debugging. But the actual checksum doesn't matter, so this can be
migrated to crc32c.
This now properly encodes the different eoff/estimate field usage
between the two types.
In theory this could save some RAM, but we don't actually allocate
lfsr_shrub_t anywhere it's not unioned with lfsr_btree_t, so:
code stack
before: 33976 2824
after: 33976 (+0.0%) 2824 (+0.0%)
Now that reproducibility issues with erase_value=-1 (erase=noop) are
fixed, this is much more useful to test than erase_value=0x1b. Especially
since erase=noop is filled with so many sharp corners.
These tests already found that we were being too confident with our
leb128/lleb128/tag parsing. Since we need to partially parse unfinished/
old commits, lfsr_dir_read* can easily encounter invalid leb128s during
normal operation. If this happens we should not assert.
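A hedged sketch of the idea (not the actual littlefs parser): a bounded
leb128 decode that reports corruption instead of asserting, since
invalid encodings are expected when scanning unfinished/old commits:

    #include <stdint.h>
    #include <stddef.h>

    #define ERR_CORRUPT (-84)  // stand-in for LFS_ERR_CORRUPT

    static int leb128_decode(const uint8_t *buf, size_t size,
            uint32_t *word) {
        uint32_t w = 0;
        for (size_t i = 0; i < size && i < 5; i++) {
            // the 5th byte can only contribute 4 more bits to 32 bits
            if (i == 4 && (buf[i] & 0xf0)) {
                return ERR_CORRUPT;
            }
            w |= (uint32_t)(buf[i] & 0x7f) << (7*i);
            if (!(buf[i] & 0x80)) {
                *word = w;
                return (int)(i+1);  // bytes consumed
            }
        }
        // ran out of bytes mid-encoding: corrupt, not a reason to assert
        return ERR_CORRUPT;
    }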
Doing things correctly has a bit of a cost:
code stack
before: 33928 2824
after: 33976 (+0.1%) 2824 (+0.0%)
At least we haven't seen any issues with our valid bit invalidating
logic yet.
Because reproducibility is extremely important, emubd always zeros
blocks on the first erase, even when erase_value=-1.
Well, at least it should be. We were correctly zeroing the blocks in
RAM, but if erase_value=-1 we were leaving the disk unzeroed, causing
the disk to fall out of sync.
Fixed by zeroing disk in lfs_emubd_createcfg, even if erase_value=-1.
Also I went ahead and dropped the bd->disk->scratch block. We're already
allocating RAM-backed blocks on erase anyways, so keeping scratch around
doesn't really gain us anything anymore. Now there is just a temporary
allocation in lfs_emubd_createcfg to zero the disk efficiently during
initialization.
It should be legal for block devices to return LFS_ERR_CORRUPT when
erased; this is common on devices with ECC, where the erased-state is
not valid ECC and results in LFS_ERR_CORRUPT.
If anything this is a better indicator than fixed-value erased-state,
but we need to make sure we track this with our ecksums consistently.
This gets a bit arbitrary.
Normally:
    valid = m[0] & 0x80
    cksum = crc32c(m)
If bd_read returns LFS_ERR_CORRUPT:
    valid = 0 & 0x80
    cksum = crc32c([])
Yeah, implementing this gets a bit funky, but the code cost is trivial:
code stack
before: 33924 2824
after: 33928 (+0.0%) 2824 (+0.0%)
Note this is only best effort right now; we really need tests over
erased=>LFS_ERR_CORRUPT...
Now that most scripts show relevant cksums, it makes sense for
dbgblock.py to just always show a cksum as well. It's not like this has
any noticeable impact on the script's runtime.
Example:
$ ./scripts/dbgblock.py disk -b4096 0
block 0x0, size 4096, cksum e6e3ad25
00000000: 01 00 00 00 80 03 00 08 6c 69 74 74 6c 65 66 73 ........littlefs
00000010: 80 04 00 02 00 00 80 05 00 02 fa 01 80 09 00 04 ................
...
We're using parity a lot more than popc now (actually, now that we don't
use CTZ skip-lists, do we use popc at all?), so it makes sense to use the
compiler's __builtin_parity intrinsic when possible.
On some processors parity can be much cheaper than popc. Notably, the
8080 family just includes a parity flag in the set of carry flags that
are implicitly updated on most ALU operations. Though I think this
approach didn't scale, you don't really see parity flags on most >8-bit
architectures...
Unfortunately, ARM thumb, our test arch, does not have a popc or parity
instruction. I guess because, thanks to implicit shifts in most
instructions, the tree-reduction solution is surprisingly cheap:
ea80 4010 eor.w r0, r0, r0, lsr #16
ea80 2010 eor.w r0, r0, r0, lsr #8
ea80 1010 eor.w r0, r0, r0, lsr #4
ea80 00c0 eor.w r0, r0, r0, lsr #2
ea80 0050 eor.w r0, r0, r0, lsr #1
f000 0001 and.w r0, r0, #1
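For reference, the same tree reduction written out in C (a generic
sketch, not necessarily the exact littlefs helper):

    #include <stdint.h>

    static inline uint32_t parity32(uint32_t x) {
        x ^= x >> 16;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return x & 1;
    }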
Both popc and parity benefit from this (GCC 11):
code
__popcountsi2: 40
__paritysi2: 32 (-20.0%)
So, thumb is not an arch where we see much benefit:
code stack
before: 33908 2824
after: 33924 (+0.0%) 2824 (+0.0%)
Not really sure where the +16 bytes come from, we removed several masks,
so I guess it's just bool vs int compiler noise?
Still, this may be useful for other archs with parity instructions/
hardware.
Turns out these are equivalent:
    cksum' = crc32c([d & ~0x80], cksum)
    cksum' = crc32c([d], cksum ^ (d & 0x80))
Which is quite nice. The second form is a bit cheaper and works better
in situations where you may have an immutable buffer.
I took the long way to find this and may or may not have brute forced
an xor mask for the valid bit:
    crc32c(62 95 e3 fd 00) => c7844d4d
    crc32c(00 00 00 00 80) => c7844d4d
But this is equivalent to 00000080 after xoring in the init junk.
If you look at the naive lfs_crc32c impl, the first step is to xor the
first byte, so really xoring any byte will cancel it out of our crc32c.
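If you want to convince yourself, here's a small self-contained check of
the identity using a naive bitwise crc32c (the real lfs_crc32c interface
may differ, but the identity only relies on the byte being xored into the
low byte of the running crc first):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    // naive reflected crc32c, polynomial 0x82f63b78
    static uint32_t crc32c(uint32_t crc, const void *buffer, size_t size) {
        const uint8_t *data = buffer;
        for (size_t i = 0; i < size; i++) {
            crc ^= data[i];
            for (int j = 0; j < 8; j++) {
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82f63b78 : 0);
            }
        }
        return crc;
    }

    int main(void) {
        for (unsigned d = 0; d < 256; d++) {
            uint32_t cksum = 0x12345678;  // arbitrary running checksum
            uint8_t masked = d & ~0x80;
            uint8_t raw = d;
            // crc32c([d & ~0x80], cksum) == crc32c([d], cksum ^ (d & 0x80))
            assert(crc32c(cksum, &masked, 1)
                    == crc32c(cksum ^ (d & 0x80), &raw, 1));
        }
        printf("equivalent\n");
        return 0;
    }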
Code changes; I thought this would save more because we can reuse bd
checksumming a bit better... Oh well, at least the theory works:
code stack
before: 33916 2824
after: 33908 (-0.0%) 2824 (+0.0%)
Long story short, rbyd checksums are now fully reproducible. If you
write the same set of tags to any block, you will end up with the same
checksum.
This is actually a bit tricky with littlefs's constraints.
---
The main problem boils down to erased-state. littlefs has a fairly
flexible model for erased-state, and this brings some challenges. In
littlefs, storage goes through 2 states:
1. Erase - Prepare storage for progging. Reads after an erase may return
arbitrary, but consistent, values.
2. Prog - Program storage with data. Storage must be erased and no progs
attempted. Reads after a prog must return the new data.
Note in this model erased-state may not be all 0xffs, though it likely
will be for flash. This allows littlefs to support a wide range of
other storage devices: SD, RAM, NVRAM, encryption, ECC, etc.
But this model also means erased-state may be different from block to
block, and even different on later erases of the same block.
And if that wasn't enough of a challenge, _erased-state can contain
perfectly valid commits_. Usually you can expect arbitrary valid cksums
to be rare, but thanks to SD, RAM, etc, modeling erase as a noop, valid
cksums in erased-state are actually very common.
So how do we manage erased-state in our rbyds?
First we need some way to detect it, since we can't prog if we're not
erased. This is accomplished by the forward-looking erased-state cksum
(ecksum):
.---+---+---+---. \
| commit | |
| | |
| | |
+---+---+---+---+ +-.
| ecksum -------. | | <-- ecksum - cksum of erased state
+---+---+---+---+ | / |
| cksum --------|---' <-- cksum - cksum of commit,
+---+---+---+---+ | including ecksum
| padding | |
| | |
+---+---+---+---+ \ |
| erased | +-'
| | /
. .
. .
You may have already noticed the start of our problems. The ecksum
contains the erased-state, which is different per-block, and our rbyd
cksum contains the ecksum. We need to include the ecksum so we know if
it's valid, but this means our rbyd cksum changes block to block.
Solving this is simple enough: Stop the rbyd's canonical cksum before
the ecksum, but include the ecksum in the actual cksum we write to disk.
Future commits will need to start from the canonical cksum, so the old
ecksum won't be included in new commits, but this shouldn't be a
problem:
.---+---+---+---. . . \ . \ . . . . .---+---+---+---. \ \
| commit | | | | commit | | |
| | | +- rbyd | | | |
| | | | cksum | | | |
+---+---+---+---+ +-. / +---+---+---+---+ | |
| ecksum -------. | | | ecksum | . .
+---+---+---+---+ | / | +---+---+---+---+ . .
| cksum --------|---' | cksum | . .
+---+---+---+---+ | +---+---+---+---+ . .
| padding | | | padding | . .
| | | | | . .
+---+---+---+---+ \ | . . . . . . . +---+---+---+---+ | |
| erased | +-' | commit | | |
| | / | | | +- rbyd
. . | | | | cksum
. . +---+---+---+---+ +-. /
| ecksum -------. | |
+---+---+---+---+ | / |
| cksum ------------'
+---+---+---+---+ |
| padding | |
| | |
+---+---+---+---+ \ |
| erased | +-'
| | /
. .
. .
The second challenge is the pesky possibility of existing valid commits.
We need some way to ensure that erased-state following a commit does not
accidentally contain a valid old commit.
This is where our tags' valid bits come into play: The valid bit of each
tag must match the parity of all preceding tags (equivalent to the
parity of the crc32c), and we can use some perturb bits in the cksum tag
to make sure any tags in our erased-state do _not_ match:
.---+---+---+---. \ . . . . . .---+---+---+---. \ \ \
|v| tag | | |v| tag | | | |
+---+---+---+---+ | +---+---+---+---+ | | |
| commit | | | commit | | | |
| | | | | | | |
+---+---+---+---+ +-----. +---+---+---+---+ +-. | |
|v|p| tag | | | |v|p| tag | | | | |
+---+---+---+---+ / | +---+---+---+---+ / | | |
| cksum | | | cksum | | . .
+---+---+---+---+ | +---+---+---+---+ | . .
| padding | | | padding | | . .
| | | | | | . .
+---+---+---+---+ . . . | . . +---+---+---+---+ | | |
|v---------------- != --' |v------------------' | |
| erased | +---+---+---+---+ | |
. . | commit | | |
. . | | | |
+---+---+---+---+ +-. +-.
|v|p| tag | | | | |
+---+---+---+---+ / | / |
| cksum ----------------'
+---+---+---+---+ |
| padding | |
| | |
+---+---+---+---+ |
|v---------------- != --'
| erased |
. .
. .
New problem! The rbyd cksum contains the valid bits, which contain the
perturb bits, which depend on the erased-state!
And you can't just derive the valid bits from the rbyd's canonical
cksum. This avoids erased-state poisoning, sure, but then nothing in the
new commit depends on the perturb bits! The catch-22 here is that we
need the valid bits to both depend on, and ignore, the erased-state
poisoned perturb bits.
As far as I can tell, the only way around this is to make the rbyd's
canonical cksum not include the parity bits. Which is annoying; masking
out bits is not great for bulk cksum calculation...
But this does solve our problem:
.---+---+---+---. \ . . . . . .---+---+---+---. \ \ \ \
|v| tag | | |v| tag | | | o o
+---+---+---+---+ | +---+---+---+---+ | | | |
| commit | | | commit | | | | |
| | | | | | | | |
+---+---+---+---+ +-----. +---+---+---+---+ +-. | | |
|v|p| tag | | | |v|p| tag | | | | . .
+---+---+---+---+ / | +---+---+---+---+ / | | . .
| cksum | | | cksum | | . . .
+---+---+---+---+ | +---+---+---+---+ | . . .
| padding | | | padding | | . . .
| | | | | | . . .
+---+---+---+---+ . . . | . . +---+---+---+---+ | | | |
|v---------------- != --' |v------------------' | o o
| erased | +---+---+---+---+ | | |
. . | commit | | | +- rbyd
. . | | | | | cksum
+---+---+---+---+ +-. +-. /
|v|p| tag | | | o |
+---+---+---+---+ / | / |
| cksum ----------------'
+---+---+---+---+ |
| padding | |
| | |
+---+---+---+---+ |
|v---------------- != --'
| erased |
. .
. .
Note that because each commit's cksum derives from the canonical cksum,
the valid bits and commit cksums no longer contain the same data, so our
parity(m) = parity(crc32c(m)) trick no longer works.
However our crc32c still does tell us a bit about each tag's parity, so
with a couple well-placed xors we can at least avoid needing two
parallel calculations:
    cksum' = crc32c(cksum, m)
    valid' = parity(cksum' xor cksum) xor valid
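A hedged transliteration of that update (hypothetical helper, not the
actual littlefs code); this leans on the same parity-preserving property
of crc32c mentioned above, so parity(cksum' ^ cksum) works out to the
parity of the appended bytes:

    #include <stdbool.h>
    #include <stdint.h>

    static inline bool parity_next(uint32_t cksum, uint32_t cksum_,
            bool valid) {
        // valid' = parity(cksum' xor cksum) xor valid
        return __builtin_parity(cksum_ ^ cksum) ^ valid;
    }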
This also means our commit cksums don't include any information about
the valid bits, since we mask these out before cksum calculation. Which
is a bit concerning, but as far as I can tell not a real problem.
---
An alternative design would be to just keep track of two cksums: A
commit cksum and a canonical cksum.
This would be much simpler, but would also require storing two cksums in
RAM in our lfsr_rbyd_t struct. A bit annoying for our 4-byte crc32cs,
and a bit more than a bit annoying for hypothetical 32-byte sha256s.
It's also not entirely clear how you would update both crc32cs
efficiently. There is a way to xor out the initial state before each
tag, but I think it would still require O(n) cycles of crc32c
calculation...
As it is, the extra bit needed to keep track of commit parity is easy
enough to sneak into some unused sign bits in our lfsr_rbyd_t struct.
---
I've also gone ahead and mixed in the current commit parity into our
cksum's perturb bits, so the commit cksum at least contains _some_
information about the previous parity.
But it's not entirely clear this actually adds anything. Our perturb
bits aren't _required_ to reflect the commit parity, so a very unlucky
power-loss could in theory still make a cksum valid for the wrong
parity.
At least this situation will be caught by later valid bits...
I've also carved out a tag encoding, LFSR_TAG_PERTURB, solely for adding
more perturb bits to commit cksums:
LFSR_TAG_CKSUM         0x3cpp  v-11 cccc -ppp pppp
LFSR_TAG_CKSUM         0x30pp  v-11 ---- -ppp pppp
LFSR_TAG_PERTURB       0x3100  v-11 ---1 ---- ----
LFSR_TAG_ECKSUM        0x3200  v-11 --1- ---- ----
LFSR_TAG_GCKSUMDELTA+  0x3300  v-11 --11 ---- ----
+ Planned
This allows for more than 7 perturb bits, and could even mix in the
entire previous commit cksum, if we ever think that is worth the RAM
tradeoff.
LFSR_TAG_PERTURB also has the advantage that it is validated by the
cksum tag's valid bit before being included in the commit cksum, which
indirectly includes the current commit parity. We may eventually want to
use this instead of the cksum tag's perturb bits for this reason, but
right now I'm not sure this tiny bit of extra safety is worth the
minimum 5-byte per commit overhead...
Note if you want perturb bits that are also included in the rbyd's
canonical cksum, you can just use an LFSR_TAG_SHRUBDATA tag. Or any
unreferenced shrub tag really.
---
All of these changes required a decent amount of code, I think mostly
just to keep track of the parity bit. But the isolation of rbyd cksums
from erased-state is necessary for several future-planned features:
code stack
before: 33564 2816
after: 33916 (+1.0%) 2824 (+0.3%)
The intention behind the quirky encoding was to leverage bit 1 to
indicate if the underlying file type would be backed by the common file
B-tree data structure. Looking forward, there may be several of these
types, compressed files, contiguous files, etc, that for all intents and
purposes are just normal files interpreted differently.
But trying to leverage too many bits like this is probably going to give
us a sparse, awkward, and confusing tag encoding, so I've reverted to a
hopefully more normal encoding:
LFSR_TAG_NAME      0x02tt  v--- --1- -ttt tttt
LFSR_TAG_NAME      0x0200  v--- --1- ---- ----
LFSR_TAG_REG       0x0201  v--- --1- ---- ---1
LFSR_TAG_DIR       0x0202  v--- --1- ---- --1-
LFSR_TAG_SYMLINK*  0x0203  v--- --1- ---- --11
LFSR_TAG_BOOKMARK  0x0204  v--- --1- ---- -1--
LFSR_TAG_ORPHAN    0x0205  v--- --1- ---- -1-1
LFSR_TAG_COMPR*    0x0206  v--- --1- ---- -11-
LFSR_TAG_CONTIG*   0x0207  v--- --1- ---- -111
* Hypothetical
Note the carve-out for the hypothetical symlink tag. Symlinks are
actually incredibly low in the priority list, but they are also
the only current hypothetical file type that would need to be exposed to
users. Grouping these up makes sense.
This will get a bit messy if we ever end up with a 4th user-facing type,
but there isn't any in POSIX at least (ignoring non-fs types, socket,
fifo, character, block, etc).
The gap also helps line things up so reg/orphan are a single bit flip,
and the non-user facing types all share a bit.
This had no impact on code size:
code stack
before: 33564 2816
after: 33564 (+0.0%) 2816 (+0.0%)
These b*/m* struct tags have a common pattern that would be good to
emphasize in the encoding. The later struct tags get a bit more messy as
they leave space for future possible extensions.
New encoding:
LFSR_TAG_STRUCT   0x03tt  v--- --11 -ttt ttrr
LFSR_TAG_DATA     0x0300  v--- --11 ---- ----
LFSR_TAG_BLOCK    0x0304  v--- --11 ---- -1rr
LFSR_TAG_BSHRUB   0x0308  v--- --11 ---- 1---
LFSR_TAG_BTREE    0x030c  v--- --11 ---- 11rr
LFSR_TAG_MROOT    0x0310  v--- --11 ---1 --rr
LFSR_TAG_MDIR     0x0314  v--- --11 ---1 -1rr
LFSR_TAG_MSHRUB*  0x0318  v--- --11 ---1 1---
LFSR_TAG_MTREE    0x031c  v--- --11 ---1 11rr
LFSR_TAG_DID      0x0320  v--- --11 --1- ----
LFSR_TAG_BRANCH   0x032c  v--- --11 --1- 11rr
* Hypothetical
Note that all shrubs currently end with 1---, and all btrees, including
the awkward branch tag, end with 11rr.
This had no impact on code size:
code stack
before: 33564 2816
after: 33564 (+0.0%) 2816 (+0.0%)
Unfortunately block-level erased-state checksums (becksums) don't really
work as intended.
An invalid becksum _does_ signal that a prog has been attempted, but a
valid becksum does _not_ prove that a prog has _not_ been attempted.
Rbyd ecksums work, but only thanks to a combination of prioritizing
valid commits and the use of perturb bits to force erased-state changes.
It _is_ possible to end up with an ecksum collision, but only if you
1. lose power before completing a commit, and 2. end up with a
non-trivial crc32c collision. If this does happen, at the very least the
resulting commit will likely end up corrupted and thrown away later.
Block-level becksums, at least as originally designed, don't have either
of these protections. To make matters worse, the blocks these becksums
reference contain only raw user data. Write 0xffs into a file and you
will likely end up with a becksum collision!
This is a problem for a couple of reasons:
1. Progging multiple times to erased-state is likely to result in
corrupted data, though this is also likely to get caught with
validating writes.
Worst case, the resulting data looks valid, but with weakened data
retention.
2. Because becksums are stored in the copy-on-write metadata of the
file, attempting to open a file twice for writing (or more advanced
copy-on-write operations in the future) can lead to a situation where
a prog is attempted on _already committed_ data.
This is very bad and breaks copy-on-write guarantees.
---
So clearly becksums are not fit for purpose and should be dropped. What
can we replace them with?
The first option, implemented here, is RAM-tracked erased state. Give
each lfsr_file_t its own eblock/eoff fields to track the last known good
erased-state. And before each prog, clear eblock/eoff so we never
accidentally prog to the same erased-state twice.
It's interesting to note we don't currently clear eblock/eoff in all
file handles; this is ok only because we don't currently share
eblock/eoff across file handles. Each eblock/eoff is exclusive to the
lfsr_file_t and does not appear anywhere else in the system.
The main downside of this approach is that, well, the RAM-tracked
erase-state is only tracked in RAM. Block-level erased-state effectively
does not persist across reboots. I've considered adding some sort of
per-file erased-state tracking to the mdir that would need to be cleared
before use, but such a mechanism ends up quite complicated.
At the moment, I think the best second option is to put erased-state
tracking in the future-planned bmap. This would let you opt-in to
on-disk tracking of all erased-state in the system.
One nice thing about RAM-tracked erased-state is that it's not on disk,
so it's not really a compatibility concern and won't get in the way of
additional future erased-state tracking.
---
Benchmarking becksums vs RAM-tracking has been quite interesting. While
in theory becksums can track much more erased-state, it's quite unlikely
anything but the most recent erased-state actually ends up used. The end
result is no real measurable performance loss, and actually a minor
speedup because we don't need to calculate becksums on every block
write.
There are some pathological cases, such as multiple write heads, but
these are out-of-scope right now (note! multiple explicit file handles
currently handle this case beautifully because we don't share
eblock/eoff!)
Becksums were also relatively complicated, and needed extra scaffolding
to pass around/propagate as secondary tags alongside the primary bptr.
So trading these for RAM-tracking also gives us a nice bit of code/stack
savings, albeit at a 2-word RAM cost in lfsr_file_t:
code stack structs
before: 33888 2864 1096
after: 33564 (-1.0%) 2816 (-1.7%) 1104 (+0.7%)
lfsr_file_t before: 104
lfsr_file_t after: 112 (+7.7%)
Whoops. An unfortunate typo that went unnoticed. This should be a
64-byte (16-int) array, not a 64-int array.
A free 192 byte savings!
code stack
before: 34076 2864
after: 33888 (-0.6%) 2864 (+0.0%)
I think this makes a bit more sense.
I think the original reasoning for -x/--cksum was to match -x/--device
in dbgrbyd.py, but that flag no longer exists. This could go all the way
back to matching --xsum at some point, but I'm not sure.
Common hash related utils, sha256sum, md5sum, etc, use -c/--check to
validate their hash, so that's sort of prior art?