This matches how diff percentages are rendered, and simplifies the
internal table rendering by making Frac less of a special case. It also
allows for other kinds of notes in the future.
One concern is how all the notes are shoved to the side, which may make
it a bit harder to find related percentages. If this becomes annoying we
should probably look into interspersing all notes (including diff
percentages) between the relevant columns.
Before:
function                     lines           branches
lfsr_rbyd_appendattr         230/231  99.6%  172/192  89.6%
lfsr_rbyd_p_recolor          33/34    97.1%  11/12    91.7%
lfs_alloc                    40/42    95.2%  21/24    87.5%
lfsr_rbyd_appendcompaction   54/57    94.7%  39/42    92.9%
...
After:
function                     lines    branches
lfsr_rbyd_appendattr         230/231  172/192  (99.6%, 89.6%)
lfsr_rbyd_p_recolor          33/34    11/12    (97.1%, 91.7%)
lfs_alloc                    40/42    21/24    (95.2%, 87.5%)
lfsr_rbyd_appendcompaction   54/57    39/42    (94.7%, 92.9%)
...
Previously, with -d/--diff, we would only show non-zero percentages. But
this was ambiguous/confusing when dealing with multiple results
(stack.py, summary.py, etc).
To help with this, I've switched to showing all percentages unless all
percentages are zero (no change). This matches the -d/--diff row-hiding
logic, so by default all rows should show all percentages.
Note -p/--percent did not change, as it already showed all percentages
all of the time.
Noticed weird slowness when summarizing test results by suite vs case.
Turns out the way we accumulate results by overloading Python's __add__
quickly leads to O(n^2) behavior as we repeatedly concatenate
increasingly large lists.
Instead of doing anything sane, I've added a second, immutable length to
each list such that we can opportunistically reuse/mutate/append lists
in __add__. The end result should be O(n) most of the time.
Observe:
           lines   bytes
test.csv:  537749  64551874  62MiB

./scripts/summary.py test.csv -ftest_time -S
          before      after
-bcase:   0m51.772s   0m9.302s  (-82.0%)
-bsuite:  10m29.067s  0m9.357s  (-98.5%)
The main idea here is that diverse tests are better than many similar
tests.
Sure, if we throw fuzz tests at the system all day we'll eventually find
more bugs, but if a developer is in the loop that time is going to be
better spent writing specific tests targeting the fragile parts of the
system.
And don't worry, we can still throw fuzz tests at the system all day by
specifying explicit seeds with -DSEED=blah.
Changes:
- Limited dir-related powerloss fuzz testing to N <= 16.
These tests were the biggest culprit of excessive test runtime,
requiring O(n^2) redundant operations to recover from powerlosses
(they just replay the full sequence on powerloss).
- As a tradeoff, bumped most fuzz tests to a minimum of 20 seeds.
The big exception being the test_fwrite tests, which are heavily
parameterized and already take the most time to run. Each parameter
combination also multiplies the effective number of seeds, so
increasing the number of base seeds will probably have diminishing
returns.
- Limited test_fwrite_reversed to SIZE <= 4*1024*CHUNK.
Writing a file backwards is just about the worst way you could write a
file, since all buffering/coalescing expect writes to eventually make
forward progress. On the flip side, because it's uncommon, writing a
file backwards is also a great way to find bugs. But at some point a
compromise needs to be made.
Impacted test runtimes:
case                                 otime   ntime    dtime
test_btree_push_fuzz                   0.3     0.5     +0.2  (+60.2%)
test_btree_push_sparse_fuzz            0.4     3.3     +2.9  (+720.4%)
test_btree_update_fuzz                 0.4     0.9     +0.6  (+141.6%)
test_btree_update_sparse_fuzz          0.5     4.5     +4.1  (+857.4%)
test_btree_pop_fuzz                    0.6     2.3     +1.7  (+314.7%)
test_btree_pop_sparse_fuzz             1.2     5.7     +4.4  (+356.2%)
test_btree_split_fuzz                  0.5     1.4     +0.8  (+150.2%)
test_btree_split_sparse_fuzz           0.4     5.6     +5.1  (+1163.2%)
test_btree_find_fuzz                   0.5     0.7     +0.2  (+50.7%)
test_btree_find_sparse_fuzz            1.0     3.0     +2.0  (+189.8%)
test_btree_traversal_fuzz              0.6     2.3     +1.6  (+260.4%)
test_dirs_mkdir_many                   3.3     2.1     -1.3  (-37.8%)
test_dirs_mkdir_many_backwards         3.5     2.1     -1.4  (-39.9%)
test_dirs_mkdir_fuzz                 115.3   106.4     -8.9  (-7.7%)
test_dirs_rm_many                    283.9    76.8   -207.0  (-72.9%)
test_dirs_rm_many_backwards          216.1    80.6   -135.5  (-62.7%)
test_dirs_rm_fuzz                    647.0    68.5   -578.5  (-89.4%)
test_dirs_mv_many                     14.2    15.4     +1.1  (+7.9%)
test_dirs_mv_many_backwards           16.5    14.5     -2.1  (-12.5%)
test_dirs_mv_fuzz                   1932.5   156.7  -1775.8  (-91.9%)
test_dirs_general_fuzz               561.9    74.5   -487.4  (-86.7%)
test_dread_recursive_rm              336.6    46.2   -290.4  (-86.3%)
test_dread_recursive_mv               55.5    44.6    -11.0  (-19.8%)
test_fsync_rrrr_fuzz                   0.4     0.3     -0.1  (-18.4%)
test_fsync_wrrr_fuzz                   8.0    12.4     +4.5  (+56.0%)
test_fsync_wwww_fuzz                  13.2    33.4    +20.2  (+152.6%)
test_fsync_wwrr_fuzz                   5.4    50.9    +45.5  (+841.6%)
test_fsync_rwrw_fuzz                   2.4     8.4     +6.0  (+253.9%)
test_fsync_rwrw_sparse_fuzz            3.2     7.5     +4.2  (+129.9%)
test_fsync_rwtfrwtf_sparse_fuzz        6.1     8.5     +2.4  (+39.3%)
test_fsync_drrr_fuzz                  11.8     9.2     -2.6  (-21.8%)
test_fsync_wddd_fuzz                   9.3    11.9     +2.6  (+28.0%)
test_fsync_rwdrwd_fuzz                 1.6    33.1    +31.5  (+1963.4%)
test_fsync_rwdrwd_sparse_fuzz          0.3     1.8     +1.4  (+418.8%)
test_fsync_rwtfdrwtfd_sparse_fuzz      0.3     1.1     +0.8  (+260.2%)
test_fwrite_reversed                 728.5   345.2   -383.3  (-52.6%)
TOTAL                               7587.5  3792.3  -3795.2  (-50.0%)
Now that we have ~20 minutes of tests, it's good to know _why_ the tests
take ~20 minutes, and if this time is being spent well.
This adds the field test_time to test.py's -o/--output, which reports
the runtime of each test in seconds. This can be organized by suite,
case, etc, with our existing csv scripts.
Note I've limited the precision to only milliseconds (%.6f).
Realistically, this is plenty of precision, and with the number of
tests we have, extra digits can really add up!
                           lines            bytes
test.csv before:           525593           58432541  56MiB
test.csv full precision:   525593  (+0.0%)  69817693  67MiB  (+19.5%)
test.csv milli precision:  525593  (+0.0%)  63162935  60MiB  (+8.1%)
It still takes a bit of time to process this (50.3s), but now we can see
the biggest culprits of our ~20 minute test time:
$ ./scripts/summary.py test.csv -bcase -ftest_time -S
case                               test_time
...
test_fwrite_hole_compaction             74.4
test_fwrite_incr                       109.7
test_dirs_mkdir_fuzz                   115.3
test_fwrite_overwrite_compaction       132.4
test_rbyd_fuzz_append_removes          134.0
test_rbyd_fuzz_mixed                   136.3
test_rbyd_fuzz_sparse                  137.4
test_fwrite_w_seek                     144.1
test_rbyd_fuzz_create_deletes          144.8
test_dirs_rm_many_backwards            208.4
test_dirs_rm_many                      273.8
test_fwrite_fuzz_unaligned             283.2
test_dread_recursive_rm                316.7
test_fwrite_fuzz_aligned               551.0
test_dirs_general_fuzz                 552.8
test_dirs_rm_fuzz                      632.7
test_fwrite_reversed                   719.0
test_dirs_mv_fuzz                     1984.8
TOTAL                                 7471.3
Note this machine has 6 cores, 12 hthreads, 7471.3/60/6 => 20.8m, which
is why I don't run these tests single threaded.
It's really frustrating that it's impossible to create an uninitialized
expression with the scope of a compound-literal...
(I'm going to ignore that this is technically possible with alloca.)
The lack of uninitialized compound-literals forces each of our attribute
lists to make a decision: 1. use an implicit buffer and pay for
zero-initialization, or 2. use an explicit buffer, adding code noise
and risking out-of-date buffer sizes.
As a compromise, this commit adopts explicit buffers in most of the
low/mid-level layers, where the code is already pretty noisy but also
heavily scrutinized and iterated over to reduce code/stack costs. This
leaves the high-level layers with the hopefully safer and more readable
implicit buffers.
You can see this zero-initialization has a surprisingly high code cost
for what is otherwise a noop:
code stack
before: 33828 2632
after: 33656 (-0.5%) 2632 (+0.0%)
Now it is fit for purpose and can replace the explicit tag comparison +
assert in lfsr_rbyd_appendattr. Previously we had to check if delta==0,
but now we can just assert, since delta!=0 is invalid for noops.
Unfortunately this added a couple bytes of code. The disassembly for
lfsr_rbyd_appendattr is all shuffled up, so I guess this is just
compiler noise. At least it's better than an explicit delta check:
code stack
before: 33820 2632
check delta: 33832 (+0.0%) 2632 (+0.0%)
assert delta: 33828 (+0.0%) 2632 (+0.0%)
Turns out temporary struct-literals have a slightly better code/stack
footprint than array-literals. I guess because nuances around arrays in
C can cause problems for optimization passes?
This makes forcing lvalues for macro consistency much more appealing:
code stack
sometimes rvalues: 33780 2640
array lvalues (before): 33868 (+0.3%) 2640 (+0.0%)
struct lvalues (after): 33820 (+0.1%) 2632 (-0.3%)
I think what may be going on with the unexpected stack cost related to
struct passing is something to do with scoping and how it interacts
with function inlining + shrink wrapping.
Compound-literals have a scope limited by the current statement, and
while temporary structs _should_ have a scope limited to the current
expression, maybe this scope is getting messed up due to function
inlining?
Still smells like a compiler bug, but if this is true, wrapping the
struct-generating function calls with compound-literals should be more
robust at preventing unexpected stack increases in the future.
As a plus, this makes all LFSR_DATA/ATTR macros lvalues, which is nice
for consistency.
---
Unfortunately, it does seem like GCC 11 is not able to elide moving
compound-literals all that well. Repeatedly nesting trivial
compound-literals results in a measurable increase in code cost, even
though it should theoretically be a noop with optimizations.
This results in an unfortunate code size increase:
code stack
before: 33780 2640
after: 33868 (+0.3%) 2640 (+0.0%)
But at some point you have to give up trying to work around
insufficiencies in the compiler. I'll take 100 bytes of code over 100
bytes of stack any day.
This turned into a sort of system-wide refactor based on what we've
learned about what we can do with lfsr_attr_t.
The big changes:
- Reverted LFSR_ATTR to mainly take lfsr_data_t again, keeping
lfsr_data_t as the default data representation in the codebase.
LFSR_ATTR_CAT_ still provides concatenation mechanics, and LFSR_ATTR_
provides a way to edit in-flight lfsr_attr_ts.
- Dropped lfsr_cat_t, replaced with explicit const void* + uint16_t,
tried to limit to low-level operations and prefer passing around
lfsr_attr_t and lfsr_data_t at a high level.
Note this cat + cat_count pair is quite similar to the common attrs +
attr_count and buffer + size arguments.
- Adopted lfsr_attr_t more in mid-level functions, lfsr_rbyd_appendattr,
lfsr_rbyd_appendcompactattr, lfsr_file_carve, etc. This is a bit more
ergonomic, allows for use of LFSR_ATTR* macros, and in theory might
even save a bit of stack.
Unfortunately this seems to have resulted in a net hit to code cost,
though I still think it's worth it for the internal ergonomics:
code stack
before: 33652 2624
after: 33780 (+0.4%) 2640 (+0.4%)
Investigating further suggests this may just be the result of compiler
noise and changes to argument placement. lfsr_attr_t does touch a lot of
code...
It's interesting to note the adoption of lfsr_attr_t in
lfsr_rbyd_appendattr* and friends prevents their transformation into
.isra functions, though this doesn't seem to impact code cost too much:
function (5 added, 5 removed)        osize  nsize  dsize
lfsr_cat_size                            -     48    +48  (+100.0%)
lfsr_file_carve                          -   1600  +1600  (+100.0%)
lfsr_rbyd_appendattr                     -   2120  +2120  (+100.0%)
lfsr_rbyd_appendattr_                    -    244   +244  (+100.0%)
lfsr_rbyd_appendcompactattr              -     68    +68  (+100.0%)
lfsr_rbyd_appendcompactrbyd            144    152     +8  (+5.6%)
lfsr_file_truncate                     298    314    +16  (+5.4%)
lfsr_mdir_commit__                    1056   1112    +56  (+5.3%)
lfsr_mdir_compact__                    502    526    +24  (+4.8%)
lfsr_rbyd_appendattrs                  132    138     +6  (+4.5%)
lfsr_file_fruncate                     386    402    +16  (+4.1%)
lfsr_data_frombtree                     84     86     +2  (+2.4%)
lfsr_rbyd_appendcksum                  512    520     +8  (+1.6%)
lfsr_file_opencfg                      572    580     +8  (+1.4%)
lfsr_rename                            608    616     +8  (+1.3%)
lfsr_mkdir                             500    504     +4  (+0.8%)
lfsr_bd_prog                           278    280     +2  (+0.7%)
lfsr_mdir_commit                      2364   2360     -4  (-0.2%)
lfsr_bshrub_commit                     716    712     -4  (-0.6%)
lfsr_file_sync                         526    514    -12  (-2.3%)
lfsr_file_flush_                      1868   1820    -48  (-2.6%)
lfsr_remove                            456    436    -20  (-4.4%)
lfsr_fs_fixgrm                         168    160     -8  (-4.8%)
lfsr_cat_size.isra.0                    42      -    -42  (-100.0%)
lfsr_file_carve.isra.0                1596      -  -1596  (-100.0%)
lfsr_rbyd_appendattr.isra.0           2088      -  -2088  (-100.0%)
lfsr_rbyd_appendattr_.isra.0           232      -   -232  (-100.0%)
lfsr_rbyd_appendcompactattr.isra.0      56      -    -56  (-100.0%)
TOTAL                                33652  33780   +128  (+0.4%)
If we have control of the struct, such as in lfsr_data_t and lfsr_cat_t,
moving the common tag outside of the union avoids naming ambiguities.
Counter-example: This doesn't work for lfsr_bshrub_t, since the contents
of that union are also used as separate types elsewhere. Fortunately the
common initial sequence union rules kick in here.
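For illustration, a self-contained sketch of the two patterns (generic
names, not the real lfsr_* definitions):

    #include <stdint.h>

    // when we control the struct, the shared tag can live outside the
    // union, so there is exactly one name for it
    typedef struct {
        uint16_t tag;
        union {
            struct {uint16_t size; const void *buffer;} buf;
            struct {uint16_t count; const void *datas;} cat;
        } u;
    } owned_t;

    // when the union's members are standalone types used elsewhere, the
    // tag has to stay inside each member, and C's common-initial-sequence
    // rule (C99 6.5.2.3) lets us inspect it through any member, even one
    // that wasn't the last written
    typedef struct {uint16_t tag; uint32_t weight;} btree_like_t;
    typedef struct {uint16_t tag; uint32_t estimate;} shrub_like_t;
    typedef union {
        btree_like_t btree;
        shrub_like_t shrub;
    } bshrub_like_t;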
No code changes, which is good:
code stack
before: 33652 2624
after: 33652 (+0.0%) 2624 (+0.0%)
Now that lfsr_attr_t is "small", or at least the same size as
lfsr_data_t, it makes sense to change the helper functions to take
lfsr_attr_t by value for consistency. These should all be inlined
anyways.
It's interesting to note there _are_ appendattr/progattr functions, but
these don't take lfsr_attr_t directly since we usually do some
last-minute modification to the attr's weight/tag.
Code cost is mostly unchanged, actually shaves off a few bytes, which is
a good sign:
code stack
before: 33664 2624
after: 33652 (-0.0%) 2624 (+0.0%)
Compound-literals weren't the culprit after all! It was... RVO
interactions with inlined function arguments?
To be honest I still don't quite understand what's going on, but I
present to you this madness:
code stack
before: 33664 2624
after: 33664 (+0.0%) 2624 (+0.0%)
I think I'm understanding a bit more how RVO interacts with inline
functions. And by that I mean I'm learning that the way RVO interacts
with inline functions is unfortunately very cursed...
Just take a look at this diff. This change should be a noop. But somehow
it saves 200 bytes of RAM:
code stack
before: 33684 2824
after: 33664 (-0.1%) 2624 (-7.1%)
I think what's happening is passing the result of lfsr_data_from* into
lfsr_data_cat is somehow preventing RVO, because the parameter would
need to be copied into the right argument slot? (argument registers?)
But we really don't need a copy, because lfsr_data_cat should end up
inlined. By inserting a compound literal, we force RVO, and all of these
unnecessary copies get cleaned up after lfsr_data_cat is inlined.
Keep in mind, in a perfect world, lfsr_data_cat should be a noop.
But I could be wrong about all of this. It's not really clear what the
compiler is doing, and I haven't dived that far into the disassembly...
The name is not super important, but note lfsr_data_cat matches
lfsr_attr_cat, which is a nice bit of consistency.
The main change here is the adoption of correct field assignments
instead of a hacky cast forcing lfsr_data_t -> lfsr_cat_t. Tests were
passing even with optimizations, but I was concerned about the longevity
of this approach.
As a plus, we can actually assert on size fitting into a uint16_t thanks
to the inline function.
Unfortunately, this creates a surprising stack penalty:
code stack
before: 33756 2624
after: 33684 (-0.2%) 2824 (+7.6%)
I've also played around with instead reverting lfsr_data_from* ->
lfsr_cat_from*, and providing the inverse lfsr_cat_data, but nothing
gets us quite back to LFSR_CAT_DAT stack:
code stack
before: 33756 2624
lfsr_data_cat: 33684 (-0.2%) 2824 (+7.6%)
lfsr_cat_data: 33872 (+0.3%) 2736 (+4.3%)
This needs more investigation. Unfortunately I don't think we can revert
this, since correctness wins over code/stack costs...
Now that compound-literals have been identified as the culprit, we can
actually adopt this smaller lfsr_attr_t representation without a random
code/stack increase.
This limits lfsr_cat_t's size field to 16-bits (15-bit size + 1-bit
for concatenated datas), allowing simple small attrs (the most common)
to save a word of RAM:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---.
| tag |-----------+--->| tag |c|size |
'---+---' | +---+---+---+---+
.-|--->| delta |
lfsr_srid_t | | +---+---+---+---+
.---+---+---+---. | | .->| ptr |
| delta |-' | | '---+---+---+---'
'---+---+---+---' | |
| |
lfsr_cat_t | |
.---+---+---+---. | |
|c|size |-----------' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
The non-trivial mapping of lfsr_cat_t to lfsr_attr_t does mean a bit
more complexity on lfsr_cat_t access, but now that we figured out the
compound-literal cost it seems the compiler is able to mostly elide
these.
The end result is some nice stack savings:
code stack
before: 33812 2712
after: 33756 (-0.2%) 2624 (-3.2%)
I've been fiddling around with our LFSR_ATTR macro to try to understand
why making it an inline function costs so much, and it seems like it's
not actually the inline function, but the compound literal that is the
problem. Specifically, returning a compound literal from an inline
function results in surprisingly poor code/stack costs!
I don't really know why this happens. Compiler bug/oversight related to
lvalues/rvalues? Compound literals interfering with RVO? Unsure.
I tried a few other struct initializers just in case it was related to
constness, but it seems the problem is the compound literal:
Inlined comp-lit:
    return (lfsr_attr_t){tag, delta, cat};
Inlined const comp-lit:
    return (const lfsr_attr_t){tag, delta, cat};
Inlined no-init:
    lfsr_attr_t attr;
    attr.tag = tag;
    attr.delta = delta;
    attr.cat = cat;
    return attr;
Inlined init:
    lfsr_attr_t attr = {tag, delta, cat};
    return attr;
Code/stack sizes:
                         code            stack
macro (before):          33852           2776
inline comp-lit:         34140 (+0.9%)   2760 (-0.6%)
inline const comp-lit:   34140 (+0.9%)   2760 (-0.6%)
inline no-init (after):  33812 (-0.1%)   2712 (-2.3%)
inline init:             33812 (-0.1%)   2712 (-2.3%)
The good news is this at least offers a route forward for crammed 15-bit
attrs.
I guess we should also go reassess other uses of compound literals in the
codebase...
When we're not dealing with cats/attrs, it seems useful to still have
lfsr_data_from* functions that return our general purpose lfsr_data_t
type.
Added LFSR_CAT_DAT to cheaply convert from lfsr_data_t -> lfsr_cat_t
(at least for simple buffers), and used this to keep the LFSR_CAT_*
macros, which are useful for attr-list construction.
Unfortunately, because lfsr_data_t is 3-words vs lfsr_cat_t 2-words,
this does add both code and stack cost:
code stack
before: 33672 2744
after: 33852 (+0.5%) 2776 (+1.2%)
It's interesting to note this is _not_ because of any LFSR_CAT_* usage!
I tested this explicitly and lfsr_data_from* -> LFSR_CAT_DAT adds no
cost over the previous lfsr_cat_from* functions. A win for GCC. This
cost only comes from the direct usage of the returned lfsr_data_t types
in our grm handling and branch -> btree encoding.
Still it's an annoying cost... Maybe this should be reverted? The
nuances of lfsr_cat_t vs lfsr_data_t are a bit annoying.
So, for example, these are equivalent:
    lfsr_cat_t cat = LFSR_CAT_BPTR(bptr);

    uint8_t buf[LFSR_BPTR_DSIZE];
    lfsr_cat_t cat = LFSR_CAT_BPTR_(bptr, buf);
The first leads to more readable code, but of course sometimes you need
explicit memory allocations.
This replaces lfsr_cat_frombptr, etc, though those functions are still
available. This name change is more relevant for LFSR_CAT_DATA/DATAS,
which involve a bit more complicated macros.
I think this was becoming too over-engineered. That and if you're
fighting the compiler too much, sometimes it's best to just let the
compiler win.
There is also the minor benefit of not needing to worry about uint16_t
overflows in lfsr_cat_t. One concern is it's easy for missed lfsr_data_t
indirections to go unnoticed.
This reverts lfsr_attr_t from 3 words -> 4 words, but allows easy
mapping to the lfsr_cat_t part of the lfsr_attr_t:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---. . . .---+---+---+---.
| tag |--------------->| tag | | | tag | |
'---+---' +---+---+---+---+ +---+---+---+---+
.----->| delta | | delta |
lfsr_srid_t | +---+---+---+---+ +---+---+---+---+
.---+---+---+---. | .--->|c| size | | cat |
| delta |-' | +---+---+---+---+ | |
'---+---+---+---' | .->| ptr | | |
| | '---+---+---+---' . . '---+---+---+---'
lfsr_cat_t | |
.---+---+---+---. | |
|c| size |---' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
This doesn't impact stack as much as you would expect due to compiler
overhead around inline functions operating on multi-word structs:
                code            stack
before 15-bit:  33672           2776
best 15-bit:    33900 (+0.7%)   2736 (-1.4%)
prev 15-bit:    34486 (+2.4%)   2912 (+4.9%)
after:          33672 (+0.0%)   2744 (-1.2%)
I think the only big thing not reverted was switching lfsr_cat_t from
storing the sum of the lfsr_data_ts sizes to just storing the number of
lfsr_data_ts. This is cheaper to initialize (because inline functions
are weirdly costly!), cheaper to iterate over, and we don't actually
need lfsr_cat_size that often, unlike lfsr_data_size.
Eventually we'll probably need to revisit this to optimize 31/15-bit,
etc, littlefs configurations. But that's a problem for another day...
The idea here was to carve out explicit space for the lfsr_attr_t tag in
lfsr_cat_t, so that when we create lfsr_attr_ts we don't need multiple
references to lfsr_cat_t and can revert back to a macro:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---. . . .---+---+---+---.
| tag |-------. .----->| delta | | delta |
'---+---' | | +---+---+---+---+ +---+---+---+---+
'-|-+--->| tag |c|size | | cat |
lfsr_srid_t | | +---+---+---+---+ | |
.---+---+---+---. | | .->| ptr | | |
| delta |-' | | '---+---+---+---' . . '---+---+---+---'
'---+---+---+---' | |
| |
lfsr_cat_t | |
.---+---+---+---. | |
| (tag) |c|size |---' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
But as a part of creating the lfsr_attr_t we need to mutate the tag in
lfsr_cat_t, which still required a static inline function, and ended up
making the code/stack size even worse!
code stack
before: 33900 2736
after: 34486 (+1.7%) 2912 (+6.4%)
I was hoping to leverage the last-update rule of C99's designated
initializers, but this weird rule didn't quite work how I expected. The
compiler is free to omit earlier struct initializers, even if a later
initializer only partially initializes the struct.
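For the curious, a self-contained example of the C99 rule in question
(not littlefs code):

    #include <stdio.h>

    struct point {int x; int y;};

    static int side_effect(void) {
        puts("evaluated");  // C99 6.7.8p19: may legally never print
        return 1;
    }

    int main(void) {
        // .x is initialized twice; the later initializer wins, and the
        // compiler is allowed to skip evaluating side_effect() entirely
        struct point p = {.x = side_effect(), .y = 2, .x = 3};
        printf("%d %d\n", p.x, p.y);  // prints "3 2"
        return 0;
    }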
Another interesting observation about lfsr_cat_t: We rarely actually
need to represent the full range of data:
- lfsr_file_write:block - Block writes call lfsr_bd_prog directly, don't
need to be represented as attrs.
- lfsr_file_write:fragment - Fragments already need the full
concatenated representation because of, uh, potential concatenation.
- lfsr_file_sync:sprouts - Inlined sprouts may use the full range.
Fortunately we can union an lfsr_data_t with the btree buffer, so
no extra stack cost.
- lfsr_getuattr/sattr (planned) - User/sys attributes may use the full
range. But these will probably not be on the stack hot-path.
Most attrs that use the simple buffer lfsr_cat_t representation are used
to encode internal structs, such as leb128s, ecksum, bptr, mptr, etc.
The largest of these right now is our bptr encoding, at 21 bytes, so
these easily fit in a short or a byte.
The choice of 15 bits (reserving one bit for the cat/buf representation)
is convenient, as it allows us to fit the cat size next to our 16-bit
lfsr_tag_t for free on 32-bit aligned systems:
lfsr_attr_t
.---+---+---+---.
| tag |c|size |
+---+---+---+---+
| delta |
+---+---+---+---+
| ptr |
'---+---+---+---'
Unfortunately, it seems like C _really_ wants to fight us on this one.
Not really because of the struct packing, but because of how we want
lfsr_attr_t to interact with lfsr_cat_t:
lfsr_tag_t lfsr_attr_t
.---+---. .---+---+---+---.
| tag |-----------+--->| tag |c|size |
'---+---' | +---+---+---+---+
.-|--->| delta |
lfsr_srid_t | | +---+---+---+---+
.---+---+---+---. | | .->| ptr |
| delta |-' | | '---+---+---+---'
'---+---+---+---' | |
| |
lfsr_cat_t | |
.---+---+---+---. | |
|c|size |-----------' |
+---+---+---+---+ |
| ptr |-----'
'---+---+---+---'
Initializing two fields with one argument is frustratingly impossible in
C99 unless you want to duplicate the argument tree, which we
_definitely_ don't want to do because this includes the actual data
encoding steps.
The only option is to use a static inline function. You might say "oh,
but static inline costs the same as a macro". But no. Switching to a
static inline function heavily penalizes this approach (6% of stack!):
code stack
macro: 33672 2776
inline: 34512 (+2.5%) 2952 (+6.3%)
after: 33900 (+0.7%) 2736 (-1.4%)
At least this does result in net stack savings, even with the inline
function penalty.
I'm committing this temporarily mostly just to record some _very_
interesting code/stack measurements.
This explores replacing the LFSR_ATTR macro with an inline function that
does the same thing:
    #define LFSR_ATTR(_tag, _delta, _cat) \
        ((const lfsr_attr_t){_tag, _delta, _cat})
vs:
    #define LFSR_ATTR(_tag, _delta, _cat) \
        lfsr_attr(_tag, _delta, _cat)

    static inline lfsr_attr_t lfsr_attr(
            lfsr_tag_t tag, lfsr_srid_t delta, lfsr_cat_t cat) {
        return (lfsr_attr_t){tag, delta, cat};
    }
The motivation for this is to eventually support more complex
lfsr_attr_t layouts. Specifically, it would be nice if we could break up
the lfsr_cat_t into separate size/ptr fields. Unfortunately we can't
declare temporaries in macros (I wish we had statement expressions), and
we really don't want to duplicate the entire cat tree, so an inline
function seems like the only way to accomplish this...
But static inline functions have the same cost as a macro you say?
No. This assumes a perfect compiler. And it's pretty unfair to compiler
developers to expect a perfect compiler.
To be fair, this is an extremely harsh test. We use LFSR_ATTR _heavily_,
which is why it's getting this much scrutiny. lfsr_attr also both takes
in a 2-word struct, and returns a _4_-word struct, which probably makes
things messy.
Still, the results are concerning:
code stack
macro: 33672 2776
inline: 34512 (+2.5%) 2952 (+6.3%)
always_inline: 34512 (+2.5%) 2952 (+6.3%)
noinline: 33920 (+0.7%) 2888 (+4.0%)
Measured with GCC 11 -mthumb -Os. I also measured with
__attribute__((always_inline/noinline)) just to see how that affected
things.
With LFSR_CAT_DATAS for explicit arrays of datas, and the biggest use
of concatenated data being a rather explicit construction in
lfsr_file_carve, I don't think we really need LFSR_CAT_CAT.
The only non-hacky use was to define LFSR_CAT_NAME. But we know names
always use exactly 2 datas, so this might as well use LFSR_CAT_DATAS.
I am going to use this soapbox to complain a bit about compound struct
literals. Why do we need an array declaration to elevate temporary
structs to automatic storage duration? I wish you could init a compound
literal with the struct itself...
    ✗ &f()
    ✓ &(uint32_t){f()}
    ✓ (uint32_t[]){f()}

    ✗ &f()
    ✗ &(lfsr_data_t){f()} :(
    ✓ (lfsr_data_t[]){f()}
Some hacky compound array literals were needed to replace the hacky
LFSR_CAT_CATs in lfsr_file_carve for this reason, but I guess it's a
hack for a hack so...
Code unchanged:
code stack
before: 33728 2776
after: 33728 (+0.0%) 2776 (+0.0%)
This attr-list allocator stuff is becoming over-engineered; these
allocations really aren't that complex...
This may have simplified after removing becksums, but if we need that
complexity again we can cross that bridge when we get to it.
Hand-allocating, dropping buf_size/data_count tracking, and refactoring
lfsr_file_carve to pre-encode the right sibling gives us a nice bit of
code/stack savings:
code stack
before: 33796 2808
after: 33728 (-0.2%) 2776 (-1.1%)
    lfsr_data_t datas[d];    =>  union {
    lfs_size_t data_count;           lfsr_data_t data;
    uint8_t buf[b];                  uint8_t buf[b'];
    lfs_size_t buf_size;         } datas[d+b];
                                 lfs_size_t data_count;
This trades off extra bookkeeping (data_count + buf_size vs data_count)
for less-tight stack overhead.
But this also saves a significant amount of RAM in lfsr_file_carve,
where we have exclusive fragments/bptrs for our left and right siblings.
So the end stack cost/savings mostly cancel out.
The end result seems like a net benefit for code cost:
code stack
before: 33872 2816
after: 33796 (-0.2%) 2808 (-0.3%)
    lfsr_attr_t attrs[a*d*b];    =>  lfsr_attr_t attrs[a];
    lfs_size_t attr_count;           lfs_size_t attr_count;
    lfs_size_t attr_scratch;         lfsr_data_t datas[d];
                                     lfs_size_t data_count;
                                     uint8_t buf[b];
                                     lfs_size_t buf_size;
This mostly reverts the allocator scaffolding needed for the attr-list
arenas (LFS_ALIGNOF, etc). This is the main draw of this change, as it
would be nice to avoid low-level arena implementation headaches unless
they prove to be worthwhile. Which they haven't really so far...
Unfortunately this comes with another code cost, I think due to the
number of counters needed to keep track of separate attr/data/buf
allocations. At least stack showed a slight improvement:
code stack
before: 33844 2824
after: 33872 (+0.1%) 2816 (-0.3%)
The idea is for cases where we need to incrementally allocate attrs +
context, to allocate from both sides of a statically allocated attr
array. This keeps all of the attr-list state in one place, simplifying
state allocation:
.---+---+---+---.
| attr |
+---+---+---+---+
| attr ----------.
+---+---+---+---+ |
| attr --------. |
+---+---+---+---+ | |
| | | | |
| v | | |
| | | |
| ^ | | |
| | | | |
+---+---+---+---+ | |
| data | <-' |
+---+---+---+---+ |
| encoded bptr | <---'
'---+---+---+---'
This is especially useful for the non-terminating tail-recursive
lfsr_btree_commit_, which needs to pass this state through a function
call.
Unfortunately, to make this work we needed to implement more-or-less a
full arena allocator, complete with annoying alignment handling. alignof
isn't even available in C99, so we needed a few more intrinsics:
- LFS_ALIGNOF(t) - Alignment of type t
- LFS_ALIGNEDSIZEOF(t) - Necessary size to force alignment for t
- LFS_MIN(a, b) - Compile-time min
- LFS_MAX(a, b) - Compile-time max
Technically only LFS_ALIGNOF was required, but the others are nice to
have. LFS_MIN/LFS_MAX is also useful anywhere you need to calculate
complicated compile-time sizes.
At least in C11 we get alignof, so we won't need compiler extensions/
hacks for this in the future...
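For reference, a common C99 stand-in for alignof, shown only as a sketch
of the sort of thing LFS_ALIGNOF has to do (the actual littlefs
definition may differ):

    #include <stddef.h>

    // the offset of a member placed right after a single char is exactly
    // that member's alignment requirement
    #define ALIGNOF(t) offsetof(struct {char c; t x;}, x)

    // e.g. ALIGNOF(uint32_t) is typically 4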
---
Unfortunately this ended up a net negative, pushing the code/stack
cost back up to near pre-cat levels:
code stack
before cat: 33856 2824
before scratch: 33812 (-0.1%) 2800 (-0.8%)
after: 33844 (-0.0%) 2824 (+0.0%)
I think the two main culprits are 1. the extra logic needed to calculate
alignment, and 2. wasted stack due to aligning scratch space up to the
nearest lfsr_attr_t.
So now, instead of one data type trying to do everything, we have two:
1. lfsr_data_t - Readable data, either in-RAM or on-disk
2. lfsr_cat_t - Concatenated data for progging, may be either a simple
in-RAM buffer or an indirect list of lfsr_data_ts
This comes from an observation that most lfsr_attr_t datas were either
simple buffers, NULL, or required the indirect concatenated datas
anyways (concatenated file fragments). By separating lfsr_cat_t and
lfsr_data_t, maybe we can save RAM in lfsr_attr_t by not needing the
three words necessary for the less-common disk references.
Note the interesting tradeoff:
Simple in-RAM buffers/NULL decrease by 1 word (4 bytes):
lfsr_data_t lfsr_cat_t
.---+---+---+---. .---+---+---+---.
|0| size | => |0| size |
+---+---+---+---+ +---+---+---+---+
| ptr | | ptr |
+---+---+---+---+ '---+---+---+---'
| (unused) |
'---+---+---+---'
'-------.-------' '-------.-------'
12 bytes 8 bytes
While on-disk references increase by 2 words (8 bytes):
lfsr_data_t lfsr_cat_t lfsr_data_t
.---+---+---+---. .---+---+---+---. .---+---+---+---.
|1| size | => |1| size | .>|1| size |
+---+---+---+---+ +---+---+---+---+ | +---+---+---+---+
| block | | ptr -------' | block |
+---+---+---+---+ '---+---+---+---' +---+---+---+---+
| off | | off |
'---+---+---+---' '---+---+---+---'
'-------.-------' '-----------------.-----------------'
12 bytes 20 bytes
Unless the on-disk references also need concatenation, in which case
this still saves 1 word (4 bytes).
Note I'm not sure this type split is generalizable to other systems. In
littlefs we can't use recursion, so progging concatenated datas already
required two nested functions, and we happen to never need to read
concatenated data, allowing us to completely omit that functionality. In
other systems, where maybe disk-reference attrs are more common, this
tradeoff may not make sense.
Some other things to note:
- We're also losing the inlined-data representation in this change.
Unfortunately earlier lfsr_data_t measurements showed that this didn't
really contribute much. It saved RAM in name attrs but added quite a
bit of complexity to lfsr_data_t operations.
- By separating simple/cat and RAM/disk, we reduce the abused size bits
from 2-bits down to 1-bit. This doesn't really matter for our current
31/28-bit littlefs impl, but is nice in that it reenables the
theoretical 31/31-bit littlefs impl without in-RAM data-structure
changes.
There are a few temporary hacks that need to be figured out, but this is
already showing code/stack savings. Which is fascinating considering the
new lfsr_cat_* functions and increased temporary allocations:
code stack
before: 33856 2824
after: 33812 (-0.1%) 2800 (-0.8%)
This feels more correct.
*_compact() is notably inconsistent with *_commit/appendattrs(), but I
think this is more a case of attrs/attr_count being a special case.
Code changes; this seems to just be compiler noise. Arg order can affect
quite a bit:
code stack
before: 33876 2824
after: 33856 (-0.1%) 2824 (+0.0%)
Note this is slightly different than cases where we use the sign-bit for
muxing two different types, such as `int err` and `lfsr_srid_t rid`. In
those cases we'd never extract the lower bits of the int
unconditionally.
This leads to fewer casts and I think signals the intention of these
sign-bit-is-flag ints a bit better. We aren't really interpreting these
as signed, and mask out other bits in some cases (lfsr_data_t).
This leads to more code in places, I'm guessing because of C treating
signed overflow as undefined behavior... Maybe this is a good thing:
code stack
before: 33856 2824
after: 33876 (+0.1%) 2824 (+0.0%)
We've been linking in this now-unused CRC table when we don't
need to be:
code stack
before: 33976 2824
after: 33856 (-0.4%) 2824 (+0.0%)
The only catch was its use in lfs_emubd to provide optional checksums
when debugging. But the actual checksum doesn't matter, so this can be
migrated to crc32c.
This now properly encodes the different eoff/estimate field usage
between the two types.
In theory this could save some RAM, but we don't actually allocate
lfsr_shrub_t anywhere it's not unioned with lfsr_btree_t, so:
code stack
before: 33976 2824
after: 33976 (+0.0%) 2824 (+0.0%)
Now that reproducibility issues with erase_value=-1 (erase=noop) are
fixed, this is much more useful to test than erase_value=0x1b. Especially
since erase=noop is filled with so many sharp corners.
These tests already found that we were being too confident with our
leb128/lleb128/tag parsing. Since we need to partially parse unfinished/
old commits, lfsr_dir_read* can easily encounter invalid leb128s during
normal operation. If this happens we should not assert.
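A hedged sketch of the idea (not the actual littlefs parser): a bounded
leb128 decode that reports corruption instead of asserting, since
invalid encodings are expected when scanning unfinished/old commits:

    #include <stdint.h>
    #include <stddef.h>

    #define ERR_CORRUPT (-84)  // stand-in for LFS_ERR_CORRUPT

    static int leb128_decode(const uint8_t *buf, size_t size,
            uint32_t *word) {
        uint32_t w = 0;
        for (size_t i = 0; i < size && i < 5; i++) {
            // the 5th byte can only contribute 4 more bits to 32 bits
            if (i == 4 && (buf[i] & 0xf0)) {
                return ERR_CORRUPT;
            }
            w |= (uint32_t)(buf[i] & 0x7f) << (7*i);
            if (!(buf[i] & 0x80)) {
                *word = w;
                return (int)(i+1);  // bytes consumed
            }
        }
        // ran out of bytes mid-encoding: corrupt, not a reason to assert
        return ERR_CORRUPT;
    }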
Doing things correctly has a bit of a cost:
code stack
before: 33928 2824
after: 33976 (+0.1%) 2824 (+0.0%)
At least we haven't seen any issues with our valid bit invalidating
logic yet.
Because reproducibility is extremely important, emubd always zeros
blocks on the first erase, even when erase_value=-1.
Well, at least it should be. We were correctly zeroing the blocks in
RAM, but if erase_value=-1 we were leaving the disk unzeroed, causing
the disk to fall out of sync.
Fixed by zeroing disk in lfs_emubd_createcfg, even if erase_value=-1.
Also I went ahead and dropped the bd->disk->scratch block. We're already
allocating RAM-backed blocks on erase anyways, so keeping scratch around
doesn't really gain us anything anymore. Now there is just a temporary
allocation in lfs_emubd_createcfg to zero the disk efficiently during
initialization.
It should be legal for block devices to return LFS_ERR_CORRUPT when
erased; this is common on devices with ECC, where the erased-state is
not valid ECC and results in LFS_ERR_CORRUPT.
If anything this is a better indicator than fixed-value erased-state,
but we need to make sure we track this with our ecksums consistently.
This gets a bit arbitrary.
Normally:
    valid = m[0] & 0x80
    cksum = crc32c(m)
If bd_read returns LFS_ERR_CORRUPT:
    valid = 0 & 0x80
    cksum = crc32c([])
Yeah, implementing this gets a bit funky, but the code cost is trivial:
code stack
before: 33924 2824
after: 33928 (+0.0%) 2824 (+0.0%)
Note this is only best effort right now; we really need tests over
erased=>LFS_ERR_CORRUPT...
Now that most scripts show relevant cksums, it makes sense for
dbgblock.py to just always show a cksum as well. It's not like this has
any noticeable impact on the script's runtime.
Example:
$ ./scripts/dbgblock.py disk -b4096 0
block 0x0, size 4096, cksum e6e3ad25
00000000: 01 00 00 00 80 03 00 08 6c 69 74 74 6c 65 66 73 ........littlefs
00000010: 80 04 00 02 00 00 80 05 00 02 fa 01 80 09 00 04 ................
...
We're using parity a lot more than popc now (actually, now that we don't
use CTZ skip-lists, do we use popc at all?), so it makes sense to use the
compiler's __builtin_parity intrinsic when possible.
On some processors parity can be much cheaper than popc. Notably, the
8080 family just includes a parity flag in the set of carry flags that
are implicitly updated on most ALU operations. Though I think this
approach didn't scale, you don't really see parity flags on most >8-bit
architectures...
Unfortunately, ARM thumb, our test arch, does not have a popc or parity
instruction. I guess because, thanks to implicit shifts in most
instructions, the tree-reduction solution is surprisingly cheap:
ea80 4010 eor.w r0, r0, r0, lsr #16
ea80 2010 eor.w r0, r0, r0, lsr #8
ea80 1010 eor.w r0, r0, r0, lsr #4
ea80 00c0 eor.w r0, r0, r0, lsr #2
ea80 0050 eor.w r0, r0, r0, lsr #1
f000 0001 and.w r0, r0, #1
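For reference, the same tree reduction written out in C (a generic
sketch, not necessarily the exact littlefs helper):

    #include <stdint.h>

    static inline uint32_t parity32(uint32_t x) {
        x ^= x >> 16;
        x ^= x >> 8;
        x ^= x >> 4;
        x ^= x >> 2;
        x ^= x >> 1;
        return x & 1;
    }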
Both popc and parity benefit from this (GCC 11):
code
__popcountsi2: 40
__paritysi2: 32 (-20.0%)
So, thumb is not an arch where we see much benefit:
code stack
before: 33908 2824
after: 33924 (+0.0%) 2824 (+0.0%)
Not really sure where the +16 bytes come from, we removed several masks,
so I guess it's just bool vs int compiler noise?
Still, this may be useful for other archs with parity instructions/
hardware.
Turns out these are equivalent:
    cksum' = crc32c([d & ~0x80], cksum)
    cksum' = crc32c([d], cksum ^ (d & 0x80))
Which is quite nice. The second form is a bit cheaper and works better
in situations where you may have an immutable buffer.
I took the long way to find this and may or may not have brute forced
an xor mask for the valid bit:
    crc32c(62 95 e3 fd 00) => c7844d4d
    crc32c(00 00 00 00 80) => c7844d4d
But this is equivalent to 00000080 after xoring in the init junk.
If you look at the naive lfs_crc32c impl, the first step is to xor the
first byte, so really xoring any byte will cancel it out of our crc32c.
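If you want to convince yourself, here's a small self-contained check of
the identity using a naive bitwise crc32c (the real lfs_crc32c interface
may differ, but the identity only relies on the byte being xored into the
low byte of the running crc first):

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    // naive reflected crc32c, polynomial 0x82f63b78
    static uint32_t crc32c(uint32_t crc, const void *buffer, size_t size) {
        const uint8_t *data = buffer;
        for (size_t i = 0; i < size; i++) {
            crc ^= data[i];
            for (int j = 0; j < 8; j++) {
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82f63b78 : 0);
            }
        }
        return crc;
    }

    int main(void) {
        for (unsigned d = 0; d < 256; d++) {
            uint32_t cksum = 0x12345678;  // arbitrary running checksum
            uint8_t masked = d & ~0x80;
            uint8_t raw = d;
            // crc32c([d & ~0x80], cksum) == crc32c([d], cksum ^ (d & 0x80))
            assert(crc32c(cksum, &masked, 1)
                    == crc32c(cksum ^ (d & 0x80), &raw, 1));
        }
        printf("equivalent\n");
        return 0;
    }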
Code changes; I thought this would save more because we can reuse bd
checksumming a bit better... Oh well, at least the theory works:
code stack
before: 33916 2824
after: 33908 (-0.0%) 2824 (+0.0%)
Long story short, rbyd checksums are now fully reproducible. If you
write the same set of tags to any block, you will end up with the same
checksum.
This is actually a bit tricky with littlefs's constraints.
---
The main problem boils down to erased-state. littlefs has a fairly
flexible model for erased-state, and this brings some challenges. In
littlefs, storage goes through 2 states:
1. Erase - Prepare storage for progging. Reads after an erase may return
arbitrary, but consistent, values.
2. Prog - Program storage with data. Storage must be erased and no progs
attempted. Reads after a prog must return the new data.
Note in this model erased-state may not be all 0xffs, though it likely
will be for flash. This allows littlefs to support a wide range of
other storage devices: SD, RAM, NVRAM, encryption, ECC, etc.
But this model also means erased-state may be different from block to
block, and even different on later erases of the same block.
And if that wasn't enough of a challenge, _erased-state can contain
perfectly valid commits_. Usually you can expect arbitrary valid cksums
to be rare, but thanks to SD, RAM, etc, modeling erase as a noop, valid
cksums in erased-state are actually very common.
So how do we manage erased-state in our rbyds?
First we need some way to detect it, since we can't prog if we're not
erased. This is accomplished by the forward-looking erased-state cksum
(ecksum):
.---+---+---+---. \
| commit | |
| | |
| | |
+---+---+---+---+ +-.
| ecksum -------. | | <-- ecksum - cksum of erased state
+---+---+---+---+ | / |
| cksum --------|---' <-- cksum - cksum of commit,
+---+---+---+---+ | including ecksum
| padding | |
| | |
+---+---+---+---+ \ |
| erased | +-'
| | /
. .
. .
You may have already noticed the start of our problems. The ecksum
contains the erased-state, which is different per-block, and our rbyd
cksum contains the ecksum. We need to include the ecksum so we know if
it's valid, but this means our rbyd cksum changes block to block.
Solving this is simple enough: Stop the rbyd's canonical cksum before
the ecksum, but include the ecksum in the actual cksum we write to disk.
Future commits will need to start from the canonical cksum, so the old
ecksum won't be included in new commits, but this shouldn't be a
problem:
.---+---+---+---. . . \ . \ . . . . .---+---+---+---. \ \
| commit | | | | commit | | |
| | | +- rbyd | | | |
| | | | cksum | | | |
+---+---+---+---+ +-. / +---+---+---+---+ | |
| ecksum -------. | | | ecksum | . .
+---+---+---+---+ | / | +---+---+---+---+ . .
| cksum --------|---' | cksum | . .
+---+---+---+---+ | +---+---+---+---+ . .
| padding | | | padding | . .
| | | | | . .
+---+---+---+---+ \ | . . . . . . . +---+---+---+---+ | |
| erased | +-' | commit | | |
| | / | | | +- rbyd
. . | | | | cksum
. . +---+---+---+---+ +-. /
| ecksum -------. | |
+---+---+---+---+ | / |
| cksum ------------'
+---+---+---+---+ |
| padding | |
| | |
+---+---+---+---+ \ |
| erased | +-'
| | /
. .
. .
The second challenge is the pesky possibility of existing valid commits.
We need some way to ensure that erased-state following a commit does not
accidentally contain a valid old commit.
This is where our tags' valid bits come into play: The valid bit of each
tag must match the parity of all preceding tags (equivalent to the
parity of the crc32c), and we can use some perturb bits in the cksum tag
to make sure any tags in our erased-state do _not_ match:
.---+---+---+---. \ . . . . . .---+---+---+---. \ \ \
|v| tag | | |v| tag | | | |
+---+---+---+---+ | +---+---+---+---+ | | |
| commit | | | commit | | | |
| | | | | | | |
+---+---+---+---+ +-----. +---+---+---+---+ +-. | |
|v|p| tag | | | |v|p| tag | | | | |
+---+---+---+---+ / | +---+---+---+---+ / | | |
| cksum | | | cksum | | . .
+---+---+---+---+ | +---+---+---+---+ | . .
| padding | | | padding | | . .
| | | | | | . .
+---+---+---+---+ . . . | . . +---+---+---+---+ | | |
|v---------------- != --' |v------------------' | |
| erased | +---+---+---+---+ | |
. . | commit | | |
. . | | | |
+---+---+---+---+ +-. +-.
|v|p| tag | | | | |
+---+---+---+---+ / | / |
| cksum ----------------'
+---+---+---+---+ |
| padding | |
| | |
+---+---+---+---+ |
|v---------------- != --'
| erased |
. .
. .
New problem! The rbyd cksum contains the valid bits, which contain the
perturb bits, which depend on the erased-state!
And you can't just derive the valid bits from the rbyd's canonical
cksum. This avoids erased-state poisoning, sure, but then nothing in the
new commit depends on the perturb bits! The catch-22 here is that we
need the valid bits to both depend on, and ignore, the erased-state
poisoned perturb bits.
As far as I can tell, the only way around this is to make the rbyd's
canonical cksum not include the parity bits. Which is annoying; masking
out bits is not great for bulk cksum calculation...
But this does solve our problem:
.---+---+---+---. \ . . . . . .---+---+---+---. \ \ \ \
|v| tag | | |v| tag | | | o o
+---+---+---+---+ | +---+---+---+---+ | | | |
| commit | | | commit | | | | |
| | | | | | | | |
+---+---+---+---+ +-----. +---+---+---+---+ +-. | | |
|v|p| tag | | | |v|p| tag | | | | . .
+---+---+---+---+ / | +---+---+---+---+ / | | . .
| cksum | | | cksum | | . . .
+---+---+---+---+ | +---+---+---+---+ | . . .
| padding | | | padding | | . . .
| | | | | | . . .
+---+---+---+---+ . . . | . . +---+---+---+---+ | | | |
|v---------------- != --' |v------------------' | o o
| erased | +---+---+---+---+ | | |
. . | commit | | | +- rbyd
. . | | | | | cksum
+---+---+---+---+ +-. +-. /
|v|p| tag | | | o |
+---+---+---+---+ / | / |
| cksum ----------------'
+---+---+---+---+ |
| padding | |
| | |
+---+---+---+---+ |
|v---------------- != --'
| erased |
. .
. .
Note that because each commit's cksum derives from the canonical cksum,
the valid bits and commit cksums no longer contain the same data, so our
parity(m) = parity(crc32c(m)) trick no longer works.
However our crc32c still does tell us a bit about each tag's parity, so
with a couple well-placed xors we can at least avoid needing two
parallel calculations:
    cksum' = crc32c(cksum, m)
    valid' = parity(cksum' xor cksum) xor valid
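A hedged transliteration of that update (hypothetical helper, not the
actual littlefs code); this leans on the same parity-preserving property
of crc32c mentioned above, so parity(cksum' ^ cksum) works out to the
parity of the appended bytes:

    #include <stdbool.h>
    #include <stdint.h>

    static inline bool parity_next(uint32_t cksum, uint32_t cksum_,
            bool valid) {
        // valid' = parity(cksum' xor cksum) xor valid
        return __builtin_parity(cksum_ ^ cksum) ^ valid;
    }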
This also means our commit cksums don't include any information about
the valid bits, since we mask these out before cksum calculation. Which
is a bit concerning, but as far as I can tell not a real problem.
---
An alternative design would be to just keep track of two cksums: A
commit cksum and a canonical cksum.
This would be much simpler, but would also require storing two cksums in
RAM in our lfsr_rbyd_t struct. A bit annoying for our 4-byte crc32cs,
and a bit more than a bit annoying for hypothetical 32-byte sha256s.
It's also not entirely clear how you would update both crc32cs
efficiently. There is a way to xor out the initial state before each
tag, but I think it would still require O(n) cycles of crc32c
calculation...
As it is, the extra bit needed to keep track of commit parity is easy
enough to sneak into some unused sign bits in our lfsr_rbyd_t struct.
---
I've also gone ahead and mixed in the current commit parity into our
cksum's perturb bits, so the commit cksum at least contains _some_
information about the previous parity.
But it's not entirely clear this actually adds anything. Our perturb
bits aren't _required_ to reflect the commit parity, so a very unlucky
power-loss could in theory still make a cksum valid for the wrong
parity.
At least this situation will be caught by later valid bits...
I've also carved out a tag encoding, LFSR_TAG_PERTURB, solely for adding
more perturb bits to commit cksums:
LFSR_TAG_CKSUM         0x3cpp  v-11 cccc -ppp pppp
LFSR_TAG_CKSUM         0x30pp  v-11 ---- -ppp pppp
LFSR_TAG_PERTURB       0x3100  v-11 ---1 ---- ----
LFSR_TAG_ECKSUM        0x3200  v-11 --1- ---- ----
LFSR_TAG_GCKSUMDELTA+  0x3300  v-11 --11 ---- ----
+ Planned
This allows for more than 7 perturb bits, and could even mix in the
entire previous commit cksum, if we ever think that is worth the RAM
tradeoff.
LFSR_TAG_PERTURB also has the advantage that it is validated by the
cksum tag's valid bit before being included in the commit cksum, which
indirectly includes the current commit parity. We may eventually want to
use this instead of the cksum tag's perturb bits for this reason, but
right now I'm not sure this tiny bit of extra safety is worth the
minimum 5-byte per commit overhead...
Note if you want perturb bits that are also included in the rbyd's
canonical cksum, you can just use an LFSR_TAG_SHRUBDATA tag. Or any
unreferenced shrub tag really.
---
All of these changes required a decent amount of code, I think mostly
just to keep track of the parity bit. But the isolation of rbyd cksums
from erased-state is necessary for several future-planned features:
code stack
before: 33564 2816
after: 33916 (+1.0%) 2824 (+0.3%)
The intention behind the quirky encoding was to leverage bit 1 to
indicate if the underlying file type would be backed by the common file
B-tree data structure. Looking forward, there may be several of these
types, compressed files, contiguous files, etc, that for all intents and
purposes are just normal files interpreted differently.
But trying to leverage too many bits like this is probably going to give
us a sparse, awkward, and confusing tag encoding, so I've reverted to a
hopefully more normal encoding:
LFSR_TAG_NAME      0x02tt  v--- --1- -ttt tttt
LFSR_TAG_NAME      0x0200  v--- --1- ---- ----
LFSR_TAG_REG       0x0201  v--- --1- ---- ---1
LFSR_TAG_DIR       0x0202  v--- --1- ---- --1-
LFSR_TAG_SYMLINK*  0x0203  v--- --1- ---- --11
LFSR_TAG_BOOKMARK  0x0204  v--- --1- ---- -1--
LFSR_TAG_ORPHAN    0x0205  v--- --1- ---- -1-1
LFSR_TAG_COMPR*    0x0206  v--- --1- ---- -11-
LFSR_TAG_CONTIG*   0x0207  v--- --1- ---- -111
* Hypothetical
Note the carve-out for the hypothetical symlink tag. Symlinks are
actually incredibly low in the priority list, but they are also
the only current hypothetical file type that would need to be exposed to
users. Grouping these up makes sense.
This will get a bit messy if we ever end up with a 4th user-facing type,
but there isn't any in POSIX at least (ignoring non-fs types, socket,
fifo, character, block, etc).
The gap also helps line things up so reg/orphan are a single bit flip,
and the non-user facing types all share a bit.
This had no impact on code size:
code stack
before: 33564 2816
after: 33564 (+0.0%) 2816 (+0.0%)
These b*/m* struct tags have a common pattern that would be good to
emphasize in the encoding. The later struct tags get a bit more messy as
they leave space for future possible extensions.
New encoding:
LFSR_TAG_STRUCT   0x03tt  v--- --11 -ttt ttrr
LFSR_TAG_DATA     0x0300  v--- --11 ---- ----
LFSR_TAG_BLOCK    0x0304  v--- --11 ---- -1rr
LFSR_TAG_BSHRUB   0x0308  v--- --11 ---- 1---
LFSR_TAG_BTREE    0x030c  v--- --11 ---- 11rr
LFSR_TAG_MROOT    0x0310  v--- --11 ---1 --rr
LFSR_TAG_MDIR     0x0314  v--- --11 ---1 -1rr
LFSR_TAG_MSHRUB*  0x0318  v--- --11 ---1 1---
LFSR_TAG_MTREE    0x031c  v--- --11 ---1 11rr
LFSR_TAG_DID      0x0320  v--- --11 --1- ----
LFSR_TAG_BRANCH   0x032c  v--- --11 --1- 11rr
* Hypothetical
Note that all shrubs currently end with 1---, and all btrees, including
the awkward branch tag, end with 11rr.
This had no impact on code size:
code stack
before: 33564 2816
after: 33564 (+0.0%) 2816 (+0.0%)
Unfortunately block-level erased-state checksums (becksums) don't really
work as intended.
An invalid becksum _does_ signal that a prog has been attempted, but a
valid becksum does _not_ prove that a prog has _not_ been attempted.
Rbyd ecksums work, but only thanks to a combination of prioritizing
valid commits and the use of perturb bits to force erased-state changes.
It _is_ possible to end up with an ecksum collision, but only if you
1. lose power before completing a commit, and 2. end up with a
non-trivial crc32c collision. If this does happen, at the very least the
resulting commit will likely end up corrupted and thrown away later.
Block-level becksums, at least as originally designed, don't have either
of these protections. To make matters worse, the blocks these becksums
reference contain only raw user data. Write 0xffs into a file and you
will likely end up with a becksum collision!
This is a problem for a couple of reasons:
1. Progging multiple times to erased-state is likely to result in
corrupted data, though this is also likely to get caught with
validating writes.
Worst case, the resulting data looks valid, but with weakened data
retention.
2. Because becksums are stored in the copy-on-write metadata of the
file, attempting to open a file twice for writing (or more advanced
copy-on-write operations in the future) can lead to a situation where
a prog is attempted on _already committed_ data.
This is very bad and breaks copy-on-write guarantees.
---
So clearly becksums are not fit for purpose and should be dropped. What
can we replace them with?
The first option, implemented here, is RAM-tracked erased state. Give
each lfsr_file_t its own eblock/eoff fields to track the last known good
erased-state. And before each prog, clear eblock/eoff so we never
accidentally prog to the same erased-state twice.
It's interesting to note we don't currently clear eblock/eoff in all
file handles; this is ok only because we don't currently share
eblock/eoff across file handles. Each eblock/eoff is exclusive to the
lfsr_file_t and does not appear anywhere else in the system.
The main downside of this approach is that, well, the RAM-tracked
erase-state is only tracked in RAM. Block-level erased-state effectively
does not persist across reboots. I've considered adding some sort of
per-file erased-state tracking to the mdir that would need to be cleared
before use, but such a mechanism ends up quite complicated.
At the moment, I think the best second option is to put erased-state
tracking in the future-planned bmap. This would let you opt-in to
on-disk tracking of all erased-state in the system.
One nice thing about RAM-tracked erased-state is that it's not on disk,
so it's not really a compatibility concern and won't get in the way of
additional future erased-state tracking.
---
Benchmarking becksums vs RAM-tracking has been quite interesting. While
in theory becksums can track much more erased-state, it's quite unlikely
anything but the most recent erased-state actually ends up used. The end
result is no real measurable performance loss, and actually a minor
speedup because we don't need to calculate becksums on every block
write.
There are some pathological cases, such as multiple write heads, but
these are out-of-scope right now (note! multiple explicit file handles
currently handle this case beautifully because we don't share
eblock/eoff!)
Becksums were also relatively complicated, and needed extra scaffolding
to pass around/propagate as secondary tags alongside the primary bptr.
So trading these for RAM-tracking also gives us a nice bit of code/stack
savings, albeit at a 2-word RAM cost in lfsr_file_t:
code stack structs
before: 33888 2864 1096
after: 33564 (-1.0%) 2816 (-1.7%) 1104 (+0.7%)
lfsr_file_t before: 104
lfsr_file_t after: 112 (+7.7%)
Whoops. An unfortunate typo that went unnoticed. This should be a
64-byte (16-int) array, not a 64-int array.
A free 192 byte savings!
code stack
before: 34076 2864
after: 33888 (-0.6%) 2864 (+0.0%)
I think this makes a bit more sense.
I think the original reasoning for -x/--cksum was to match -x/--device
in dbgrbyd.py, but that flag no longer exists. This could go all the way
back to matching --xsum at some point, but I'm not sure.
Common hash related utils, sha256sum, md5sum, etc, use -c/--check to
validate their hash, so that's sort of prior art?