Commit Graph

1538 Commits

Christopher Haster
03ea2e6ac5 Tweaked cov.py, summary.py to render fraction percents as notes
This matches how diff percentages are rendered, and simplifies the
internal table rendering by making Frac less of a special case. It also
allows for other types of notes in the future.

One concern is how all the notes are shoved to the side, which may make
it a bit harder to find related percentages. If this becomes annoying we
should probably look into interspersing all notes (including diff
percentages) between the relevant columns.

Before:

  function                                   lines            branches
  lfsr_rbyd_appendattr             230/231   99.6%     172/192   89.6%
  lfsr_rbyd_p_recolor                33/34   97.1%       11/12   91.7%
  lfs_alloc                          40/42   95.2%       21/24   87.5%
  lfsr_rbyd_appendcompaction         54/57   94.7%       39/42   92.9%
  ...

After:

  function                           lines    branches
  lfsr_rbyd_appendattr             230/231     172/192 (99.6%, 89.6%)
  lfsr_rbyd_p_recolor                33/34       11/12 (97.1%, 91.7%)
  lfs_alloc                          40/42       21/24 (95.2%, 87.5%)
  lfsr_rbyd_appendcompaction         54/57       39/42 (94.7%, 92.9%)
  ...
2024-05-18 13:00:15 -05:00
Christopher Haster
827dddf62b Fixed make cov lfs.t.gcda -> lfs.t.a.gcda
This was missed when changing the behavior of test/bench suffixes
earlier. Easy fix.
2024-05-18 13:00:15 -05:00
Christopher Haster
1d88fa9864 In scripts -d/--diff, show either all percentages or none
Previously, with -d/--diff, we would only show non-zero percentages. But
this was ambiguous/confusing when dealing with multiple results
(stack.py, summary.py, etc).

To help with this, I've switched to showing all percentages unless all
percentages are zero (no change). This matches the -d/--diff row-hiding
logic, so by default all rows should show all percentages.

Note -p/--percent did not change, as it already showed all percentages
all of the time.
2024-05-18 13:00:15 -05:00
Christopher Haster
7a7da9680a Avoid O(n^2) folding in summary.py
Noticed weird slowness when summarizing test results by suite vs case.
Turns out the way we accumulate results by overloading Python's __add__
quickly leads to O(n^2) behavior as we repeatedly concatenate
increasingly large lists.

Instead of doing anything sane, I've added a second, immutable length to
each list such that we can opportunistically reuse/mutate/append lists
in __add__. The end result should be O(n) most of the time.
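
A minimal C sketch of the same idea (the actual change is to the Python
result types in summary.py; names here are illustrative): each result
records an immutable snapshot length into a shared growable array, so a
fold can append in place whenever nothing has extended the array since:

  #include <stdlib.h>
  #include <string.h>

  // the underlying shared, growable array
  typedef struct buf {
      int *items;
      size_t len;
  } buf_t;

  // a result is the first len items of a buf; this len never changes
  // after creation, which is what makes in-place reuse safe to detect
  typedef struct slice {
      buf_t *buf;
      size_t len;
  } slice_t;

  slice_t slice_add(slice_t a, const int *items, size_t count) {
      if (a.buf->len == a.len) {
          // nothing has appended since a was created, extend in place
          a.buf->items = realloc(a.buf->items,
                  (a.buf->len + count)*sizeof(int));
          memcpy(&a.buf->items[a.buf->len], items, count*sizeof(int));
          a.buf->len += count;
          return (slice_t){a.buf, a.buf->len};
      } else {
          // someone else owns the tail, fall back to a full O(n) copy
          buf_t *buf = malloc(sizeof(buf_t));
          buf->items = malloc((a.len + count)*sizeof(int));
          memcpy(buf->items, a.buf->items, a.len*sizeof(int));
          memcpy(&buf->items[a.len], items, count*sizeof(int));
          buf->len = a.len + count;
          return (slice_t){buf, buf->len};
      }
  }

Repeatedly folding with slice_add then hits the in-place path every
time, giving O(n) total instead of O(n^2).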

Observe:

             lines           bytes
  test.csv: 537749  64551874 62MiB

  ./scripts/summary.py test.csv -ftest_time -S

               before     after
  -bcase:   0m51.772s  0m9.302s (-82.0%)
  -bsuite: 10m29.067s  0m9.357s (-98.5%)
2024-05-18 13:00:15 -05:00
Christopher Haster
4920cb092c Fixed summary.py's float diff rendering precision
The internal Float type was incorrectly inheriting the diff rendering
from Int, which casts to, well, an int.
2024-05-18 13:00:15 -05:00
Christopher Haster
bd4a5e5ab3 Tried to better budget test runtime
The main idea here is that diverse tests are better than many similar
tests.

Sure, if we throw fuzz tests at the system all day we'll eventually find
more bugs, but if a developer is in the loop that time is going to be
better spent writing specific tests targeting the fragile parts of the
system.

And don't worry, we can still throw fuzz tests at the system all day by
specifying explicit seeds with -DSEED=blah.

Changes:

- Limited dir-related powerloss fuzz testing to N <= 16.

  These tests were the biggest culprit of excessive test runtime,
  requiring O(n^2) redundant operations to recover from powerlosses
  (they just replay the full sequence on powerloss).

- As a tradeoff, bumped most fuzz tests to a minimum of 20 seeds.

  The big exception being the test_fwrite tests, which are heavily
  parameterized and already take the most time to run. Each parameter
  combination also multiplies the effective number of seeds, so
  increasing the number of base seeds will probably have diminishing
  returns.

- Limited test_fwrite_reversed to SIZE <= 4*1024*CHUNK.

  Writing a file backwards is just about the worst way you could write a
  file, since all buffering/coalescing expect writes to eventually make
  forward progress. On the flip side, because it's uncommon, writing a
  file backwards is also a great way to find bugs. But at some point a
  compromise needs to be made.

Impacted test runtimes:

  case                                otime    ntime    dtime
  test_btree_push_fuzz                  0.3      0.5     +0.2 (+60.2%)
  test_btree_push_sparse_fuzz           0.4      3.3     +2.9 (+720.4%)
  test_btree_update_fuzz                0.4      0.9     +0.6 (+141.6%)
  test_btree_update_sparse_fuzz         0.5      4.5     +4.1 (+857.4%)
  test_btree_pop_fuzz                   0.6      2.3     +1.7 (+314.7%)
  test_btree_pop_sparse_fuzz            1.2      5.7     +4.4 (+356.2%)
  test_btree_split_fuzz                 0.5      1.4     +0.8 (+150.2%)
  test_btree_split_sparse_fuzz          0.4      5.6     +5.1 (+1163.2%)
  test_btree_find_fuzz                  0.5      0.7     +0.2 (+50.7%)
  test_btree_find_sparse_fuzz           1.0      3.0     +2.0 (+189.8%)
  test_btree_traversal_fuzz             0.6      2.3     +1.6 (+260.4%)
  test_dirs_mkdir_many                  3.3      2.1     -1.3 (-37.8%)
  test_dirs_mkdir_many_backwards        3.5      2.1     -1.4 (-39.9%)
  test_dirs_mkdir_fuzz                115.3    106.4     -8.9 (-7.7%)
  test_dirs_rm_many                   283.9     76.8   -207.0 (-72.9%)
  test_dirs_rm_many_backwards         216.1     80.6   -135.5 (-62.7%)
  test_dirs_rm_fuzz                   647.0     68.5   -578.5 (-89.4%)
  test_dirs_mv_many                    14.2     15.4     +1.1 (+7.9%)
  test_dirs_mv_many_backwards          16.5     14.5     -2.1 (-12.5%)
  test_dirs_mv_fuzz                  1932.5    156.7  -1775.8 (-91.9%)
  test_dirs_general_fuzz              561.9     74.5   -487.4 (-86.7%)
  test_dread_recursive_rm             336.6     46.2   -290.4 (-86.3%)
  test_dread_recursive_mv              55.5     44.6    -11.0 (-19.8%)
  test_fsync_rrrr_fuzz                  0.4      0.3     -0.1 (-18.4%)
  test_fsync_wrrr_fuzz                  8.0     12.4     +4.5 (+56.0%)
  test_fsync_wwww_fuzz                 13.2     33.4    +20.2 (+152.6%)
  test_fsync_wwrr_fuzz                  5.4     50.9    +45.5 (+841.6%)
  test_fsync_rwrw_fuzz                  2.4      8.4     +6.0 (+253.9%)
  test_fsync_rwrw_sparse_fuzz           3.2      7.5     +4.2 (+129.9%)
  test_fsync_rwtfrwtf_sparse_fuzz       6.1      8.5     +2.4 (+39.3%)
  test_fsync_drrr_fuzz                 11.8      9.2     -2.6 (-21.8%)
  test_fsync_wddd_fuzz                  9.3     11.9     +2.6 (+28.0%)
  test_fsync_rwdrwd_fuzz                1.6     33.1    +31.5 (+1963.4%)
  test_fsync_rwdrwd_sparse_fuzz         0.3      1.8     +1.4 (+418.8%)
  test_fsync_rwtfdrwtfd_sparse_fuzz     0.3      1.1     +0.8 (+260.2%)
  test_fwrite_reversed                728.5    345.2   -383.3 (-52.6%)
  TOTAL                              7587.5   3792.3  -3795.2 (-50.0%)
2024-05-18 13:00:09 -05:00
Christopher Haster
a5fe2706bd Added runtime measurements to test.py -o/--output
Now that we have ~20 minutes of tests, it's good to know _why_ the tests
take ~20 minutes, and if this time is being spent well.

This adds the field test_time to test.py's -o/--output, which reports
the runtime of each test in seconds. This can be organized by suite,
case, etc, with our existing csv scripts.

Note I've limited the precision to only milliseconds (%.3f).
Realistically, this is plenty of precision, and with the number of tests
we have, extra digits can really add up!

                             lines                   bytes
  test.csv before:          525593          58432541 56MiB
  test.csv full precision:  525593 (+0.0%)  69817693 67MiB (+19.5%)
  test.csv milli precision: 525593 (+0.0%)  63162935 60MiB (+8.1%)

It still takes a bit of time to process this (50.3s), but now we can see
the biggest culprits of our ~20 minute test time:

  $ ./scripts/summary.py test.csv -bcase -ftest_time -S
  case                                               test_time
  ...
  test_fwrite_hole_compaction                             74.4
  test_fwrite_incr                                       109.7
  test_dirs_mkdir_fuzz                                   115.3
  test_fwrite_overwrite_compaction                       132.4
  test_rbyd_fuzz_append_removes                          134.0
  test_rbyd_fuzz_mixed                                   136.3
  test_rbyd_fuzz_sparse                                  137.4
  test_fwrite_w_seek                                     144.1
  test_rbyd_fuzz_create_deletes                          144.8
  test_dirs_rm_many_backwards                            208.4
  test_dirs_rm_many                                      273.8
  test_fwrite_fuzz_unaligned                             283.2
  test_dread_recursive_rm                                316.7
  test_fwrite_fuzz_aligned                               551.0
  test_dirs_general_fuzz                                 552.8
  test_dirs_rm_fuzz                                      632.7
  test_fwrite_reversed                                   719.0
  test_dirs_mv_fuzz                                     1984.8
  TOTAL                                                 7471.3

Note this machine has 6 cores, 12 hthreads, 7471.3/60/6 => 20.8m, which
is why I don't run these tests single threaded.
2024-05-11 23:37:59 -05:00
Christopher Haster
a9e3cad90a Adopted explicit buffers for low/mid-level attrs
It's really frustrating that it's impossible to create an uninitialized
expression with the scope of a compound-literal...

(I'm going to ignore that this is technically possible with alloca.)

The lack of uninitialized compound-literals forces each of our attribute
lists to make a decision: 1. Use an implicit buffer and pay for
zero-initialization? or 2. Use an explicit buffer, adding code noise and
risking out-of-date buffer sizes.
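
For example, reusing the implicit/explicit LFSR_CAT_BPTR macros shown
further down in this log (a sketch of the tradeoff, not exact code):

  // 1. implicit buffer: readable, but the macro's hidden
  //    compound-literal buffer must be zero-initialized per C99
  lfsr_cat_t cat = LFSR_CAT_BPTR(bptr);

  // 2. explicit buffer: no zeroing, but noisier, and the buffer size
  //    can silently fall out of date
  uint8_t buf[LFSR_BPTR_DSIZE];
  lfsr_cat_t cat = LFSR_CAT_BPTR_(bptr, buf);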

As a compromise, this commit adopts explicit buffers in most of the
low/mid-level layers, where the code is already pretty noisy but also
heavily scrutinized and iterated over to reduce code/stack costs. This
leaves the high-level layers with the hopefully safer and more readable
implicit buffers.

You can see this zero initializing has a surprisingly high code cost,
for what is otherwise a noop:

           code          stack
  before: 33828           2632
  after:  33656 (-0.5%)   2632 (+0.0%)
2024-05-10 23:26:00 -05:00
Christopher Haster
b36663f9f3 Tweaked lfsr_attr_isnoop to assert on delta != 0
Now it is fit for purpose and can replace the explicit tag comparison +
assert in lfsr_rbyd_appendattr. Previously we had to check if delta==0,
but now we just assert that noops never have a non-zero delta.

Unfortunately this added a couple bytes of code. The disassembly for
lfsr_rbyd_appendattr is all shuffled up, so I guess this is just
compiler noise. At least it's better than an explicit delta check:

                 code          stack
  before:       33820           2632
  check delta:  33832 (+0.0%)   2632 (+0.0%)
  assert delta: 33828 (+0.0%)   2632 (+0.0%)
2024-05-10 22:52:45 -05:00
Christopher Haster
60179e8f56 Replaced macro array-lits with struct-lits to force lvalues
Turns out temporary struct-literals have a slightly better code/stack
footprint than array-literals. I guess nuances around arrays in C can
cause problems for optimization passes?

This makes forcing lvalues for macro consistency much more appealing:

                           code          stack
  sometimes rvalues:      33780           2640
  array lvalues (before): 33868 (+0.3%)   2640 (+0.0%)
  struct lvalues (after): 33820 (+0.1%)   2632 (-0.3%)
2024-05-10 18:42:17 -05:00
Christopher Haster
216881ede3 Changed all LFSR_DATA/ATTR macros to create lvalues
I think what may be going on with the unexpected stack cost related to
struct passing is something to do with scoping and how it interacts with
function inlining + shrink wrapping.

Compound-literals have a scope limited by the current statement, and
while temporary structs _should_ have a scope limited to the current
expression, maybe this scope is getting messed up due to function
inlining?

Still smells like a compiler bug, but if this is true, wrapping the
struct-generating function calls with compound-literals should be more
robust at preventing unexpected stack increases in the future.

As a plus, this makes all LFSR_DATA/ATTR macros lvalues, which is nice
for consistency.

---

Unfortunately, it does seem like GCC 11 is not able to elide moving
compound-literals all that well. Repeatedly nesting trivial
compound-literals results in a measurable increase in code cost, even
though it should theoretically be a noop with optimizations.

This results in an unfortunate code size increase:

           code          stack
  before: 33780           2640
  after:  33868 (+0.3%)   2640 (+0.0%)

But at some point you have to give up trying to work around
insufficiencies in the compiler. I'll take 100 bytes of code over 100
bytes of stack any day.
2024-05-10 18:16:16 -05:00
Christopher Haster
cd22c0d68b Aggressively cleaned up/reworked lfsr_attr_t, consumed lfsr_cat_t
This turned into a sort of system-wide refactor based on learned
knowledge of what we can do with lfsr_attr_t.

The big changes:

- Reverted LFSR_ATTR to mainly take lfsr_data_t again, keeping
  lfsr_data_t as the default data representation in the codebase.

  LFSR_ATTR_CAT_ still provides concatenation mechanics, and LFSR_ATTR_
  provides a way to edit in-flight lfsr_attr_ts.

- Dropped lfsr_cat_t, replaced with explicit const void* + uint16_t,
  trying to limit these to low-level operations and preferring to pass
  around lfsr_attr_t and lfsr_data_t at a high level.

  Note this cat + cat_count pair is quite similar to the common attrs +
  attr_count and buffer + size arguments.

- Adopted lfsr_attr_t more in mid-level functions, lfsr_rbyd_appendattr,
  lfsr_rbyd_appendcompactattr, lfsr_file_carve, etc. This is a bit more
  ergonomic, allows for use of LFSR_ATTR* macros, and in theory might
  even save a bit of stack.

Unfortunately this seems to have resulted in a net hit to code cost,
though I still think it's worth it for the internal ergonomics:

           code          stack
  before: 33652           2624
  after:  33780 (+0.4%)   2640 (+0.4%)

Investigating further suggests this may just be the result of compiler
noise and changes to argument placement. lfsr_attr_t does touch a lot of
code...

It's interesting to note the adoption of lfsr_attr_t in
lfsr_rbyd_appendattr* and friends prevents their transformation into
.isra functions, though this doesn't seem to impact code cost too much:

  function (5 added, 5 removed)          osize   nsize   dsize
  lfsr_cat_size                              -      48     +48 (+100.0%)
  lfsr_file_carve                            -    1600   +1600 (+100.0%)
  lfsr_rbyd_appendattr                       -    2120   +2120 (+100.0%)
  lfsr_rbyd_appendattr_                      -     244    +244 (+100.0%)
  lfsr_rbyd_appendcompactattr                -      68     +68 (+100.0%)
  lfsr_rbyd_appendcompactrbyd              144     152      +8 (+5.6%)
  lfsr_file_truncate                       298     314     +16 (+5.4%)
  lfsr_mdir_commit__                      1056    1112     +56 (+5.3%)
  lfsr_mdir_compact__                      502     526     +24 (+4.8%)
  lfsr_rbyd_appendattrs                    132     138      +6 (+4.5%)
  lfsr_file_fruncate                       386     402     +16 (+4.1%)
  lfsr_data_frombtree                       84      86      +2 (+2.4%)
  lfsr_rbyd_appendcksum                    512     520      +8 (+1.6%)
  lfsr_file_opencfg                        572     580      +8 (+1.4%)
  lfsr_rename                              608     616      +8 (+1.3%)
  lfsr_mkdir                               500     504      +4 (+0.8%)
  lfsr_bd_prog                             278     280      +2 (+0.7%)
  lfsr_mdir_commit                        2364    2360      -4 (-0.2%)
  lfsr_bshrub_commit                       716     712      -4 (-0.6%)
  lfsr_file_sync                           526     514     -12 (-2.3%)
  lfsr_file_flush_                        1868    1820     -48 (-2.6%)
  lfsr_remove                              456     436     -20 (-4.4%)
  lfsr_fs_fixgrm                           168     160      -8 (-4.8%)
  lfsr_cat_size.isra.0                      42       -     -42 (-100.0%)
  lfsr_file_carve.isra.0                  1596       -   -1596 (-100.0%)
  lfsr_rbyd_appendattr.isra.0             2088       -   -2088 (-100.0%)
  lfsr_rbyd_appendattr_.isra.0             232       -    -232 (-100.0%)
  lfsr_rbyd_appendcompactattr.isra.0        56       -     -56 (-100.0%)
  TOTAL                                  33652   33780    +128 (+0.4%)
2024-05-10 15:43:08 -05:00
Christopher Haster
0fd955edb7 Prefer tag/size outside of union where possible
If we have control of the struct, such as in lfsr_data_t and lfsr_cat_t,
moving the common tag outside of the union avoids naming ambiguities.
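
A rough sketch of the lfsr_data_t case (field names illustrative,
layout per the lfsr_data_t diagrams further down in this log):

  typedef struct lfsr_data {
      // the common mode bit + size lives outside the union, so there's
      // no ambiguity over which union member to name it through
      lfs_size_t size;
      union {
          // in-RAM representation
          const uint8_t *buffer;
          // on-disk representation
          struct {
              lfs_block_t block;
              lfs_size_t off;
          } disk;
      } u;
  } lfsr_data_t;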

Counter-example: This doesn't work for lfsr_bshrub_t, since the contents
of that union are also used as separate types elsewhere. Fortunately the
common initial sequence union rules kick in here.

No code changes, which is good:

           code          stack
  before: 33652           2624
  after:  33652 (+0.0%)   2624 (+0.0%)
2024-05-10 01:58:54 -05:00
Christopher Haster
643bf5b3e0 Changed lfsr_attr_* helper functions to take lfsr_attr_t by value
Now that lfsr_attr_t is "small", or at least the same size as
lfsr_data_t, it makes sense to change the helper functions to take
lfsr_attr_t by value for consistency. These should all be inlined
anyways.

It's interesting to note there _are_ appendattr/progattr functions, but
these don't take lfsr_attr_t directly since we usually do some
last-minute modification to the attr's weight/tag.

Code cost is mostly unchanged, actually shaves off a few bytes, which is
a good sign:

           code          stack
  before: 33664           2624
  after:  33652 (-0.0%)   2624 (+0.0%)
2024-05-10 00:34:09 -05:00
Christopher Haster
0eb64d9f10 Brought back compound-literals in inline functions
Compound-literals weren't the culprit after all! It was... RVO
interactions with inlined function arguments?

To be honest I still don't quite understand what's going on, but I
present to you this madness:

          code           stack
  before: 33664           2624
  after:  33664 (+0.0%)   2624 (+0.0%)
2024-05-09 18:52:13 -05:00
Christopher Haster
f39057d2e1 Forced RVO in LFSR_CAT_* macros somehow
I think I'm understanding a bit more how RVO interacts with inline
functions. And by that I mean I'm learning that the way RVO interacts
with inline functions is unfortunately very cursed...

Just take a look at this diff. This change should be a noop. But somehow
it saves 200 bytes of RAM:

           code          stack
  before: 33684           2824
  after:  33664 (-0.1%)   2624 (-7.1%)

I think what's happening is passing the result of lfsr_data_from* into
lfsr_data_cat is somehow preventing RVO, because the parameter would
need to be copied into the right argument slot? (argument registers?)

But we really don't need a copy, because lfsr_data_cat should end up
inlined. By inserting a compound literal, we force RVO, and all of these
unnecessary copies get cleaned up after lfsr_data_cat is inlined.

Keep in mind, in a perfect world, lfsr_data_cat should be a noop.

But I could be wrong about all of this. It's not really clear what the
compiler is doing, and I haven't dived that far into the disassembly...
2024-05-09 18:52:02 -05:00
Christopher Haster
250c1dd57e Replaced LFSR_CAT_DAT with less-hacky lfsr_data_cat
The name is not super important, but note lfsr_data_cat matches
lfsr_attr_cat, which is a nice bit of consistency.

The main change here is the adoption of correct field assignments
instead of a hacky cast forcing lfsr_data_t -> lfsr_cat_t. Tests were
passing even with optimizations, but I was concerned about the longevity
of this approach.

As a plus, we can actually assert on size fitting into a uint16_t thanks
to the inline function.

Unfortunately, this creates a surprising stack penalty:

           code          stack
  before: 33756           2624
  after:  33684 (-0.2%)   2824 (+7.6%)

I've also played around with instead reverting lfsr_data_from* ->
lfsr_cat_from*, and providing the inverse lfsr_cat_data, but nothing
gets us quite back to LFSR_CAT_DAT stack:

                  code          stack
  before:        33756           2624
  lfsr_data_cat: 33684 (-0.2%)   2824 (+7.6%)
  lfsr_cat_data: 33872 (+0.3%)   2736 (+4.3%)

This needs more investigation. Unfortunately I don't think we can revert
this, since correctness wins over code/stack costs...
2024-05-09 18:01:10 -05:00
Christopher Haster
85fad999b8 Readopted 16-bit crammed size lfsr_attr_ts
Now that compound-literals have been identified as the culprit, we can
actually adopt this smaller lfsr_attr_t representation without a random
code/stack increase.

This limits lfsr_cat_t's size field to 16-bits (15-bit size + 1-bit
for concatenated datas), allowing simple small attrs (the most common)
to save a word of RAM:

  lfsr_tag_t               lfsr_attr_t
  .---+---.                .---+---+---+---.
  |  tag  |-----------+--->|  tag  |c|size |
  '---+---'           |    +---+---+---+---+
                    .-|--->|     delta     |
  lfsr_srid_t       | |    +---+---+---+---+
  .---+---+---+---. | | .->|      ptr      |
  |     delta     |-' | |  '---+---+---+---'
  '---+---+---+---'   | |
                      | |
  lfsr_cat_t          | |
  .---+---+---+---.   | |
  |c|size |-----------' |
  +---+---+---+---+     |
  |      ptr      |-----'
  '---+---+---+---'

The non-trivial mapping of lfsr_cat_t to lfsr_attr_t does mean a bit
more complexity on lfsr_cat_t access, but now that we've figured out the
compound-literal cost it seems the compiler is able to mostly elide
these.

The end result is some nice stack savings:

           code          stack
  before: 33812           2712
  after:  33756 (-0.2%)   2624 (-3.2%)
2024-05-09 18:01:10 -05:00
Christopher Haster
8aebb37b51 Apparently GCC just really hates compound literals
I've been fiddling around with our LFSR_ATTR macro to try to understand
why making it an inline function costs so much, and it seems like it's
not actually the inline function, but the compound literal that is the
problem. Specifically, returning a compound literal from an inline
function results in surprisingly poor code/stack costs!

I don't really know why this happens. Compiler bug/oversight related to
lvalues/rvalues? Compound literals interfering with RVO? Unsure.

I tried a few other struct initializers just in case it was related to
constness, but it seems the problem is the compound literal:

Inlined comp-lit:

  return (lfsr_attr_t){tag, delta, cat};

Inlined const comp-lit:

  return (const lfsr_attr_t){tag, delta, cat};

Inlined no-init:

  lfsr_attr_t attr;
  attr.tag = tag;
  attr.delta = delta;
  attr.cat = cat;
  return attr;

Inlined init:

  lfsr_attr_t attr = {tag, delta, cat};
  return attr;

Code/stack sizes:

                           code          stack
  macro (before):         33852           2776
  inline comp-lit:        34140 (+0.9%)   2760 (-0.6%)
  inline const comp-lit:  34140 (+0.9%)   2760 (-0.6%)
  inline no-init (after): 33812 (-0.1%)   2712 (-2.3%)
  inline init:            33812 (-0.1%)   2712 (-2.3%)

The good news is this at least offers a route forward for crammed 15-bit
attrs.

I guess we should also go reassess other uses of compound literals in
the codebase...
2024-05-09 18:00:58 -05:00
Christopher Haster
f7c3e2f7e1 Changed lfsr_cat_from* back to lfsr_data_from*, keep CAT macros
When we're not dealing with cats/attrs, it seems useful to still have
lfsr_data_from* functions that return our general purpose lfsr_data_t
type.

Added LFSR_CAT_DAT to cheaply convert from lfsr_data_t -> lfsr_cat_t
(at least for simple buffers), and used this to keep the LFSR_CAT_*
macros, which are useful for attr-list construction.

Unfortunately, because lfsr_data_t is 3-words vs lfsr_cat_t 2-words,
this does add both code and stack cost:

           code          stack
  before: 33672           2744
  after:  33852 (+0.5%)   2776 (+1.2%)

It's interesting to note this is _not_ because of any LFSR_CAT_* usage!
I tested this explicitly and lfsr_data_from* -> LFSR_CAT_DAT adds no
cost over the previous lfsr_cat_from* functions. A win for GCC. This
cost only comes from the direct usage of the returned lfsr_data_t types
in our grm handling and branch -> btree encoding.

Still it's an annoying cost... Maybe this should be reverted? The
nuances of lfsr_cat_t vs lfsr_data_t are a bit annoying.
2024-05-09 14:47:28 -05:00
Christopher Haster
d11106a898 Extended LFSR_CAT_* -> LFSR_CAT_*_ for implicit/explicit memory
So, for example, these are equivalent:

  lfsr_cat_t cat = LFSR_CAT_BPTR(bptr);

  uint8_t buf[LFSR_BPTR_DSIZE];
  lfsr_cat_t cat = LFSR_CAT_BPTR_(bptr, buf);

The first leads to more readable code, but of course sometimes you need
explicit memory allocations.

This replaces lfsr_cat_frombptr, etc, though those functions are still
available. This name change is more relevant for LFSR_CAT_DATA/DATAS,
which involve somewhat more complicated macros.
2024-05-09 14:16:31 -05:00
Christopher Haster
2383e7f9f9 Reverted most of the crammed 15-bit lfsr_attr_t representation
I think this was becoming too over-engineered. That and if you're
fighting the compiler too much, sometimes it's best to just let the
compiler win.

There is also the minor benefit of not needing to worry about uint16_t
overflows in lfsr_cat_t. One concern is it's easy for missed lfsr_data_t
indirections to go unnoticed.

This reverts lfsr_attr_t from 3 words -> 4 words, but allows easy
mapping to the lfsr_cat_t part of the lfsr_attr_t:

  lfsr_tag_t               lfsr_attr_t
  .---+---.                .---+---+---+---. . . .---+---+---+---.
  |  tag  |--------------->|  tag  |       |     |  tag  |       |
  '---+---'                +---+---+---+---+     +---+---+---+---+
                    .----->|     delta     |     |     delta     |
  lfsr_srid_t       |      +---+---+---+---+     +---+---+---+---+
  .---+---+---+---. | .--->|c|    size     |     |      cat      |
  |     delta     |-' |    +---+---+---+---+     |               |
  '---+---+---+---'   | .->|      ptr      |     |               |
                      | |  '---+---+---+---' . . '---+---+---+---'
  lfsr_cat_t          | |
  .---+---+---+---.   | |
  |c|    size     |---' |
  +---+---+---+---+     |
  |      ptr      |-----'
  '---+---+---+---'

This doesn't impact stack as much as you would expect due to compiler
overhead around inline functions operating on multi-word structs:

                 code          stack
  before 15-bit: 33672           2776
  best 15-bit:   33900 (+0.7%)   2736 (-1.4%)
  prev 15-bit:   34486 (+2.4%)   2912 (+4.9%)
  after:         33672 (+0.0%)   2744 (-1.2%)

I think the only big thing not reverted was switching lfsr_cat_t from
storing the sum of the lfsr_data_ts sizes to just storing the number of
lfsr_data_ts. This is cheaper to initialize (because inline functions
are weirdly costly!), cheaper to iterate over, and we don't actually
need lfsr_cat_size that often, unlike lfsr_data_size.

Eventually we'll probably need to revisit this to optimize 31/15-bit,
etc, littlefs configurations. But that's a problem for another day...
2024-05-09 14:16:31 -05:00
Christopher Haster
7e4d8c1df2 Attempted another implementation of crammed lfsr_attr_ts
The idea here was to carve out explicit space for the lfsr_attr_t tag in
lfsr_cat_t, so that when we create lfsr_attr_ts we don't need multiple
references to lfsr_cat_t and can revert back to a macro:

  lfsr_tag_t               lfsr_attr_t
  .---+---.                .---+---+---+---. . . .---+---+---+---.
  |  tag  |-------. .----->|     delta     |     |     delta     |
  '---+---'       | |      +---+---+---+---+     +---+---+---+---+
                  '-|-+--->|  tag  |c|size |     |      cat      |
  lfsr_srid_t       | |    +---+---+---+---+     |               |
  .---+---+---+---. | | .->|      ptr      |     |               |
  |     delta     |-' | |  '---+---+---+---' . . '---+---+---+---'
  '---+---+---+---'   | |
                      | |
  lfsr_cat_t          | |
  .---+---+---+---.   | |
  | (tag) |c|size |---' |
  +---+---+---+---+     |
  |      ptr      |-----'
  '---+---+---+---'

But as a part of creating the lfsr_attr_t we need to mutate the tag in
lfsr_cat_t, which still required a static inline function, and ended up
making the code/stack size even worse!

           code          stack
  before: 33900           2736
  after:  34486 (+1.7%)   2912 (+6.4%)

I was hoping to leverage the last-update rule of C99's designated
initializers, but this weird rule didn't quite work how I expected. The
compiler is free to omit earlier struct initializers, even if a later
initializer only partially initializes the struct.
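
Roughly what was attempted (names illustrative): initialize the whole
cat, then overwrite just the tag field with a nested designator:

  lfsr_attr_t attr = {
      .cat = lfsr_cat_frombptr(&bptr, buf),
      .cat.tag = tag,
  };
  // since .cat.tag overrides part of .cat, the compiler may skip
  // evaluating the earlier .cat initializer entirely, zeroing the
  // rest of .cat instead of merging the two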
2024-05-09 14:16:31 -05:00
Christopher Haster
2ee6750f2e Adopted crammed 15-bit size lfsr_attr_t representation
Another interesting observation about lfsr_cat_t: We rarely actually
need to represent the full range of data:

- lfsr_file_write:block - Block writes call lfsr_bd_prog directly, so
  they don't need to be represented as attrs.

- lfsr_file_write:fragment - Fragments already need the full
  concatenated representation because of, uh, potential concatenation.

- lfsr_file_sync:sprouts - Inlined sprouts may use the full range.
  Fortunately we can union an lfsr_data_t with the btree buffer, so
  no extra stack cost.

- lfsr_getuattr/sattr (planned) - User/sys attributes may use the full
  range. But these will probably not be on the stack hot-path.

Most attrs that use the simple buffer lfsr_cat_t representation are used
to encode internal structs, such as leb128s, ecksum, bptr, mptr, etc.
The largest of these right now is our bptr encoding, at 21 bytes, so
these easily fit in a short or a byte.

The choice of 15-bits (reserving one bit for cat/buf representation) is
convenient as it allows us to fit the cat size next to our 16-bit
lfsr_tag_t for free in 32-bit aligned systems:

  lfsr_attr_t
  .---+---+---+---.
  |  tag  |c|size |
  +---+---+---+---+
  |     delta     |
  +---+---+---+---+
  |      ptr      |
  '---+---+---+---'

Unfortunately, it seems like C _really_ wants to fight us on this one.

Not really because of the struct packing, but because of how we want
lfsr_attr_t to interact with lfsr_cat_t:

  lfsr_tag_t               lfsr_attr_t
  .---+---.                .---+---+---+---.
  |  tag  |-----------+--->|  tag  |c|size |
  '---+---'           |    +---+---+---+---+
                    .-|--->|     delta     |
  lfsr_srid_t       | |    +---+---+---+---+
  .---+---+---+---. | | .->|      ptr      |
  |     delta     |-' | |  '---+---+---+---'
  '---+---+---+---'   | |
                      | |
  lfsr_cat_t          | |
  .---+---+---+---.   | |
  |c|size |-----------' |
  +---+---+---+---+     |
  |      ptr      |-----'
  '---+---+---+---'

Initializing two fields with one argument is frustratingly impossible in
C99 unless you want to duplicate the argument tree, which we
_definitely_ don't want to do because this includes the actual data
encoding steps.

The only option is to use a static inline function. You might say "oh,
but static inline costs the same as a macro". But no. Switching to a
static inline function heavily penalizes this approach (6% of stack!):

           code          stack
  macro:  33672           2776
  inline: 34512 (+2.5%)   2952 (+6.3%)
  after:  33900 (+0.7%)   2736 (-1.4%)

At least this does result in net stack savings, even with the inline
function penalty.
2024-05-09 14:16:31 -05:00
Christopher Haster
e4069ee4fc Testing LFSR_ATTR as a static inline function
I'm committing this temporarily mostly just to record some _very_
interesting code/stack measurements.

This explores replacing the LFSR_ATTR macro with an inline function that
does the same thing:

  #define LFSR_ATTR(_tag, _delta, _cat) \
      ((const lfsr_attr_t){_tag, _delta, _cat})

vs:

  #define LFSR_ATTR(_tag, _delta, _cat) \
      lfsr_attr(_tag, _delta, _cat)

  static inline lfsr_attr_t lfsr_attr(
          lfsr_tag_t tag, lfsr_srid_t delta, lfsr_cat_t cat) {
      return (lfsr_attr_t){tag, delta, cat};
  }

The motivation for this is to eventually support more complex
lfsr_attr_t layouts. Specifically, it would be nice if we could break up
the lfsr_cat_t into separate size/ptr fields. Unfortunately we can't
declare temporaries in macros (I wish we had statement expressions), and
we really don't want to duplicate the entire cat tree, so an inline
function seems like the only way to accomplish this...

But static inline functions have the same cost as a macro you say?

No. This assumes a perfect compiler. And it's pretty unfair to compiler
developers to expect a perfect compiler.

To be fair, this is an extremely harsh test. We use LFSR_ATTR _heavily_,
which is why it's getting this much scrutiny. lfsr_attr also both takes
in a 2-word struct and returns a _4_-word struct, which probably makes
things messy.

Still, the results are concerning:

                  code          stack
  macro:         33672           2776
  inline:        34512 (+2.5%)   2952 (+6.3%)
  always_inline: 34512 (+2.5%)   2952 (+6.3%)
  noinline:      33920 (+0.7%)   2888 (+4.0%)

Measured with GCC 11 -mthumb -Os. I also measured with
__attribute__((always_inline/noinline)) just to see how that affected
things.
2024-05-09 14:16:31 -05:00
Christopher Haster
010c82475f Crammed LFSR_CAT_NAME to reuse the wasted word in the name lfsr_data_t
One annoying thing about lfsr_data_t is that when representing in-RAM
buffers, the last word of the struct goes completely unused. This
commit attempts to save this word of RAM in file names by forcibly
(hackily?) truncating the name's lfsr_data_t:

  .---+---+---+---. . . . . .---+---+---+---. . . . . .---+---+---+---.
  |0|  did_size   |         |0|  did_size   |         |    did_data   |
  +---+---+---+---+         +---+---+---+---+         |               |
  |     did_ptr ------.     |     did_ptr ------.     |               |
  +---+---+---+---+   |     +---+---+---+---+   |     |               |
  |    (unused)   |   |     |    (unused)   |   |     |               |
  +---+---+---+---+ . | . . +---+---+---+---+ . | . . +---+---+---+---+
  |0| name_size   |   |     |0| name_size   |   |     |   name_size   |
  +---+---+---+---+   |     +---+---+---+---+   |     |               |
  |    name_ptr   |   |     |    name_ptr   |   |     |               |
  +---+---+---+---+   |     +---+---+---+---+   |     |               |
  |    (unused)   |   |     |      did      | <-'     |               |
  +---+---+---+---+ . | . . |               | . . . . '---+---+---+---'
  |      did      | <-'     '---+---+---+---'
  |               |
  '---+---+---+---'

Curiously, this didn't seem to save RAM but saved a bit of code cost?

I guess this is because 1. file names, despite being very common, don't
occur on the stack hot-path that starts at lfsr_file_sync, and 2. while
we don't save stack cost, sometimes the reduced stack pressure can
reduce stack manipulation instructions:

           code          stack
  before: 33728           2776
  after:  33672 (-0.2%)   2776 (+0.0%)

If we look at the per-function stack cost, we can see the expected minor
stack savings in most functions, albeit outside of the stack hot-path:

  function           oframe  olimit  nframe  nlimit  dframe dlimit
  lfsr_file_open         16    2040      16    2032  +0 -8 (+0.0%, -0.4%)
  lfsr_rename           240    2128     232    2120  -8 -8 (-3.3%, -0.4%)
  lfsr_remove           176    2064     168    2056  -8 -8 (-4.5%, -0.4%)
  lfsr_file_opencfg     136    2024     128    2016  -8 -8 (-5.9%, -0.4%)
2024-05-09 14:16:31 -05:00
Christopher Haster
ca2d0b980c Dropped LFSR_CAT_CAT
With LFSR_CAT_DATAS for explicit arrays of datas, and the biggest use
of concatenated data being a rather explicit construction in
lfsr_file_carve, I don't think we really need LFSR_CAT_CAT.

The only non-hacky use was to define LFSR_CAT_NAME. But we know names
always use exactly 2 datas, so this might as well use LFSR_CAT_DATAS.

I am going to use this soapbox to complain a bit about compound struct
literals. Why do we need an array declaration to elevate temporary
structs to automatic storage duration? I wish you could init a compound
literal with the struct itself...

  ✗ &f()
  ✓ &(uint32_t){f()}
  ✓ (uint32_t[]){f()}

  ✗ &f()
  ✗ &(lfsr_data_t){f()}   :(
  ✓ (lfsr_data_t[]){f()}

Some hacky compound array literals were needed to replace the hacky
LFSR_CAT_CATs in lfsr_file_carve for this reason, but I guess it's a
hack for a hack so...

Code unchanged:

           code          stack
  before: 33728           2776
  after:  33728 (+0.0%)   2776 (+0.0%)
2024-05-09 14:16:31 -05:00
Christopher Haster
77bfcb69ad Ripped out attr-list data/buf allocators, hand allocated necessary state
This attr-list allocator stuff is becoming over-engineered; these
allocations really aren't that complex...

This may have gotten simpler after removing becksums, but if we need
that complexity again we can cross that bridge when we get to it.

Hand-allocating, dropping buf_size/data_count tracking, and refactoring
lfsr_file_carve to pre-encode the right sibling gives us a nice bit of
code/stack savings:

           code          stack
  before: 33796           2808
  after:  33728 (-0.2%)   2776 (-1.1%)
2024-05-09 14:16:31 -05:00
Christopher Haster
78b92cc954 Replaced attr-list datas/buf arrays with union
lfsr_data_t datas[d];   =>  union {
  lfs_size_t data_count;          lfsr_data_t data;
  uint8_t buf[b];                 uint8_t buf[b'];
  lfs_size_t buf_size;        } datas[d+b];
                              lfs_size_t data_count;

This trades off extra bookkeeping (data_count + buf_size vs data_count)
for less-tight stack overhead.

But this also saves a significant amount of RAM in lfsr_file_carve,
where we have exclusive fragments/bptrs for our left and right siblings.
So the end stack cost/savings mostly cancel out.

The end result seems like a net benefit for code cost:

           code          stack
  before: 33872           2816
  after:  33796 (-0.2%)   2808 (-0.3%)
2024-05-09 14:16:31 -05:00
Christopher Haster
0509fba9b9 Replaced attr-list arenas with three independent arrays
lfsr_attr_t attrs[a*d*b];  =>  lfsr_attr_t attrs[a];
  lfs_size_t attr_count;         lfs_size_t attr_count;
  lfs_size_t attr_scratch;       lfsr_data_t datas[d];
                                 lfs_size_t data_count;
                                 uint8_t buf[b];
                                 lfs_size_t buf_size;

This mostly reverts the allocator scaffolding needed for the attr-list
arenas (LFS_ALIGNOF, etc). This is the main draw of this change, as it
would be nice to avoid low-level arena implementation headaches unless
they prove to be worthwhile. Which they haven't really so far...

Unfortunately this comes with another code cost, I think due to the
number of counters needed to keep track of separate attr/data/buf
allocations. At least stack showed a slight improvement:

           code          stack
  before: 33844           2824
  after:  33872 (+0.1%)   2816 (-0.3%)
2024-05-09 14:16:31 -05:00
Christopher Haster
8f3036f1e5 Unified attr-list context into little attr arenas
The idea is, for cases where we need to incrementally allocate attrs +
context, to allocate from both sides of a statically allocated attr
array. This keeps all of the attr-list state in one place, simplifying
state allocation:

  .---+---+---+---.
  |      attr     |
  +---+---+---+---+
  |      attr ----------.
  +---+---+---+---+     |
  |      attr --------. |
  +---+---+---+---+   | |
  |       |       |   | |
  |       v       |   | |
  |               |   | |
  |       ^       |   | |
  |       |       |   | |
  +---+---+---+---+   | |
  |      data     | <-' |
  +---+---+---+---+     |
  |  encoded bptr | <---'
  '---+---+---+---'

This is especially useful for the non-terminating tail-recursive
lfsr_btree_commit_, which needs to pass this state through a function
call.

Unfortunately, to make this work we needed to implement more-or-less a
full arena allocator, complete with annoying alignment handling. alignof
isn't even available in C99, so we needed a few more intrinsics:

- LFS_ALIGNOF(t)       - Alignment of type t
- LFS_ALIGNEDSIZEOF(t) - Necessary size to force alignment for t
- LFS_MIN(a, b)        - Compile-time min
- LFS_MAX(a, b)        - Compile-time max

Technically only LFS_ALIGNOF was required, but the others are nice to
have. LFS_MIN/LFS_MAX are also useful anywhere you need to calculate
complicated compile-time sizes.
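
A sketch of plausible C99 definitions (littlefs's actual definitions
may differ; LFS_ALIGNOF here uses the classic offsetof trick and
requires <stddef.h>):

  // alignment of type t
  #define LFS_ALIGNOF(t) offsetof(struct {char c; t x;}, x)
  // bytes needed to guarantee an aligned t fits at any offset
  #define LFS_ALIGNEDSIZEOF(t) (sizeof(t) + LFS_ALIGNOF(t) - 1)
  // compile-time min/max
  #define LFS_MIN(a, b) ((a) < (b) ? (a) : (b))
  #define LFS_MAX(a, b) ((a) > (b) ? (a) : (b))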

At least in C11 we get alignof, so we won't need compiler extensions/
hacks for this in the future...

---

Unfortunately this ended up a net-negative, pushing up the code/stack
cost to near pre-cat levels:

                   code          stack
  before cat:     33856           2824
  before scratch: 33812 (-0.1%)   2800 (-0.8%)
  after:          33844 (-0.0%)   2824 (+0.0%)

I think the two main culprits are 1. the extra logic needed to calculate
alignment, and 2. wasted stack due to aligning scratch space up to the
nearest lfsr_attr_t.
2024-05-09 14:16:31 -05:00
Christopher Haster
88a098c616 Added lfsr_cat_t to represent concatenated data
So now, instead of one data type trying to do everything, we have two:

1. lfsr_data_t - Readable data, either in-RAM or on-disk

2. lfsr_cat_t - Concatenated data for progging, may be either a simple
   in-RAM buffer or an indirect list of lfsr_data_ts

This comes from an observation that most lfsr_attr_t datas were either
simple buffers, NULL, or required the indirect concatenated datas
anyways (concatenated file fragments). By separating lfsr_cat_t and
lfsr_data_t, maybe we can save RAM in lfsr_attr_t by not needing the
three words necessary for the less-common disk references.

Note the interesting tradeoff:

Simple in-RAM buffers/NULL decrease by 1 word (4 bytes):

  lfsr_data_t            lfsr_cat_t
  .---+---+---+---.      .---+---+---+---.
  |0|    size     |  =>  |0|    size     |
  +---+---+---+---+      +---+---+---+---+
  |      ptr      |      |      ptr      |
  +---+---+---+---+      '---+---+---+---'
  |    (unused)   |
  '---+---+---+---'
  '-------.-------'      '-------.-------'
      12 bytes                8 bytes

While on-disk references increase by 2 words (8 bytes):

  lfsr_data_t            lfsr_cat_t          lfsr_data_t
  .---+---+---+---.      .---+---+---+---.   .---+---+---+---.
  |1|    size     |  =>  |1|    size     | .>|1|    size     |
  +---+---+---+---+      +---+---+---+---+ | +---+---+---+---+
  |     block     |      |      ptr -------' |     block     |
  +---+---+---+---+      '---+---+---+---'   +---+---+---+---+
  |      off      |                          |      off      |
  '---+---+---+---'                          '---+---+---+---'
  '-------.-------'      '-----------------.-----------------'
      12 bytes                         20 bytes

Unless the on-disk references also need concatenation, in which case
this still saves 1 word (4 bytes).

Note I'm not sure this type split is generalizable to other systems. In
littlefs we can't use recursion, so progging concatenated datas already
required two nested functions, and we happen to never need to read
concatenated data, allowing us to completely omit that functionality. In
other systems, where maybe disk-reference attrs are more common, this
tradeoff may not make sense.

Some other things to note:

- We're also losing the inlined-data representation in this change.
  Unfortunately earlier lfsr_data_t measurements showed that this didn't
  really contribute much. It saved RAM in name attrs but added quite a
  bit of complexity to lfsr_data_t operations.

- By separating simple/cat and RAM/disk, we reduce the abused size bits
  from 2-bits down to 1-bit. This doesn't really matter for our current
  31/28-bit littlefs impl, but is nice in that it reenables the
  theoretical 31/31-bit littlefs impl without in-RAM data-structure
  changes.

There are a few temporary hacks that need to be figured out, but this is
already showing code/stack savings. Which is fascinating considering the
new lfsr_cat_* functions and increased temporary allocations:

           code          stack
  before: 33856           2824
  after:  33812 (-0.1%)   2800 (-0.8%)
2024-05-09 14:16:19 -05:00
Christopher Haster
3f11e6c4a1 Tweaked *_compact() arg order
This feels more correct.

*_compact() is notably inconsistent with *_commit/appendattrs(), but I
think this is more a case of attrs/attr_count being a special case.

Code changed; this seems to just be compiler noise. Arg order can affect
quite a bit:

           code          stack
  before: 33876           2824
  after:  33856 (-0.1%)   2824 (+0.0%)
2024-05-04 17:27:51 -05:00
Christopher Haster
580ff26024 Prefer unsigned ints when using sign-bit as a flag
Note this is slightly different than cases where we use the sign-bit for
muxing two different types, such as `int err` and `lfsr_srid_t rid`. In
those cases we'd never extract the lower bits of the int
unconditionally.

This leads to fewer casts and I think signals the intention of these
sign-bit-is-flag ints a bit better. We aren't really interpreting these
as signed, and mask out other bits in some cases (lfsr_data_t).
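
A small sketch of the sign-bit-is-flag pattern with unsigned words
(mirroring lfsr_data_t's mode bit, names illustrative):

  // testing the flag and extracting the lower bits needs no casts and
  // risks no signed-overflow UB when the word stays unsigned
  static inline bool lfsr_data_ondisk(lfsr_data_t data) {
      return data.size & 0x80000000;
  }

  static inline lfs_size_t lfsr_data_size(lfsr_data_t data) {
      return data.size & 0x7fffffff;
  }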

This leads to more code in places, I'm guessing because of C treating
signed overflow as undefined behavior... Maybe this is a good thing:

           code          stack
  before: 33856           2824
  after:  33876 (+0.1%)   2824 (+0.0%)
2024-05-04 17:27:36 -05:00
Christopher Haster
add985a3f4 Commented out old odd-parity lfs_crc table
We've been linking in this now-unused CRC table when we don't
need to be:

           code          stack
  before: 33976           2824
  after:  33856 (-0.4%)   2824 (+0.0%)

The only catch was its use in lfs_emubd to provide optional checksums
when debugging. But the actual checksum doesn't matter, so this can be
migrated to crc32c.
2024-05-04 17:27:27 -05:00
Christopher Haster
6ad4cd5168 Dropped *_hastrunk() functions
We can just rely on the truthiness of *_trunk() here.
2024-05-04 17:27:24 -05:00
Christopher Haster
0ed38211bf Made lfsr_shrub_t its own struct
This now properly encodes the different eoff/estimate field usage
between the two types.

In theory this could save some RAM, but we don't actually allocate
lfsr_shrub_t anywhere it's not unioned with lfsr_btree_t, so:

           code          stack
  before: 33976           2824
  after:  33976 (+0.0%)   2824 (+0.0%)
2024-05-04 17:27:14 -05:00
Christopher Haster
45a4e9ffb4 Moved lfsr_ecksum_t back into lfs.c
Now that becksums were proven to not work, we don't need this in lfs.h
anymore.
2024-05-04 17:27:10 -05:00
Christopher Haster
ab2a1cb571 Enabled erase=noop in test_rbyd, changed read* to error on leb128 overflow
Now that reproducibility issues with erase_value=-1 (erase=noop) are
fixed, this is much more useful to test than erase_value=0x1b.
Especially since erase=noop is filled with so many sharp corners.

These tests already found that we were being too confident with our
leb128/lleb128/tag parsing. Since we need to partially parse unfinished/
old commits, lfsr_dir_read* can easily encounter invalid leb128s during
normal operation. If this happens we should not assert.
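
A sketch of the shape of the fix (signature and names illustrative):
leb128 parsing should return an error the caller can treat as
end-of-commit, never assert:

  static int lfsr_readleb128(const uint8_t *buffer, size_t size,
          uint32_t *word) {
      uint32_t word_ = 0;
      for (size_t i = 0; i < size && i < 5; i++) {
          uint32_t b = buffer[i];
          // the 5th byte can only contribute 4 bits to a 32-bit word,
          // anything more is an invalid leb128, not a bug
          if (i == 4 && (b & 0xf0)) {
              return LFS_ERR_CORRUPT;
          }
          word_ |= (b & 0x7f) << (7*i);
          if (!(b & 0x80)) {
              *word = word_;
              return 0;
          }
      }
      // ran out of bytes mid-leb128
      return LFS_ERR_CORRUPT;
  }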

Doing things correctly has a bit of a cost:

           code          stack
  before: 33928           2824
  after:  33976 (+0.1%)   2824 (+0.0%)

At least we haven't seen any issues with our valid bit invalidating
logic yet.
2024-05-04 17:27:01 -05:00
Christopher Haster
5fbf073bfb Fixed issue in emubd where noop erases start out-of-sync with disk
Because reproducibility is extremely important, emubd always zeros
blocks on the first erase, even when erase_value=-1.

Well, at least it should be. We were correctly zeroing the blocks in
RAM, but if erase_value=-1 we were leaving the disk unzeroed, causing
the disk to fall out of sync.

Fixed by zeroing disk in lfs_emubd_createcfg, even if erase_value=-1.

Also I went ahead and dropped the bd->disk->scratch block. We're already
allocating RAM-backed blocks on erase anyways, so keeping scratch around
doesn't really gain us anything anymore. Now there is just a temporary
allocation in lfs_emubd_createcfg to zero the disk efficiently during
initialization.
2024-05-04 17:26:51 -05:00
Christopher Haster
b122a50b6c Trying to handle ecksums correctly when erased=>LFS_ERR_CORRUPT
It should be legal for block devices to return LFS_ERR_CORRUPT when
erased; this is common on devices with ECC, where the erased-state is
not valid ECC and results in LFS_ERR_CORRUPT.

If anything this is a better indicator than fixed-value erased-state,
but we need to make sure we track this with our ecksums consistently.
This gets a bit arbitrary.

Normally:

  valid = m[0] & 0x80
  cksum = crc32c(m)

If bd_read returns LFS_ERR_CORRUPT:

  valid = 0 & 0x80
  cksum = crc32c([])
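
In rough C, assuming a littlefs-style bd_read (a sketch, not the actual
implementation):

  int err = lfsr_bd_read(lfs, block, off, buffer, size);
  if (err && err != LFS_ERR_CORRUPT) {
      return err;
  }

  bool valid;
  uint32_t cksum;
  if (err == LFS_ERR_CORRUPT) {
      // treat corrupt erased-state as an empty, invalid commit
      valid = false;
      cksum = lfs_crc32c(0, NULL, 0);
  } else {
      valid = buffer[0] & 0x80;
      cksum = lfs_crc32c(0, buffer, size);
  }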

Yeah, implementing this gets a bit funky, but the code cost is trivial:

           code          stack
  before: 33924           2824
  after:  33928 (+0.0%)   2824 (+0.0%)

Note this is only best effort right now; we really need tests over
erased=>LFS_ERR_CORRUPT...
2024-05-04 17:26:35 -05:00
Christopher Haster
db85172211 Always calculate/show cksum in dbgblock.py
Now that most scripts show relevant cksums, it makes sense for
dbgblock.py to just always show a cksum as well. It's not like this has
any noticeable impact on the script's runtime.

Example:

  $ ./scripts/dbgblock.py disk -b4096 0
  block 0x0, size 4096, cksum e6e3ad25
  00000000: 01 00 00 00 80 03 00 08 6c 69 74 74 6c 65 66 73  ........littlefs
  00000010: 80 04 00 02 00 00 80 05 00 02 fa 01 80 09 00 04  ................
  ...
2024-05-04 17:26:29 -05:00
Christopher Haster
dbe503776d Added lfs_parity intrinsic
We're using parity a lot more than popc now (actually, now that we don't
use CTZ skip-lists, do we use popc at all?), so it makes sense to use
the compiler's __builtin_parity intrinsic when possible.

On some processors parity can be much cheaper than popc. Notably, the
8080 family just includes a parity flag in the set of carry flags that
are implicitly updated on most ALU operations. Though I think this
approach didn't scale, as you don't really see parity flags on most
>8-bit architectures...

Unfortunately, ARM thumb, our test arch, does not have a popc or parity
instruction. I guess that's because, thanks to implicit shifts in most
instructions, the tree-reduction solution is surprisingly cheap:

  ea80 4010   eor.w   r0, r0, r0, lsr #16
  ea80 2010   eor.w   r0, r0, r0, lsr #8
  ea80 1010   eor.w   r0, r0, r0, lsr #4
  ea80 00c0   eor.w   r0, r0, r0, lsr #2
  ea80 0050   eor.w   r0, r0, r0, lsr #1
  f000 0001   and.w   r0, r0, #1
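
In C, the same reduction makes a reasonable portable fallback (a
sketch; the intrinsic presumably prefers __builtin_parity when
available):

  static inline bool lfs_parity(uint32_t a) {
  #if defined(__GNUC__)
      return __builtin_parity(a);
  #else
      // tree reduction, mirroring the thumb disassembly above
      a ^= a >> 16;
      a ^= a >> 8;
      a ^= a >> 4;
      a ^= a >> 2;
      a ^= a >> 1;
      return a & 1;
  #endif
  }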

Both popc and parity benefit from this (GCC 11):

                 code
  __popcountsi2:   40
  __paritysi2:     32 (-20.0%)

So, thumb is not an arch where we see much benefit:

           code          stack
  before: 33908           2824
  after:  33924 (+0.0%)   2824 (+0.0%)

Not really sure where the +16 bytes come from, we removed several masks,
so I guess it's just bool vs int compiler noise?

Still, this may be useful for other archs with parity instructions/
hardware.
2024-05-04 17:25:50 -05:00
Christopher Haster
1c9cc63994 Adopted crc32c xor trick to avoid masking valid bits
Turns out these are equivalent:

  cksum' = crc32c([d & ~0x80], cksum)
  cksum' = crc32c([d], cksum ^ (d & 0x80))

Which is quite nice. The second form is a bit cheaper and works better
in situations where you may have an immutable buffer.
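
A quick standalone sketch demonstrating the equivalence with a naive
bitwise crc32c (not littlefs's actual lfs_crc32c):

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  // naive bitwise crc32c, reflected polynomial 0x82f63b78
  static uint32_t crc32c(uint32_t crc, const void *buffer, size_t size) {
      const uint8_t *b = buffer;
      while (size--) {
          crc ^= *b++;
          for (int i = 0; i < 8; i++) {
              crc = (crc >> 1) ^ ((crc & 1) ? 0x82f63b78 : 0);
          }
      }
      return crc;
  }

  int main(void) {
      uint8_t d = 0xe2;            // a tag byte with the valid bit set
      uint32_t cksum = 0x12345678; // some running checksum

      // form 1: mask out the valid bit before checksumming
      uint8_t masked = d & ~0x80;
      // form 2: xor the valid bit into the running checksum instead
      assert(crc32c(cksum, &masked, 1)
              == crc32c(cksum ^ (d & 0x80), &d, 1));
      return 0;
  }

Both forms xor the same low 7 bits of d into the running checksum, so
the remaining bit-shifting is identical.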

I took the long way to find this and may or may not have brute forced
an xor mask for the valid bit:

  crc32c(62 95 e3 fd 00) => c7844d4d
  crc32c(00 00 00 00 80) => c7844d4d

But this is equivalent to 00000080 after xoring in the init junk.

If you look at the naive lfs_crc32c impl, the first step is to xor the
first byte, so really xoring any byte will cancel it out of our crc32c.

Code changed; I thought this would save more because we can reuse bd
checksumming a bit better... Oh well, at least the theory works:

           code          stack
  before: 33916           2824
  after:  33908 (-0.0%)   2824 (+0.0%)
2024-05-04 17:25:35 -05:00
Christopher Haster
8a75a68d8b Made rbyd cksums erased-state agnostic
Long story short, rbyd checksums are now fully reproducible. If you
write the same set of tags to any block, you will end up with the same
checksum.

This is actually a bit tricky with littlefs's constraints.

---

The main problem boils down to erased-state. littlefs has a fairly
flexible model for erased-state, and this brings some challenges. In
littlefs, storage goes through 2 states:

1. Erase - Prepare storage for progging. Reads after an erase may return
   arbitrary, but consistent, values.

2. Prog - Program storage with data. Storage must be erased, with no
   progs attempted since the erase. Reads after a prog must return the
   new data.

Note in this model erased-state may not be all 0xffs, though it likely
will be for flash. This allows littlefs to support a wide range of
other storage devices: SD, RAM, NVRAM, encryption, ECC, etc.

But this model also means erased-state may be different from block to
block, and even different on later erases of the same block.

And if that wasn't enough of a challenge, _erased-state can contain
perfectly valid commits_. Usually you can expect arbitrary valid cksums
to be rare, but thanks to SD, RAM, etc, modeling erase as a noop, valid
cksums in erased-state is actually very common.

So how do we manage erased-state in our rbyds?

First we need some way to detect it, since we can't prog if we're not
erased. This is accomplished by the forward-looking erased-state cksum
(ecksum):

  .---+---+---+---.     \
  |     commit    |     |
  |               |     |
  |               |     |
  +---+---+---+---+     +-.
  |     ecksum -------. | | <-- ecksum - cksum of erased state
  +---+---+---+---+   | / |
  |     cksum --------|---' <-- cksum - cksum of commit,
  +---+---+---+---+   |                 including ecksum
  |    padding    |   |
  |               |   |
  +---+---+---+---+ \ |
  |     erased    | +-'
  |               | /
  .               .
  .               .

You may have already noticed the start of our problems. The ecksum
contains the erased-state, which is different per-block, and our rbyd
cksum contains the ecksum. We need to include the ecksum so we know if
it's valid, but this means our rbyd cksum changes block to block.

Solving this is simple enough: Stop the rbyd's canonical cksum before
the ecksum, but include the ecksum in the actual cksum we write to disk.

Future commits will need to start from the canonical cksum, so the old
ecksum won't be included in new commits, but this shouldn't be a
problem:

  .---+---+---+---. . . \ . \ . . . . .---+---+---+---.     \   \
  |     commit    |     |   |         |     commit    |     |   |
  |               |     |   +- rbyd   |               |     |   |
  |               |     |   |  cksum  |               |     |   |
  +---+---+---+---+     +-. /         +---+---+---+---+     |   |
  |     ecksum -------. | |           |     ecksum    |     .   .
  +---+---+---+---+   | / |           +---+---+---+---+     .   .
  |     cksum --------|---'           |     cksum     |     .   .
  +---+---+---+---+   |               +---+---+---+---+     .   .
  |    padding    |   |               |    padding    |     .   .
  |               |   |               |               |     .   .
  +---+---+---+---+ \ | . . . . . . . +---+---+---+---+     |   |
  |     erased    | +-'               |     commit    |     |   |
  |               | /                 |               |     |   +- rbyd
  .               .                   |               |     |   |  cksum
  .               .                   +---+---+---+---+     +-. /
                                      |     ecksum -------. | |
                                      +---+---+---+---+   | / |
                                      |     cksum ------------'
                                      +---+---+---+---+   |
                                      |    padding    |   |
                                      |               |   |
                                      +---+---+---+---+ \ |
                                      |     erased    | +-'
                                      |               | /
                                      .               .
                                      .               .
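
In rough sketch form (hypothetical names, assuming a crc32c helper),
each commit now juggles two cksums:

  #include <stdint.h>

  void commit_cksums(uint32_t *canonical, uint32_t *disk,
          const void *commit, uint32_t commit_size,
          const void *ecksum_tag, uint32_t ecksum_tag_size) {
      // the canonical cksum covers everything up to, but not
      // including, the ecksum
      *canonical = crc32c(*canonical, commit, commit_size);
      // the on-disk cksum continues from the canonical cksum, picking
      // up the ecksum tag so reads can validate it
      *disk = crc32c(*canonical, ecksum_tag, ecksum_tag_size);
      // future commits continue from `canonical`, so the now-stale
      // ecksum never leaks into later cksums
  }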

The second challenge is the pesky possibility of existing valid commits.
We need some way to ensure that erased-state following a commit does not
accidentally contain a valid old commit.

This is where our tags' valid bits come into play: The valid bit of
each tag must match the parity of all preceding tags (equivalent to the
parity of the crc32c), and we can use some perturb bits in the cksum tag
to make sure any tags in our erased-state do _not_ match:

  .---+---+---+---. \ . . . . . .---+---+---+---. \   \   \
  |v|    tag      | |           |v|    tag      | |   |   |
  +---+---+---+---+ |           +---+---+---+---+ |   |   |
  |     commit    | |           |     commit    | |   |   |
  |               | |           |               | |   |   |
  +---+---+---+---+ +-----.     +---+---+---+---+ +-. |   |
  |v|p|  tag      | |     |     |v|p|  tag      | | | |   |
  +---+---+---+---+ /     |     +---+---+---+---+ / | |   |
  |     cksum     |       |     |     cksum     |   | .   .
  +---+---+---+---+       |     +---+---+---+---+   | .   .
  |    padding    |       |     |    padding    |   | .   .
  |               |       |     |               |   | .   .
  +---+---+---+---+ . . . | . . +---+---+---+---+   | |   |
  |v---------------- != --'     |v------------------' |   |
  |     erased    |             +---+---+---+---+     |   |
  .               .             |     commit    |     |   |
  .               .             |               |     |   |
                                +---+---+---+---+     +-. +-.
                                |v|p|  tag      |     | | | |
                                +---+---+---+---+     / | / |
                                |     cksum ----------------'
                                +---+---+---+---+       |
                                |    padding    |       |
                                |               |       |
                                +---+---+---+---+       |
                                |v---------------- != --'
                                |     erased    |
                                .               .
                                .               .

New problem! The rbyd cksum contains the valid bits, which contain the
perturb bits, which depend on the erased-state!

And you can't just derive the valid bits from the rbyd's canonical
cksum. This avoids erased-state poisoning, sure, but then nothing in the
new commit depends on the perturb bits! The catch-22 here is that we
need the valid bits to both depend on, and ignore, the
erased-state-poisoned perturb bits.

As far as I can tell, the only way around this is to make the rbyd's
canonical cksum not include the parity bits. Which is annoying, as
masking out bits is not great for bulk cksum calculation...
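
To see why, a sketch of what the canonical cksum now has to do per tag
(the byte layout and VALID_MASK here are hypothetical, and assume a
crc32c helper):

  #include <stdint.h>

  #define VALID_MASK 0x80

  // fold one tag into the canonical cksum with its valid/parity bits
  // masked out
  uint32_t canonical_fold(uint32_t canonical,
          const uint8_t *tag, uint32_t tag_size) {
      uint8_t first = tag[0] & ~VALID_MASK;
      canonical = crc32c(canonical, &first, 1);
      canonical = crc32c(canonical, &tag[1], tag_size-1);
      // note we can no longer crc32c the whole commit in one bulk call
      return canonical;
  }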

But this does solve our problem:

  .---+---+---+---. \ . . . . . .---+---+---+---. \   \   \   \
  |v|    tag      | |           |v|    tag      | |   |   o   o
  +---+---+---+---+ |           +---+---+---+---+ |   |   |   |
  |     commit    | |           |     commit    | |   |   |   |
  |               | |           |               | |   |   |   |
  +---+---+---+---+ +-----.     +---+---+---+---+ +-. |   |   |
  |v|p|  tag      | |     |     |v|p|  tag      | | | |   .   .
  +---+---+---+---+ /     |     +---+---+---+---+ / | |   .   .
  |     cksum     |       |     |     cksum     |   | .   .   .
  +---+---+---+---+       |     +---+---+---+---+   | .   .   .
  |    padding    |       |     |    padding    |   | .   .   .
  |               |       |     |               |   | .   .   .
  +---+---+---+---+ . . . | . . +---+---+---+---+   | |   |   |
  |v---------------- != --'     |v------------------' |   o   o
  |     erased    |             +---+---+---+---+     |   |   |
  .               .             |     commit    |     |   |   +- rbyd
  .               .             |               |     |   |   |  cksum
                                +---+---+---+---+     +-. +-. /
                                |v|p|  tag      |     | | o |
                                +---+---+---+---+     / | / |
                                |     cksum ----------------'
                                +---+---+---+---+       |
                                |    padding    |       |
                                |               |       |
                                +---+---+---+---+       |
                                |v---------------- != --'
                                |     erased    |
                                .               .
                                .               .

Note that because each commit's cksum derives from the canonical cksum,
the valid bits and commit cksums no longer contain the same data, so our
parity(m) = parity(crc32c(m)) trick no longer works.

However our crc32c still does tell us a bit about each tag's parity, so
with a couple well-placed xors we can at least avoid needing two
parallel calculations:

  cksum' = crc32c(cksum, m)
  valid' = parity(cksum' xor cksum) xor valid

This also means our commit cksums don't include any information about
the valid bits, since we mask these out before cksum calculation. Which
is a bit concerning, but as far as I can tell not a real problem.
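
In C, this per-tag update might look something like the following
sketch (hypothetical names, assuming a crc32c helper; parity32 is just
a 32->1 bit xor fold):

  #include <stdint.h>
  #include <stdbool.h>

  static inline bool parity32(uint32_t x) {
      x ^= x >> 16;
      x ^= x >> 8;
      x ^= x >> 4;
      x ^= x >> 2;
      x ^= x >> 1;
      return x & 1;
  }

  // fold one tag into the running commit cksum while updating the
  // expected valid bit, xoring the old/new cksums isolates just this
  // tag's contribution to the parity
  static void commit_fold(uint32_t *cksum, bool *valid,
          const void *m, uint32_t m_size) {
      uint32_t cksum_ = crc32c(*cksum, m, m_size);
      *valid = parity32(cksum_ ^ *cksum) ^ *valid;
      *cksum = cksum_;
  }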

---

An alternative design would be to just keep track of two cksums: A
commit cksum and a canonical cksum.

This would be much simpler, but would also require storing two cksums in
RAM in our lfsr_rbyd_t struct. A bit annoying for our 4-byte crc32cs,
and a bit more than a bit annoying for hypothetical 32-byte sha256s.

It's also not entirely clear how you would update both crc32cs
efficiently. There is a way to xor out the initial state before each
tag, but I think it would still require O(n) cycles of crc32c
calculation...

As it is, the extra bit needed to keep track of commit parity is easy
enough to sneak into some unused sign bits in our lfsr_rbyd_t struct.
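
For example, a hypothetical sketch of this sort of bit-sneaking (not
the actual lfsr_rbyd_t layout):

  #include <stdint.h>
  #include <stdbool.h>

  typedef struct rbyd_sketch {
      // eoff is always < 2^31, so the sign bit is free real estate
      // for the commit parity
      uint32_t eoff;
  } rbyd_sketch_t;

  static inline bool rbyd_parity(const rbyd_sketch_t *rbyd) {
      return rbyd->eoff >> 31;
  }

  static inline uint32_t rbyd_eoff(const rbyd_sketch_t *rbyd) {
      return rbyd->eoff & 0x7fffffff;
  }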

---

I've also gone ahead and mixed the current commit parity into our
cksum's perturb bits, so the commit cksum at least contains _some_
information about the previous parity.

But it's not entirely clear this actually adds anything. Our perturb
bits aren't _required_ to reflect the commit parity, so a very unlucky
power-loss could in theory still make a cksum valid for the wrong
parity.

At least this situation will be caught by later valid bits...

I've also carved out a tag encoding, LFSR_TAG_PERTURB, solely for adding
more perturb bits to commit cksums:

  LFSR_TAG_CKSUM          0x3cpp  v-11 cccc -ppp pppp

  LFSR_TAG_CKSUM          0x30pp  v-11 ---- -ppp pppp
  LFSR_TAG_PERTURB        0x3100  v-11 ---1 ---- ----
  LFSR_TAG_ECKSUM         0x3200  v-11 --1- ---- ----
  LFSR_TAG_GCKSUMDELTA+   0x3300  v-11 --11 ---- ----

  + Planned

This allows for more than 7 perturb bits, and could even mix in the
entire previous commit cksum, if we ever think that is worth the RAM
tradeoff.

LFSR_TAG_PERTURB also has the advantage that it is validated by the
cksum tag's valid bit before being included in the commit cksum, which
indirectly includes the current commit parity. We may eventually want to
use this instead of the cksum tag's perturb bits for this reason, but
right now I'm not sure this tiny bit of extra safety is worth the
minimum 5-byte per commit overhead...

Note if you want perturb bits that are also included in the rbyd's
canonical cksum, you can just use an LFSR_TAG_SHRUBDATA tag. Or any
unreferenced shrub tag really.

---

All of these changes required a decent amount of code, I think mostly
just to keep track of the parity bit. But the isolation of rbyd cksums
from erased-state is necessary for several future-planned features:

           code          stack
  before: 33564           2816
  after:  33916 (+1.0%)   2824 (+0.3%)
2024-05-04 17:25:01 -05:00
Christopher Haster
c4fcc78814 Tweaked file types/name tag encoding to be a bit less quirky
The intention behind the quirky encoding was to leverage bit 1 to
indicate if the underlying file type would be backed by the common file
B-tree data structure. Looking forward, there may be several of these
types (compressed files, contiguous files, etc.) that for all intents
and purposes are just normal files interpreted differently.

But trying to leverage too many bits like this is probably going to give
us a sparse, awkward, and confusing tag encoding, so I've reverted to a
hopefully more normal encoding:

  LFSR_TAG_NAME           0x02tt  v--- --1- -ttt tttt

  LFSR_TAG_NAME           0x0200  v--- --1- ---- ----
  LFSR_TAG_REG            0x0201  v--- --1- ---- ---1
  LFSR_TAG_DIR            0x0202  v--- --1- ---- --1-
  LFSR_TAG_SYMLINK*       0x0203  v--- --1- ---- --11
  LFSR_TAG_BOOKMARK       0x0204  v--- --1- ---- -1--
  LFSR_TAG_ORPHAN         0x0205  v--- --1- ---- -1-1
  LFSR_TAG_COMPR*         0x0206  v--- --1- ---- -11-
  LFSR_TAG_CONTIG*        0x0207  v--- --1- ---- -111

  * Hypothetical

Note the carve-out for the hypothetical symlink tag. Symlinks are
actually incredibly low on the priority list, but they are also the
only current hypothetical file type that would need to be exposed to
users. Grouping these up makes sense.

This will get a bit messy if we ever end up with a 4th user-facing
type, but there isn't one in POSIX at least (ignoring non-fs types:
socket, fifo, character, block, etc).

The gap also helps line things up so reg/orphan are a single bit flip,
and the non-user-facing types all share a bit.
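
With this encoding, that shared bit makes distinguishing user-facing
types from internal ones a single mask. A hypothetical helper, assuming
the tag is already known to be a name tag:

  #include <stdint.h>
  #include <stdbool.h>

  // bookmark, orphan, compr, and contig all share bit 0x4
  static inline bool tag_isinternal(uint16_t tag) {
      return tag & 0x0004;
  }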

This had no impact on code size:

           code          stack
  before: 33564           2816
  after:  33564 (+0.0%)   2816 (+0.0%)
2024-05-04 17:24:48 -05:00
Christopher Haster
6e5d314c20 Tweaked struct tag encoding so b*/m* tags are earlier
These b*/m* struct tags have a common pattern that would be good to
emphasize in the encoding. The later struct tags get a bit more messy as
they leave space for future possible extensions.

New encoding:

  LFSR_TAG_STRUCT         0x03tt  v--- --11 -ttt ttrr

  LFSR_TAG_DATA           0x0300  v--- --11 ---- ----
  LFSR_TAG_BLOCK          0x0304  v--- --11 ---- -1rr
  LFSR_TAG_BSHRUB         0x0308  v--- --11 ---- 1---
  LFSR_TAG_BTREE          0x030c  v--- --11 ---- 11rr
  LFSR_TAG_MROOT          0x0310  v--- --11 ---1 --rr
  LFSR_TAG_MDIR           0x0314  v--- --11 ---1 -1rr
  LFSR_TAG_MSHRUB*        0x0318  v--- --11 ---1 1---
  LFSR_TAG_MTREE          0x031c  v--- --11 ---1 11rr
  LFSR_TAG_DID            0x0320  v--- --11 --1- ----
  LFSR_TAG_BRANCH         0x032c  v--- --11 --1- 11rr

  * Hypothetical

Note that all shrubs currently end with 1---, and all btrees, including
the awkward branch tag, end with 11rr.
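
Which makes these checks single-mask operations. Hypothetical helpers,
not necessarily the actual lfs.c internals:

  #include <stdint.h>
  #include <stdbool.h>

  // all shrubs end with 1---, all btrees (including the awkward
  // branch tag) end with 11rr
  static inline bool tag_isshrub(uint16_t tag) {
      return (tag & 0x000c) == 0x0008;
  }

  static inline bool tag_isbtree(uint16_t tag) {
      return (tag & 0x000c) == 0x000c;
  }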

This had no impact on code size:

           code          stack
  before: 33564           2816
  after:  33564 (+0.0%)   2816 (+0.0%)
2024-05-04 17:24:33 -05:00
Christopher Haster
5fa85583cd Dropped block-level erased-state checksums for RAM-tracked erased-state
Unfortunately block-level erased-state checksums (becksums) don't really
work as intended.

An invalid becksum _does_ signal that a prog has been attempted, but a
valid becksum does _not_ prove that a prog has _not_ been attempted.

Rbyd ecksums work, but only thanks to a combination of prioritizing
valid commits and the use of perturb bits to force erased-state changes.
It _is_ possible to end up with an ecksum collision, but only if you
1. lose power before completing a commit, and 2. end up with a
non-trivial crc32c collision. If this does happen, at the very least the
resulting commit will likely end up corrupted and thrown away later.

Block-level becksums, at least as originally designed, don't have either
of these protections. To make matters worse, the blocks these becksums
reference contain only raw user data. Write 0xffs into a file on flash,
where erased-state is usually all 0xffs, and you will likely end up with
a becksum collision!

This is a problem for a couple of reasons:

1. Progging multiple times to erased-state is likely to result in
   corrupted data, though this is also likely to get caught with
   validating writes.

   Worst case, the resulting data looks valid, but with weakened data
   retention.

2. Because becksums are stored in the copy-on-write metadata of the
   file, attempting to open a file twice for writing (or more advanced
   copy-on-write operations in the future) can lead to a situation where
   a prog is attempted on _already committed_ data.

   This is very bad and breaks copy-on-write guarantees.

---

So clearly becksums are not fit for purpose and should be dropped. What
can we replace them with?

The first option, implemented here, is RAM-tracked erased-state. Give
each lfsr_file_t its own eblock/eoff fields to track the last known good
erased-state. And before each prog, clear eblock/eoff so we never
accidentally prog to the same erased-state twice.
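
A sketch of the prog path under this scheme, with eblock/eoff as
described above and everything else hypothetical:

  #include <stdint.h>

  #define ERR_NOTERASED (-1) // hypothetical error code

  typedef struct {
      int32_t eblock; // last known good erased block, or -1
      int32_t eoff;   // offset where erased-state begins
  } file_sketch_t;

  int file_prog(file_sketch_t *file, int32_t block, int32_t off,
          const void *buffer, int32_t size) {
      // consume the tracked erased-state _before_ progging, so a
      // failed or interrupted prog can never leave us believing the
      // storage is still erased
      if (file->eblock != block || file->eoff != off) {
          return ERR_NOTERASED; // must erase a new block first
      }
      file->eblock = -1;
      file->eoff = -1;

      int err = bd_prog(block, off, buffer, size);
      if (err) {
          return err;
      }

      // on success, the known-good erased-state begins just past our
      // data
      file->eblock = block;
      file->eoff = off + size;
      return 0;
  }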

It's interesting to note we don't currently clear eblock/eoff in all
file handles. This is ok only because we don't currently share
eblock/eoff across file handles. Each eblock/eoff is exclusive to the
lfsr_file_t and does not appear anywhere else in the system.

The main downside of this approach is that, well, the RAM-tracked
erased-state is only tracked in RAM. Block-level erased-state
effectively does not persist across reboots. I've considered adding some
sort of per-file erased-state tracking to the mdir that would need to be
cleared before use, but such a mechanism ends up quite complicated.

At the moment, I think the best second option is to put erased-state
tracking in the future-planned bmap. This would let you opt-in to
on-disk tracking of all erased-state in the system.

One nice thing about RAM-tracked erased-state is that it's not on disk,
so it's not really a compatibility concern and won't get in the way of
additional future erased-state tracking.

---

Benchmarking becksums vs RAM-tracking has been quite interesting. While
in theory becksums can track much more erased-state, it's quite unlikely
anything but the most recent erased-state actually ends up used. The end
result is no real measurable performance loss, and actually a minor
speedup because we don't need to calculate becksums on every block
write.

There are some pathological cases, such as multiple write heads, but
these are out-of-scope right now (note! multiple explicit file handles
currently handle this case beautifully because we don't share
eblock/eoff!)

Becksums were also relatively complicated, and needed extra scaffolding
to pass around/propagate as secondary tags alongside the primary bptr.
So trading these for RAM-tracking also gives us a nice bit of code/stack
savings, albeit at a 2-word RAM cost in lfsr_file_t:

           code          stack          structs
  before: 33888           2864             1096
  after:  33564 (-1.0%)   2816 (-1.7%)     1104 (+0.7%)

  lfsr_file_t before: 104
  lfsr_file_t after:  112 (+7.7%)
2024-05-04 17:22:56 -05:00
Christopher Haster
38620675c3 Fixed overallocation of crc32c small-table array
Whoops. An unfortunate typo that went unnoticed. This should be a
64-byte (16-int) array, not a 64-int array.

A free 192 byte savings!
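
For reference, a small-table (nibble-at-a-time) crc32c looks something
like this sketch, where the table is 16 uint32_ts = 64 bytes, and the
typo made it 64 uint32_ts = 256 bytes:

  #include <stdint.h>
  #include <stddef.h>

  // 16-entry table for the reflected crc32c polynomial, 64 bytes
  static uint32_t crc32c_table[16];

  static void crc32c_init(void) {
      for (uint32_t i = 0; i < 16; i++) {
          uint32_t c = i;
          for (int j = 0; j < 4; j++) {
              c = (c & 1) ? ((c >> 1) ^ 0x82f63b78) : (c >> 1);
          }
          crc32c_table[i] = c;
      }
  }

  // two table lookups per byte, one per nibble
  uint32_t crc32c(uint32_t crc, const void *buffer, size_t size) {
      const uint8_t *data = buffer;
      for (size_t i = 0; i < size; i++) {
          crc ^= data[i];
          crc = crc32c_table[crc & 0xf] ^ (crc >> 4);
          crc = crc32c_table[crc & 0xf] ^ (crc >> 4);
      }
      return crc;
  }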

           code          stack
  before: 34076           2864
  after:  33888 (-0.6%)   2864 (+0.0%)
2024-04-28 13:21:46 -05:00
Christopher Haster
3e7a4e1d74 In dbgblock.py, renamed -x/--cksum -> -c/--cksum
I think this makes a bit more sense.

I think the original reasoning for -x/--cksum was to match -x/--device
in dbgrbyd.py, but that flag no longer exists. This could go all the way
back to matching --xsum at some point, but I'm not sure.

Common hash-related utils, sha256sum, md5sum, etc, use -c/--check to
validate their hash, so that's sort of prior art?
2024-04-28 13:21:46 -05:00