Commit Graph

1147 Commits

Author SHA1 Message Date
Christopher Haster
6d8eb948d1 Tweaked tracebd.py to prioritize progs over erases
Yes, erases are the more costly operation that we should highlight. But,
aside from broken code, you can never prog more than you erase.

This makes it more useful to prioritize progs over erases, so erases
without an overlaying prog show up as a relatively unique blue,
indicating regions of memory that have been erased but not progged.

Too many erased-but-not-progged regions indicate a potentially wasteful
algorithm.
2023-10-24 02:18:40 -05:00
Christopher Haster
d1e79bffc7 Renamed crystallize_size -> crystal_size
The original name was a bit of a mouthful.

Also dropped the default crystal_size in the test/bench runners from
block_size/4 -> block_size/8. I'm already noticing large amounts of
inflation when blocks are fragmented, though I am experimenting with a
rather small fragment_size right now.

Future benchmarking/experimentation is required to figure out good values
for these.
2023-10-23 12:27:44 -05:00
Christopher Haster
e25d11c33c Extended new "fragmenting" write strategy to file btrees
Note this is really just a proof of concept, and tests are not passing.
There are also a number of hacks holding everything together that really
need to be cleaned up.

I was hoping it would be possible to deduplicate the carveshrub/carvetree
functions the same way shrub/tree readnext functions were deduplicated.
These both share a lot of subtle logic, and in theory operate on minor
variations of the same underlying rbyd structure, but in practice
several issues get in the way:

- While the logic is the same, the way changes are played out is very
  different: btrees commit attributes to the btree immediately, whereas
  shrubs build up a bounded attr list to commit to the shrub via an mdir
  commit (see the sketch after this list).

  In theory shrubs could be committed immediately, but it would be
  wasteful. And btrees can't commit a bounded attribute list because 1.
  rm attrs may need to be split into an unbounded number across
  multiple rbyds, 2. fragmenting blocks may create an unbounded
  headache, and 3. attribute lists can't span multiple rbyds so we'd
  need to manually play them out anyways.

- We need to allocate a new btree in carvetree, but in carveshrub we
  defer allocation to mdir commit time (because of the potential for
  failed commits). This complicates things.

- The unions with sprouts/direct bptrs are often very similar, but need
  different handling when carving. This gets a bit tricky.

- In theory you could switch between building attrs for shrubs and
  immediate commits for btrees, but since the immediate commits _change
  the tree_, the carving math changes subtly.

- carveshrub needs to do several auxiliary things: track the shrub estimate,
  build attrs in RAM, etc. carvetree needs to do several auxiliary
  things: dereference bptrs, fragment bptrs, allocate new btrees, etc.
  If these can be deduplicated it would likely result in code savings,
  but also risks increased RAM costs from trying to do too many things
  at once.

  The cost of two functions may also be more cognitive than real, since
  the subtlety here is just math. And computers happen to be pretty
  good at math.

  Though this concern may be unfounded, and deduplicating these functions
  is still an enticing and interesting idea to explore.
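
To make the first difference concrete, here's a rough contrast sketch
with hypothetical names (not the actual littlefs internals): btrees can
apply each attr to the on-disk rbyd immediately, while shrubs queue a
bounded attr list to be played out by a later mdir commit.

  #include <stddef.h>

  struct attr {int tag; int weight;};

  // stand-in for an immediate rbyd commit
  static int rbyd_commit(const struct attr *attrs, size_t count) {
      (void)attrs;
      (void)count;
      return 0;
  }

  // btree path: commit right away, the tree changes underneath us
  static int btree_carve_step(struct attr attr) {
      return rbyd_commit(&attr, 1);
  }

  // shrub path: append to a bounded list, nothing changes until the
  // mdir commit plays the attrs out
  static int shrub_carve_step(struct attr *attrs, size_t *count,
          size_t cap, struct attr attr) {
      if (*count >= cap) {
          return -1; // the attr list must stay bounded
      }
      attrs[(*count)++] = attr;
      return 0;
  }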

I've already noticed some concerning performance once a write exceeds
our crystallization threshold. This makes sense, as our current strategy
is to completely rewrite any data region over our crystallization
threshold. But I wonder if there's a way to exclude the first block in
our region from the crystallization heuristic...

Anyways, some good progress here, but more work to be done.
2023-10-21 22:51:48 -05:00
Christopher Haster
c815c19c20 New "fragmenting" write strategy
The attempt to implement in-rbyd data slicing, being lazily coalesced
during rbyd compaction, failed pretty much completely.

Slicing is a very enticing write strategy, getting both minimal overhead
post-compaction and fast random write speeds, but the idea has some
fundamental conflicts with how we play out attrs post-compaction.

This idea might work in a more powerful filesystem, but brings back the
need to simulate rbyds in RAM, which is something I really don't want to
do (complex, bug-prone, likely adds code cost, may not even be tractable).

So, third time's the charm?

---

This new write strategy writes only datas and bptrs, and avoids dagging
by completely rewriting any regions of data larger than a configurable
crystallization threshold.

This loses most of the benefits of data crystallization: random writes
will now usually need to rewrite a full block. But as a tradeoff, our
data at rest is always stored with optimal overhead.

And at least data crystallization still saves space when our data isn't
block aligned, or in sparse files. From reading up on some other
filesystem designs it seems this is a desirable optimization sometimes
referred to as "tail-packing" or "block suballocation".

Some other changes from just having more time to think about the
problem:

1. Instead of scanning to figure out our current crystal size, we can
   use a simple heuristic of 1. look up left block, 2. look up right
   block, 3. assume any data between these blocks contributes to our
   current crystal.

   This is just a heuristic, so in the worst case you could write just
   the first and last byte of a block, which is enough to trigger
   compaction into a block.
   But on the plus side this avoids issues with small holes preventing
   blocks from being formed.

   This approach brings the number of btree lookups down from
   O(crystallize_size) to 2.

2. I've gone ahead and dropped the previous scheme of coalesce_size
   + fragment_size and instead adopted a single fragment_size that
   controls the size of, well, fragments, i.e. data elements stored
   directly in trees.

   This affects both the inlined shrub as well as fragments stored in
   the inner nodes of the btree. I believe it's very similar to what is
   often called "pages" in logging filesystems, though I'm going to
   avoid that term for now because it's a bit overloaded.

   Previously, neighboring writes that, when combined, would exceed our
   coalesce_size just weren't combined. Now they are combined up to our
   fragment_size, potentially splitting the right fragment.

   Before (fragment_size=8):

     .---+---+---+---+---+---+---+---.
     |            8 bytes            |
     '---+---+---+---+---+---+---+---'
                         +
                         .---+---+---+---+---.
                         |      5 bytes      |
                         '---+---+---+---+---'
                         =
     .---+---+---+---+---+---+---+---+---+---.
     |      5 bytes      |      5 bytes      |
     '---+---+---+---+---+---+---+---+---+---'

   After:

     .---+---+---+---+---+---+---+---.
     |            8 bytes            |
     '---+---+---+---+---+---+---+---'
                         +
                         .---+---+---+---+---.
                         |      5 bytes      |
                         '---+---+---+---+---'
                         =
     .---+---+---+---+---+---+---+---+---+---.
     |            8 bytes            |2 bytes|
     '---+---+---+---+---+---+---+---+---+---'

   This leads to better fragment alignment (much like our block
   strategy), and minimizes tree overhead.

   Any neighboring data to the right is only coalesced if it fits in the
   current fragment, or would be rewritten (carved) anyways, to avoid
   unnecessary data rewriting.

   For example (fragment_size=8):

     .---+---+---+---+---+---+---+---+---+---+---+---+---+---.
     |        6 bytes        |        6 bytes        |2 bytes|
     '---+---+---+---+---+---+---+---+---+---+---+---+---+---'
                                 +
                         .---+---+---+---+---.
                         |      5 bytes      |
                         '---+---+---+---+---'
                                 =
     .---+---+---+---+---+---+---+---+---+---+---+---+---+---.
     |            8 bytes            |    4 bytes    |2 bytes|
     '---+---+---+---+---+---+---+---+---+---+---+---+---+---'
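
   To make the re-splitting math concrete, here's a small standalone
   sketch (illustrative only; frag_size plays the role of fragment_size):

     #include <stdint.h>
     #include <stdio.h>

     // split a coalesced region [start,end) into fragments, aligned
     // relative to the left fragment's start
     static void split_fragments(uint32_t start, uint32_t end,
             uint32_t frag_size) {
         while (start < end) {
             uint32_t size = end - start;
             if (size > frag_size) {
                 size = frag_size;
             }
             printf("fragment: off=%u size=%u\n", start, size);
             start += size;
         }
     }

     int main(void) {
         // the 8-byte fragment above coalesced with the 5-byte write
         // covers [0,10), which re-splits into 8 bytes + 2 bytes
         split_fragments(0, 10, 8);
         return 0;
     }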

Other than these changes, this commit is mostly a bunch of carveshrub
rewriting again, which continues to be nuanced and annoying to get
bug-free.
2023-10-21 22:05:46 -05:00
Christopher Haster
907c24beeb Renamed a number of things related to shrubs/trees
- -> lfsr_shrub_t
- -> lfsr_tree_t

The idea here is to adopt "shrub" as an umbrella term for the
shrub/sprout union, and "tree" as an umbrella term for the bptr/btree
union. I think this is a bit better than calling shrub/sprout "inlined"
which is a _very_ overloaded term in this codebase (inlined in the tree?
the mdir? inlined in the C struct?).
2023-10-19 21:18:44 -05:00
Christopher Haster
2940555caa Attempted to implement slice dereferencing
But already there are some pretty fundamental problems.

The main issue is that, while we correctly dereference slices during
compaction, pending commits that get delayed after compaction still
point to the old block. I'm not sure there's an easy way around this
aside from aborting compaction commits or fully simulating commits,
both of which seem too costly to implement...

Coalescing during compaction is flawed as well, since our attributes
will be outdated by the time they are committed if there is a
compaction...

Looks like it's back to the drawing board. Either our approach to
compaction needs to change, or this slice/coalescing work needs to be
reverted/redesigned...
2023-10-19 01:05:22 -05:00
Christopher Haster
865477d7e1 Changing coalesce strategy, reimplemented shrub/btree carve
Note this is already showing better code reuse, which is a good sign,
though maybe that's just the benefit of reimplementing similar logic
multiple times.

Now both reading and carving end up in the same lfsr_btree_readnext and
lfsr_btree_buildcarve functions for both btrees and shrubs. Both btrees
and shrubs are fundamentally rbyds, so we can share a lot of
functionality as long as we redirect to the correct commit function at
the last minute. This surprising opportunity for deduplication was
noticed while putting together the dbg scripts.
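
Roughly, the redirect idea looks something like this (hypothetical names,
just a sketch of the plan below):

  #include <stddef.h>

  struct attr {int tag; int weight;};

  // the final commit step is the only thing that differs: btrees commit
  // to the rbyd directly, shrubs queue attrs for a later mdir commit
  typedef int (*commit_fn)(void *ctx,
          const struct attr *attrs, size_t count);

  static int buildcarve(void *ctx, commit_fn commit,
          struct attr *attrs, size_t count) {
      // ...shared readnext/carve math would build up attrs here...
      return commit(ctx, attrs, count);
  }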

Planned logic (not actual function names):

  lfsr_file_readnext -> lfsr_shrub_readnext
            |                    |
            |                    v
            '---------> lfsr_btree_readnext

  lfsr_file_flushbuffer -> lfsr_shrub_carve ------------.
            .---------------------'                     |
            v                                           v
  lfsr_file_flushshrub  -> lfsr_btree_carve -> lfsr_btree_buildcarve

The btree part of the above is only hypothetical at the moment, though.
Not even the shrubs can survive compaction right now.

The reason is the new SLICE tag which needs low-level support in rbyd
compact. SLICE introduces indirect references to data located in the same
rbyd, which removes any copying cost associated with coalescing.
Previously, a large coalesce_size risked O(n^2) runtime when
incrementally appending small amounts of data, but with SLICEs we can defer
coalescing to compaction time, where the copy is effectively free.

This compaction-time-coalescing is also hypothetical, which is why our
tests are failing. But the theory is promising.

I was originally against this idea because of how it crosses abstraction
layers, requiring some very low-level code that absolutely can not be
omitted in a simpler littlefs driver. But after working on the actual
file writing code for a while I've become convinced the tradeoff is
worth it.

Note coalesce_size will likely still need to be configurable. Data in
fragmenting/sparse btrees is still susceptible to coalescing, and the
impact of internal fragmentation isn't clear when data sizes approach
the hard block_size/2 limit.
2023-10-17 23:21:18 -05:00
Christopher Haster
fce1612dc0 Reverted to separate BTREE/BRANCH encodings, reordered on-disk structs
My current thinking is that these are conceptually different types, with
BTREE tags representing the entire btree, and BRANCH tags representing
only the inner btree nodes. We already have multiple btree tags anyways:
btrees attached to files, the mtree, and in the future maybe a bmaptree.

Having separate tags also makes it possible to store a btree in a btree,
though I don't think we'll ever use this functionality.

This also removes the redundant weight field from branches. The
redundant weight field is only a minor cost relative to storage, but it
also takes up a bit of RAM when encoding. Though measurements show this
isn't really significant.

New encodings:

  btree encoding:        branch encoding:
  .---+- -+- -+- -+- -.  .---+- -+- -+- -+- -.
  | weight            |  | blocks            |
  +---+- -+- -+- -+- -+  '                   '
  | blocks            |  '                   '
  '                   '  +---+- -+- -+- -+- -+
  '                   '  | trunk             |
  +---+- -+- -+- -+- -+  +---+- -+- -+- -+- -'
  | trunk             |  |     cksum     |
  +---+- -+- -+- -+- -'  '---+---+---+---'
  |     cksum     |
  '---+---+---+---'

Code/RAM changes:

            code          stack
  before:  30836           2088
  after:   30944 (+0.4%)   2080 (-0.4%)

Also reordered other on-disk structs with weight/size, so such structs
always have weight/size as the first field. This may enable some
optimizations around decoding the weight/size without needing to know
the specific type in some cases.
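
For example, assuming these fields are leb128-encoded (as elsewhere in
the design), a reader could decode just the leading weight/size without
knowing the specific struct. A minimal sketch:

  #include <stdint.h>
  #include <stddef.h>

  // decode one unsigned leb128 (up to 32 bits), returning the number of
  // bytes consumed, or -1 if truncated/overlong
  static int leb128_decode(const uint8_t *buf, size_t len, uint32_t *value) {
      uint32_t v = 0;
      for (size_t i = 0; i < len && i < 5; i++) {
          v |= (uint32_t)(buf[i] & 0x7f) << (7*i);
          if (!(buf[i] & 0x80)) {
              *value = v;
              return (int)i + 1;
          }
      }
      return -1;
  }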

---

This change shouldn't have affected functionality, but it revealed a bug
in a dtree test, where a did gets caught in an mdir split and the split
name makes the did unreachable.

Marking this as a TODO for now. The fix is going to be a bit involved
(fundamental changes to the opened-mdir list), and similar work is
already planned to make removed files work.
2023-10-15 14:53:07 -05:00
Christopher Haster
1d5946b5ea Renamed mblocks -> mptr
Since we internally need a bptr type (a block pointer, which is a bit
more complicated than just a single address), calling our mdir pairs
mptrs makes sense.
2023-10-14 14:11:20 -05:00
Christopher Haster
173de4388b Added file tags to rendering of inner tree tags in dbglfs.py
Now -i/--inner will also show the file tags that reference the
underlying data structure.

The difference is subtle but useful:

  littlefs v2.0 0x{0,1}.eee, rev 315, weight 0.256, bd 4096x262144
  {0000,0001}:  -1.1 hello  reg 8192, btree 0x5121.d50 8143
    0000.0efc:       +          0-8142 btree w8143 11             ...
    5121.0d50:       | .-+      0-4095 block w4096 6              ...
                     | | '->    0-4095 block w4096 0x5117.0 4096  ...
                     '-+-+   4096-8142 block w4047 6              ...
                         '-> 4096-8142 block w4047 0x5139.0 4047  ...
2023-10-14 04:47:25 -05:00
Christopher Haster
fbb6a27b05 Changed crystallization strategy in btrees to rely on coalescing
This is a pretty big rewrite, but is necessary to avoid "dagging".

"Dagging" (I just made this term up) is when you transform a pure tree
into a directed acyclic graph (DAG). Normally DAGs are perfectly fine in
a copy-on-write system, but in littlefs's case, it creates havoc for
future block allocator plans, and its interaction with parity blocks
raises some uncomfortable questions.

How does dagging happen?

Consider an innocent little btree with a single block:

  .-----.
  |btree|
  |     |
  '-----'
     |
     v
  .-----.
  |abcde|
  |     |
  '-----'

Say we wanted to write a small amount of data in the middle of our
block. Since the data is so small, the previous scheme would simply
inline the data, carving the left and right sibling (in this case the
same block) to make space:

    .-----.
    |btree|
    |     |
    '-----'
    .' v '.
    |  c' |
    '.   .'
     v   v
    .-----.
    |ab de|
    |     |
    '-----'

Oh no! A DAG!

With the potential for multiple pointers to reference the same block in
our btree, some invariants break down:

- Blocks no longer have a single reference
- If you remove a reference you can no longer assume the block is free
- Knowing when a block is free requires scanning the whole btree
- This split operation effectively creates two blocks, does that mean
  we need to rewrite parity blocks?

---

To avoid this whole situation, this commit adopts a new crystallization
algorithm.

Instead of allowing crystallization data to be arbitrarily fragmented,
we eagerly coalesce any data under our crystallization threshold, and if
we can't coalesce, we compact everything into a block.

Much like a Knuth heap, simply checking both siblings to coalesce has
the effect that any data will always coalesce up to the maximum size
where possible. And when checking for siblings, we can easily find the
block alignment.
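
One way to read the sibling check (illustrative only, the real code also
has to worry about alignment and carving):

  #include <stdint.h>

  enum action {COALESCE_LEFT, COALESCE_RIGHT, COMPACT_TO_BLOCK};

  // eagerly coalesce with a sibling while under the crystallization
  // threshold, otherwise give up and compact everything into a block
  static enum action decide(uint32_t left, uint32_t size, uint32_t right,
          uint32_t crystallize_size) {
      if (left + size <= crystallize_size) {
          return COALESCE_LEFT;
      } else if (size + right <= crystallize_size) {
          return COALESCE_RIGHT;
      } else {
          return COMPACT_TO_BLOCK;
      }
  }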

This also has the effect of always rewriting blocks if we are writing a
small amount of data into a block. Unfortunately I think this is just
necessary in order to avoid dagging.

At the very least crystallization is still useful for files not quite
block aligned at the edges, and sparse files. This also avoids concerns
of random writes inflating a file via sparse crystallization.
2023-10-14 01:25:41 -05:00
Christopher Haster
a81691744a Reworked lfsr_file_read a bit
- Merged lfsr_file_read_ back into lfsr_file_read; I don't think we need
  stateless reads in the end.

- Tweaked reads to use conservative hints instead of just filling all
  cache lines with whatever is in the retrieved datas.

- Switched to if/else for sprout/shrub and bptr/btree checks. Though
  this had no effect on code size, which isn't too surprising.
2023-10-14 01:25:31 -05:00
Christopher Haster
57aa513163 Tweaked debug prints to show more information during mount
Now when you mount littlefs, the debug print shows a bit more info:

  lfs.c:7881:debug: Mounted littlefs v2.0 0x{0,1}.c63 w43.256, bd 4096x256

To disassemble this a bit:

  littlefs v2.0 0x{0,1}.c63 w43.256, bd 4096x256
            ^ ^   '-+-'  ^   ^   ^        ^   ^
            '-|-----|----|---|---|--------|---|-- major version
              '-----|----|---|---|--------|---|-- minor version
                    '----|---|---|--------|---|-- mroot blocks
                         |   |   |        |   |   (1st is active)
                         '---|---|--------|---|-- mroot trunk
                             '---|--------|---|-- mtree weight
                                 '--------|---|-- mleaf weight
                                          '---|-- block size
                                              '-- block count

dbglfs.py also shows the block device geometry now, as read from the
mroot:

  $ ./scripts/dbglfs.py disk -B4096
  littlefs v2.0 0x{0,1}.c63, rev 1, weight 43.256, bd 4096x256
  ...

This may be over-optimizing for testing, but the reason the mount debug
is only one line is to avoid slowing down/cluttering test output. Both
powerloss testing and remounts completely fill the output with mount
prints that aren't actually all that useful.

Also switched to preferring parens in debug info, mainly for mismatched
things.
2023-10-14 01:25:26 -05:00
Christopher Haster
5ecd6d59cd Tweaked config and gstate reprs in dbglfs.py to be more readable
Mainly aligning things; it was easy for the previous repr to become a
visual mess.

This also represents the config more like how we represent other tags,
since they've changed from a monolithic config block to separate
attributes.
2023-10-14 01:25:20 -05:00
Christopher Haster
b936e33643 Tweaked dbg scripts to resize tag repr based on weight
This is a compromise between padding the tag repr correctly and parsing
speed.

If we don't have to traverse an rbyd (for, say, tree printing), we don't
want to since parsing rbyds can get quite slow when things get big
(remember this is a filesystem!). This makes tag padding a bit of a hard
sell.

Previously this was hardcoded to 22 characters, but with the new file
struct printing it quickly became apparent this would be a problematic
limit:

  12288-15711 block w3424 0x1a.0 3424  67 64 79 70 61 69 6e 71  gdypainq

It's interesting to note that this has only become an issue for large
trees, where the weight/size in the tag can be arbitrarily large.

Fortunately we already have the weight of the rbyd after fetch, so we
can use a heuristic similar to the id padding:

  tag padding = 21 + nlog10(max(weight,1)+1)
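
In C the heuristic works out to something like this (the script itself is
Python, this just illustrates the math, assuming nlog10 means the number
of decimal digits):

  #include <stdint.h>

  // number of decimal digits needed to print x
  static unsigned nlog10(uint32_t x) {
      unsigned n = 0;
      do {
          n += 1;
          x /= 10;
      } while (x);
      return n;
  }

  // tag padding heuristic; weight <= 1 gives the old fixed padding of 22
  static unsigned tag_padding(uint32_t weight) {
      return 21 + nlog10(((weight > 1) ? weight : 1) + 1);
  }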

---

Also dropped extra information with the -x/--device flag. It hasn't
really been useful and was implemented inconsistently. Maybe -x/--device
should just be dropped completely...
2023-10-14 01:25:14 -05:00
Christopher Haster
c8b60f173e Extended dbglfs.py to show file data structures
You can now pass -s/--structs to dbglfs.py to show any file data
structures:

  $ ./scripts/dbglfs.py disk -B4096 -f -s -t
  littlefs v2.0 0x{0,1}.9cf, rev 3, weight 0.256
  {0000,0001}:  -1.1 hello  reg 128, trunk 0x0.993 128
    0000.0993:           .->    0-15 shrubinlined w16 16     6b 75 72 65 65 67 73 63  kureegsc
                       .-+->   16-31 shrubinlined w16 16     6b 65 6a 79 68 78 6f 77  kejyhxow
                       | .->   32-47 shrubinlined w16 16     65 6f 66 75 76 61 6a 73  eofuvajs
                     .-+-+->   48-63 shrubinlined w16 16     6e 74 73 66 67 61 74 6a  ntsfgatj
                     |   .->   64-79 shrubinlined w16 16     70 63 76 79 6c 6e 72 66  pcvylnrf
                     | .-+->   80-95 shrubinlined w16 16     70 69 73 64 76 70 6c 6f  pisdvplo
                     | | .->  96-111 shrubinlined w16 16     74 73 65 69 76 7a 69 6c  tseivzil
                     +-+-+-> 112-127 shrubinlined w16 16     7a 79 70 61 77 72 79 79  zypawryy

This supports the same -b/-t/-i options found in dbgbtree.py, with the
one exception being -z/--struct-depth which is lowercase to avoid
conflict with the -Z/--depth used to indicate the filesystem tree depth.

I think this is a surprisingly reasonable way to show the inner
structure of files without clobbering the user's console with file
contents.

Don't worry, if clobbering is desired, -T/--no-truncate still dumps all
of the file content.

Though it's still up to the user to manually apply the sprout/shrub
overlay. That step is still complex enough that it's not implemented in
this tool yet.

2023-10-14 01:25:08 -05:00
Christopher Haster
66e6ce4bfb Enabled no-coalescing file tests, fixed sprout->shrub transition bug
Oh hey, it's that piece of complexity I was worried about.

The problem was that the position calculation for new appended
right_data depended on left_overlap, which fell out of sync when
transitioning from sprout->shrub.

The fix here is to keep left_overlap/right_overlap up to date with the
model that the sprout->shrub transition is effectively doing a
shrub-wide rm first.

Hacky, but hopefully avoids bugs in the future by keeping all of these
variables in a reasonable state...

There may be a simpler way to think about how this code should function,
but I just can't see it. This may deserve a rewrite in the future.
2023-10-14 01:25:01 -05:00
Christopher Haster
92e1fafbc4 Merged sprout and shrub carving paths
Noticed a lot of duplicate conditions, so tried merging these two code
paths. This does risk a difficult-to-read/maintain function, since there
are some rather tricky subtleties with the sprout -> shrub transition.
On the other hand, the code reuse does mean fewer conditions to worry
about.

Merging these code paths also saves a bit of code:

           code          stack
  before: 30960           2256
  after:  30700 (-0.8%)   2256 (+0.0%)
2023-10-14 01:24:50 -05:00
Christopher Haster
addaa8fe3e Implemented data coalescing in sprout->shrub conversion
Note we still end up with a shrub, even if the file could revert back to
a sprout. This is just a simplification for the inlined file logic. We
never implicitly revert to a sprout.
2023-10-14 01:22:54 -05:00
Christopher Haster
e43b4c7d9a Implemented data coalescing in carveinlined, though it is a bit hacky
The hacky part is how we interact with the scratch datas array in
multiple places. This code isn't generalizable.
2023-10-14 01:22:27 -05:00
Christopher Haster
da5b6c0751 Reworked lfsr_file_carveinlined a bit, prefer no rm tag where possible
This mostly figures out how things might work with coalescing, without
fully implementing coalescing yet.

One noteworthy thing: previously, when carving right data, we would
remove the right data and rewrite it. This was to accommodate implicit
splits:

  buf:         [bbbb]
  shrub:     [llrrrrrr]

  1. rm      [ll]
  2. append  [llrr]
  3. append  [llbbbbrr]

An implicit split being when the left sibling and right sibling are the
same data:

  buf:         [bbbb]
  shrub:     [llllllll]

  1. carve   [ll]
  2. append  [llbbbb]
  3. append  [llbbbbll]

By separating out the split logic, this rm can be avoided:

  buf:         [bbbb]
  shrub:     [llrrrrrr]

  1. carve   [llrr]
  2. append  [llbbbbrr]

This comes at the cost of making our implicit split take more steps (in
code), though I believe the behavior is less subtle/more understandable:

  buf:         [bbbb]
  shrub:     [llllllll]

  1. carve   [ll]
  2. append  [llll]
  3. append  [llbbbbll]

As a plus, we avoid looking up the same sibling twice when doing
implicit splits.
2023-10-14 01:14:56 -05:00
Christopher Haster
aa64c85317 Deduplicated shrub updates into lfsr_file_carveinlined
lfsr_file_carveinlined writes data into a shrub, while handling both the
carving logic of data we might be overlapping, and any hole logic we
need to fill out the tree.

This provides a nice, relatively simple but flexible, operation for all
shrub updates:

  static int lfsr_file_carveinlined(lfs_t *lfs, lfsr_file_t *file,
          lfs_off_t pos, lfs_off_t weight, lfs_soff_t delta,
          lfsr_data_t data);

I'm quite happy with how these internal carveinlined/carvebtree
functions are coming together. It's nice to have all of that logic in
one place, even if it's a bit complex.
2023-10-14 01:14:41 -05:00
Christopher Haster
39f417db45 Implemented a filesystem traversal that understands file bptrs/btrees
Ended up changing the name of lfsr_mtree_traversal_t -> lfsr_traversal_t,
since this behaves more like a filesystem-wide traversal than an mtree
traversal (for one, it returns several typed objects, not mdirs like the
other mtree functions).

As a part of this changeset, lfsr_btraversal_t (was lfsr_btree_traversal_t)
and lfsr_traversal_t no longer return untyped lfsr_data_ts, but instead
return specialized lfsr_{b,t}info_t structs. We weren't even using
lfsr_data_t for its original purpose in lfsr_traversal_t.

Also changed lfsr_traversal_next -> lfsr_traversal_read, you may notice
at this point the changes are intended to make lfsr_traversal_t look
more like lfsr_dir_t for consistency.

---

Internally lfsr_traversal_t now uses a full state machine with its own
enum due to the complexity of traversing the filesystem incrementally.

Because creating diagrams is fun, here's the current full state machine,
though note it will need to be extended for any
parity-trees/free-trees/etc:

  mrootanchor
       |
       v
  mrootchain
  .-'  |
  |    v
  |  mtree ---> openedblock
  '-. | ^           | ^
    v v |           v |
   mdirblock    openedbtree
      | ^
      v |
   mdirbtree
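
As a sketch, the states might map to an enum something like this (my
reading of the state names above, not the actual implementation):

  enum traversal_state {
      T_MROOTANCHOR,  // start at the fixed mroot anchor
      T_MROOTCHAIN,   // follow the chain of extended mroots
      T_MTREE,        // walk the mtree's inner nodes
      T_MDIRBLOCK,    // yield each mdir's blocks
      T_MDIRBTREE,    // walk btrees referenced by the current mdir
      T_OPENEDBLOCK,  // yield blocks of opened (not yet synced) files
      T_OPENEDBTREE,  // walk btrees of opened (not yet synced) files
  };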

I'm not sure I'm happy with the current implementation, and eventually
it will need to be able to handle in-place repairs to the blocks it
sees, so this whole thing may need a rewrite.

But in the meantime, this passes the new clobber tests in test_alloc, so
it should be enough to prove the file implementation works. (which is
definitely is not fully tested yet, and some bugs had to be fixed for
the new tests in test_alloc to pass).

---

Speaking of test_alloc.

The inherent cyclic dependency between files/dirs/alloc makes it a bit
hard to know what order to test these bits of functionality in.

Originally I was testing alloc first, because it seems you need to be
confident in your block allocator before you can start testing
higher-level data structures.

But I've gone ahead and reversed this order, testing alloc after
files/dirs. This is because of an interesting observation that if alloc
is broken, you can always increase the test device's size to some absurd
number (-DDISK_SIZE=16777216, for example) to kick the can down the
road.

Testing in this order allows alloc to use more high-level APIs and
focus on corner cases where the allocator's behavior requires subtlety
to be correct (e.g. ENOSPC).
2023-10-14 01:13:40 -05:00
Christopher Haster
881c46f562 Tweaked lfsr_mtree_traversal_next to no longer write the mtree/mroot
This was a kludge due to needing lfs->mtree initialized to traverse the
mtree, the assumption being that future traversals should strictly
update the mtree/mroot to the existing state.

Moving code around (and adopting an actual state machine, which will be
needed for btree traversal) made this no longer necessary.

Now the mtree/mroot is only initialized in lfsr_mountinited, as it
should be.
2023-10-14 01:13:33 -05:00
Christopher Haster
4996b8419d Implemented most of file btree reading/writing
Still needs testing, though the byte-level fuzz tests were already causing
blocks to crystallize. I noticed this because of test failures, which
are now fixed.

Note the block allocator currently doesn't understand file btrees. To
get the current tests passing requires -DDISK_SIZE=16777216 or greater.

It's probably also worth noting there's a lot that's not implemented
yet! Data checksums and write validation for one. Also ecksums. And we
should probably have some sort of special handling for linear writes so
linear writes (the most common) don't end up with a bunch of extra
crystallizing writes.

Also the fact that btrees can become DAGs now is an oversight and a bit
concerning. Will that work with a closed allocator? Block parity?
2023-10-14 01:12:26 -05:00
Christopher Haster
1e13124091 Tweaked LFS_ASSERT impl to use __builtin_unreachable
First, realized that the LFS_UNREACHABLE logic was flipped after a
confusing test bug (damn double negatives). But also realized LFS_ASSERT
could be tweaked to "call" __builtin_unreachable() on assert failure to
act as a sort of compiler hint.
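
Roughly, the tweak looks something like this (a sketch, not the exact
definition):

  #include <assert.h>

  #if !defined(LFS_NO_INTRINSICS) && (defined(__GNUC__) || defined(__clang__))
  // on failure, "call" __builtin_unreachable() so the compiler may assume
  // the condition holds; note with asserts disabled only the hint remains
  #define LFS_ASSERT(test) do { \
          if (!(test)) { \
              assert(test); \
              __builtin_unreachable(); \
          } \
      } while (0)
  #else
  #define LFS_ASSERT(test) assert(test)
  #endif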

Turns out this hint saves a little bit of code, note both builds have
LFS_UNREACHABLE fixed:

                                   code          stack
  without __builtin_unreachable:  28408           1928
  with __builtin_unreachable:     28324 (-0.3%)   1920 (+0.0%)

Since __builtin_unreachable is a compiler extension, its usage respects
LFS_NO_INTRINSICS.
2023-10-14 01:11:51 -05:00
Christopher Haster
07e977bb43 Progress on file btrees
Added lfsr_bptr_t to represent block pointers (maybe we should rename
mblocks back to mptr), added fetching of btrees/bptrs in
lfsr_file_opencfg, added estimate tracking to our shrubs so we actually
know when to create a btree, and implemented most of the high-level
btree logic.

It's not working yet, but the biggest idea introduced here is how we
handle block alignment.

See, we really don't want awkward btree topologies to form where small
amounts of data get stuck between blocks:

  .-----.--.-----.
  |     |  |     |
  |     |  |     |
  '-----'--'-----'

This is wasteful, as the middle bit of data either gets represented as a
full block with its data partially covered, or as data inlined in the
btree, which comes with ~2x overhead.

The solution here is to scan for a block on either the left or right to
derive our block alignment from.

Unfortunately, since our sibling blocks could have been carved, this
requires scanning all the way from pos-2*B+1 to pos+2*B-1, a total of
4*B-2, to make sure we find a sibling if there is one.

  worst case left  worst case right
   .-----.-----.    .-----.-----.
   | xxxx|     |    |p    |xxxxx|
   |xxxxx|    p|    |     |xxxx |
   '-----'-----'    '-----'-----'
    '----+----'      '----+----'
    pos-2*bs+1       pos+2*bs-1

Fortunately, at this stage, data should have had many chances to
coalesce, so hopefully the actual scan overhead should be much smaller
in practice.

Writing data to a file linearly, for example, only needs a single lookup
to find the previous block.
2023-10-14 01:09:45 -05:00
Christopher Haster
52113c6ead Moved the test/bench runner path behind an optional flag
So now instead of needing:

  ./scripts/test.py ./runners/test_runner test_dtree

You can just do:

  ./scripts/test.py test_dtree

Or with an explicit path:

  ./scripts/test.py -R./runners/test_runner test_dtree

This makes it easier to run the script manually. And, while there may be
some hiccups with the implicit relative path, I think in general this will
make the test/bench scripts easier to use.

There was already an implicit runner path, though only if the test suite
was completely omitted. I'm not sure that would ever have actually
been useful...

---

Also increased the permutation field size in --list-*, since I noticed it
was overflowing.
2023-10-14 00:54:28 -05:00
Christopher Haster
df32211bda Changed -t/--dtree to -f/--files in dbglfs.py
This flag makes more sense to me and avoids conflicts with the
-d/--delta flag used for gstate.
2023-10-14 00:54:06 -05:00
Christopher Haster
a2aa25aa8e Tweaked dbgrbyd.py to show -1 tag rids 2023-10-14 00:53:31 -05:00
Christopher Haster
8c0f99890d Tweaked appendattrs to not need to save changes to rid_ 2023-10-14 00:52:18 -05:00
Christopher Haster
ef691d4cfe Tweaked rbyd lookup/append to use 0 lower rid bias
Previously our lower/upper bounds were initialized to -1..weight. This
made a lot of the math unintuitive and confusing, and it's not really
necessary to support -1 rids (-1 rids arise naturally in order-statistic
trees that can have weight=0).

The tweak here is to use lower/upper bounds initialized to 0..weight,
which makes the math behave as expected. -1 rids naturally arise from
rid = upper-1.
2023-10-14 00:52:00 -05:00
Christopher Haster
501f8cbe10 Implemented lfsr_file_fruncate
This is an exciting new function, made possible by the order-statistic
nature of our rbyds and btrees.

lfsr_file_fruncate is like truncate, but from the front. It can trim
data off of the front of files, and grow files from the front,
effectively prefixing files with zeros cheaply.

This may have some niche use cases for prefixing files with headers, but
the real killer is making logging files trivial. Up until now logging
into a file has always resulted in awkward file-swapping code when a
file gets full. Now maintaining a log is just a single fruncate call.

---

Implementation wise, lfsr_file_fruncate is very similar to
lfsr_file_truncate, except we need to always inject holes into all file
trees to adjust file contents correctly.
2023-10-14 00:51:26 -05:00
Christopher Haster
5adc1f54b7 Implemented and tested lfsr_file_truncate
Not much to say here. We need to modify trees a bit, but at least it's
relatively straightforward.
2023-10-14 00:45:32 -05:00
Christopher Haster
981e64f524 Added more seek tests, fixed some annoying POSIX/etc subtleties
What do you think a file's size becomes when you:

1. seek past the end of a file
2. call write with zero data?

POSIX/etc explicitly mentions this case, noting that zero-sized
writes should never update the file size.

This clashes with the assumption that file writes always update the file
position, but I suppose it makes a bit of practical sense if you want
zero-sized file writes to be idempotent.
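
For reference, the equivalent behavior with POSIX calls (error handling
omitted):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      int fd = open("f.bin", O_RDWR | O_CREAT | O_TRUNC, 0666);
      lseek(fd, 1024, SEEK_SET);  // seek well past the end of the empty file
      write(fd, "", 0);           // zero-sized write
      struct stat st;
      fstat(fd, &st);
      // st.st_size is still 0, the zero-sized write never grew the file
      printf("size: %lld\n", (long long)st.st_size);
      close(fd);
      return 0;
  }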
2023-10-14 00:38:49 -05:00
Christopher Haster
0638b09d18 Switched to using mid to tell which files belong in a compaction
This avoids the previous issues with block state for null inlined data,
and we're already testing the rid anyways for splits.

In theory we don't need the block for inlined data at all, but it is
convenient as it allows us to use the existing internal rbyd/data APIs
without needing to move data around. Though it may be worth looking into
alternative layouts at some point.
2023-10-14 00:33:55 -05:00
Christopher Haster
69993da7e1 Small cleanup of inlined compaction update conditions
This deduplicates quite a bit of logic which is very satisfying.

It could be even better if the block field was located in the same place
for both sprouts and shrubs...
2023-10-14 00:33:04 -05:00
Christopher Haster
a6357e8a5c Renamed test_ftree->files, added fuzz tests, fixed a bug
The bug was a simple miscalculation of how much data to truncate when
carving a left-neighbor that also has a hole.
2023-10-14 00:31:08 -05:00
Christopher Haster
cbbd77708d Actually made the previous commit work
The logic behind relying on pre-commit inlined state to clear any failed
commits was sound, but built on the wrong assumption that file->inlined
would always contain the mdir's block. This was not true for
null-inlined, i.e. no inlined data, since this doesn't really live
anywhere.

Changed file's inlined state to track the mdir block, even when we have
no inlined data. A bit redundant, but a nice invariant to rely on in
lfsr_mdir_compact__.

This invariant also only affects lfsr_mdir_compact__, since this is the
only place inlined data can change blocks.
2023-10-14 00:29:22 -05:00
Christopher Haster
58be838916 Tweaked compact to use pre-commit inlined state
This seems more correct and avoids an extra set of inlined state copies.

Win win.
2023-10-14 00:28:56 -05:00
Christopher Haster
488ba4b650 Fixed mdir estimate during compaction to include shrubs
Once again another function we need to nearly-completely duplicate
thanks to the recursive nature of our shrubs.

I wasn't planning to test this at this stage, but it turns out
byte-level syncs quickly fill up mdirs, triggering early ERANGE asserts
unless we split.

A 32-byte, byte-level-synced shrub already takes up 928 bytes when
including tree overhead, 1856 bytes if you include the unsynced
copy, which is very close to the 2048 byte threshold for splitting
4KiB blocks.
2023-10-14 00:20:25 -05:00
Christopher Haster
b008c2af75 Fixed bug where pre-compact commit clobbered inlined files, other tweaks
We were not properly resetting the staged shrub in lfsr_mdir_commit__,
well, we were sometimes, but only when transitioning from a sprout to a
shrub.

Also tweaked the mdir commit logic to try to only use the staging
inlined state. This just simplifies how much state needs to be
considered when debugging and may result in less data fetches.
2023-10-14 00:13:16 -05:00
Christopher Haster
edc4cb2fa9 Changed TEST_PLS to track number of powerlosses seen by the current test
This turned out to have limited use for the tests themselves. I was
hoping to avoid the mount->format->mount fallback when powerloss
testing, but we still need it in case format was interrupted.

Still, TEST_PLS is very useful for debugging.

Previously it was difficult to set a breakpoint at a specific location,
and after a specific powerloss event. Now all you need is this in gdb:

  b <line> if test_pls == <pls>
2023-10-14 00:11:20 -05:00
Christopher Haster
582dc5f1b2 Added some tests, quick seek impl, fixed bugs
Turns out it's hard to test file holes without seek.

It's interesting to note most of seek's buffer flush work actually
occurs lazily in lfsr_file_write, so lfsr_file_seek turns out to be a
relatively simple function.
2023-10-14 00:09:27 -05:00
Christopher Haster
0724b9a8c4 Really revamped flushbuffer, now leveraging overwriting grow tags
I had completely forgotten about overwriting grow tags, that is, tags
that change the attr's weight while also changing the tag itself.
2023-10-14 00:06:55 -05:00
Christopher Haster
c2d33a1843 Reworked btree-commit/flushbuffer to incrementally build attrs
This basically turns these functions into tiny bounded compilers, which
is interesting to think about. I wonder if this sort of evolution led to
how queries are compiled in modern databases.

This method of attr generation is both easier to use and more flexible.

It also saves some code, but note lfsr_file_flushbuffer underwent
significant tweaking leveraging this, so the actual code savings are a
bit muddy:

            code          stack
  before:  25672           2024
  after:   25452 (-0.9%)   1920 (-5.4%)
2023-10-14 00:01:00 -05:00
Christopher Haster
dc8dce8f0c Introduced coalesce_size and crystallize_size, deduplicated test cfg
- coalesce_size - The amount of data allowed to coalesce into single
  data entries.

- crystallize_size - How much data is allowed to be written to btree
  inner nodes before needing to be compacted into a block.

Also deduplicated the test config, which is something I've been wanting
to do for a while. It doesn't make sense to need to modify several different
instantiations of lfs_config every time a config option is added or
removed...
2023-10-13 23:56:33 -05:00
Christopher Haster
2b950bb16b Reworked flushbuffer logic to merge neighboring pieces of data
This gets pretty ugly and mainly just involves a lot of subtle range
logic.

Our CAT data representation really shines here, but all of the scratch
datas do come with a code/ram cost:

            code          stack
  before:  25448           1920
  after:   25672 (+0.9%)   2024 (+5.1%)
2023-10-13 23:48:54 -05:00
Christopher Haster
4334a848a3 Tweaked mdir commit so it handles all inlined file staging
This saves a bit of code:

            code          stack
  before:  25552           1920
  after:   25448 (-0.4%)   1920 (+0.0%)

But more importantly, this simplifies things and moves all of the
staging/updating logic into lfsr_mdir_commit, where most of the
subtle post-compaction interactions play out.
2023-10-13 23:46:17 -05:00
Christopher Haster
02ae6050de Changed lfsr_data_t internals, added LFSR_DATA_CAT
The main purpose of this change is to introduce LFSR_DATA_CAT, a
generalized way to concatenate various data references internally.

As a side-effect lfsr_data_t has been completely restructured. Now,
lfsr_data_t can be in one of 4 modes:

If the size field's sign bit=0, the lfsr_data_t points in-device. A new
count field determines the encoding:

  sign(size)=0, count=0 => inlined:

    .---+---+---+---.
    |     size      |
    |---+---+---+---|
    |c=0| inlined d |  note inlined data is just enough to hold
    |---+           |  one encoded leb128
    | ata...        |
    '---------------'

  sign(size)=0, count=1 => direct:

    .---+---+---+---.   .---+---+---+---.
    |     size      | .>| data...       |
    |---+---+---+---| | |       .       |
    |c=1|           | | .       .       .
    |---+---+---+---| | .       .       .
    | direct ptr -----' .               .
    '---------------'

  sign(size)=0, count>=2 => indirect:

    .---+---+---+---.   .---+---+---+---.   .---+---+---+---.
    |     size      | .>|     size      | .>| data...       |
    |---+---+---+---| | |---+---+---+---| | |       .       |
    |c>1|           | | |c=1|           | | .       .       .
    |---+---+---+---| | |---+---+---+---| | .       .       .
    | indirect ptr ---' | direct ptr -----' .               .
    '---------------'   '---------------'   .---+---+---+---.
                        |     size      | .>| data...       |
                        |---+---+---+---| | |       .       |
                        |c=1|           | | .       .       .
                        |---+---+---+---| | .       .       .
                        | direct ptr -----' .               .
                        '---+---+---+---'
                        |       .       |
                        |       .       |
                        .       .       .
                        .               .
                        .               .

  note only one indirect layer is allowed due to no recursion

If the size field's sign bit=1, the lfsr_data_t points on-disk:

  sign(size)=1 => on-disk:

    .---+---+---+---.          .....
    |     size      |      ..''     ''..
    |---+---+---+---|     :    :        :
    |     block ------+->|            ..:|
    |---+---+---+---| |  |......( )::::::|
    |      off -------'  |:::'    :      |
    '---------------'     :'       :    :
                           ''..     :.''
                               '''''
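
In C, one way to model these four modes (illustrative only, not the
actual lfsr_data_t layout):

  #include <stdint.h>

  typedef struct data {
      int32_t size;               // sign bit=0: in-device, =1: on-disk
      union {
          struct {                // in-device
              uint16_t count;     // 0: inlined, 1: direct, >=2: indirect
              union {
                  uint8_t inlined[5];          // one encoded leb128
                  const uint8_t *direct;       // points at a flat buffer
                  const struct data *indirect; // one layer of direct datas
              } u;
          } device;
          struct {                // on-disk
              uint32_t block;
              uint32_t off;
          } disk;
      } u;
  } data_t;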

My goal with this commit was to test the new implementation and see how
it would impact code/RAM size before adopting it in the actual file
handling code, and the results are... not great...

            code          stack
  before:  24668           1840
  after:   25552 (+3.5%)   1920 (+4.2%)

I think most of the new cost comes from the now correct handling of
read/cmp with concatenated datas, which previously would just assert.
This change gives us LFSR_DATA_CAT, so I will be working with it for
now, but this may be worth looking at again in the future. Maybe the
correct handling of read/cmp should just be reverted to an assert...
2023-10-13 23:45:41 -05:00