Commit Graph

103 Commits

Christopher Haster
240fe4efe4 Changed CKSUM suptype encoding from 0x2000 -> 0x3000
I may be overthinking things, but I'm guessing that of all the possible
tag modes we may want to add in the future, we will most likely want to
add something that looks vaguely tag-like. Like the shrub tags, for
example.

It's beneficial, ordering wise, for these hypothetical future tags to
come before the cksum tags.

Current tag modes:

  0x0ttt  v--- tttt -ttt tttt  normal tags
  0x1ttt  v--1 tttt -ttt tttt  shrub tags
  0x3tpp  v-11 tttt ---- ---p  cksum tags
  0x4kkk  v1dc kkkk -kkk kkkk  alt tags
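
For illustration, the modes above could be distinguished with something
like this sketch (Python; names and masks follow the table above, not
the actual source):

  def tag_mode(tag):
      # bits 12-14 select the mode, the valid bit (bit 15) is ignored
      mode = (tag >> 12) & 0x7
      if mode & 0x4:
          return 'alt'       # v1dc kkkk -kkk kkkk
      elif mode == 0x3:
          return 'cksum'     # v-11 tttt ---- ---p
      elif mode == 0x1:
          return 'shrub'     # v--1 tttt -ttt tttt
      elif mode == 0x0:
          return 'normal'    # v--- tttt -ttt tttt
      else:
          return 'reserved'  # 0x2ttt left open for future tags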
2023-10-24 23:46:11 -05:00
Christopher Haster
1fc2f672a2 Tweaked tag encoding a bit post-slice to make space for becksum tags 2023-10-24 22:34:21 -05:00
Christopher Haster
35434f8b54 Removed remnants of slice code, and cleaned things up a bit 2023-10-24 22:26:08 -05:00
Christopher Haster
865477d7e1 Changing coalesce strategy, reimplemented shrub/btree carve
Note this is already showing better code reuse, which is a good sign,
though maybe that's just the benefit of reimplementing similar logic
multiple times.

Now both reading and carving end up in the same lfsr_btree_readnext and
lfsr_btree_buildcarve functions for both btrees and shrubs. Both btrees
and shrubs are fundamentally rbyds, so we can share a lot of
functionality as long as we redirect to the correct commit function at
the last minute. This surprising opportunity for deduplication was
noticed while putting together the dbg scripts.

Planned logic (not actual function names):

  lfsr_file_readnext -> lfsr_shrub_readnext
            |                    |
            |                    v
            '---------> lfsr_btree_readnext

  lfsr_file_flushbuffer -> lfsr_shrub_carve ------------.
            .---------------------'                     |
            v                                           v
  lfsr_file_flushshrub  -> lfsr_btree_carve -> lfsr_btree_buildcarve

Though the btree part of the above is only hypothetical at the moment.
Not even the shrubs can survive compaction yet.

The reason is the new SLICE tag, which needs low-level support in rbyd
compact. SLICE introduces indirect references to data located in the
same rbyd, which removes any copying cost associated with coalescing.
Previously, a large coalesce_size risked O(n^2) runtime when
incrementally appending small amounts of data, but with SLICEs we can
defer coalescing to compaction time, where the copy is effectively
free.

This compaction-time coalescing is also hypothetical, which is why our
tests are failing. But the theory is promising.

I was originally against this idea because of how it crosses abstraction
layers, requiring some very low-level code that absolutely cannot be
omitted in a simpler littlefs driver. But after working on the actual
file writing code for a while I've become convinced the tradeoff is
worth it.

Note coalesce_size will likely still need to be configurable. Data in
fragmenting/sparse btrees is still susceptible to coalescing, and the
impact of internal fragmentation is unclear when data sizes approach
the hard block_size/2 limit.
2023-10-17 23:21:18 -05:00
Christopher Haster
fce1612dc0 Reverted to separate BTREE/BRANCH encodings, reordered on-disk structs
My current thinking is that these are conceptually different types, with
BTREE tags representing the entire btree, and BRANCH tags representing
only the inner btree nodes. We already have multiple btree tags anyways:
btrees attached to files, the mtree, and in the future maybe a bmaptree.

Having separate tags also makes it possible to store a btree in a btree,
though I don't think we'll ever use this functionality.

This also removes the redundant weight field from branches. The
redundant weight field is only a minor cost relative to storage, but it
also takes up a bit of RAM when encoding. Though measurements show this
isn't really significant.

New encodings:

  btree encoding:        branch encoding:
  .---+- -+- -+- -+- -.  .---+- -+- -+- -+- -.
  | weight            |  | blocks            |
  +---+- -+- -+- -+- -+  '                   '
  | blocks            |  '                   '
  '                   '  +---+- -+- -+- -+- -+
  '                   '  | trunk             |
  +---+- -+- -+- -+- -+  +---+- -+- -+- -+- -'
  | trunk             |  |     cksum     |
  +---+- -+- -+- -+- -'  '---+---+---+---'
  |     cksum     |
  '---+---+---+---'

Code/RAM changes:

            code          stack
  before:  30836           2088
  after:   30944 (+0.4%)   2080 (-0.4%)

Also reordered other on-disk structs with weight/size, so such structs
always have weight/size as the first field. This may enable some
optimizations around decoding the weight/size without needing to know
the specific type in some cases.

---

This change shouldn't have affected functionality, but it revealed a bug
in a dtree test, where a did gets caught in an mdir split and the split
name makes the did unreachable.

Marking this as a TODO for now. The fix is going to be a bit involved
(fundamental changes to the opened-mdir list), and similar work is
already planned to make removed files work.
2023-10-15 14:53:07 -05:00
Christopher Haster
b936e33643 Tweaked dbg scripts to resize tag repr based on weight
This is a compromise between padding the tag repr correctly and parsing
speed.

If we don't have to traverse an rbyd (for, say, tree printing), we don't
want to, since parsing rbyds can get quite slow when things get big
(remember this is a filesystem!). This makes tag padding a bit of a hard
sell.

Previously this was hardcoded to 22 characters, but with the new file
struct printing it quickly became apparent this would be a problematic
limit:

  12288-15711 block w3424 0x1a.0 3424  67 64 79 70 61 69 6e 71  gdypainq

It's interesting to note that this has only become an issue for large
trees, where the weight/size in the tag can be arbitrarily large.

Fortunately we already have the weight of the rbyd after fetch, so we
can use a heuristic similar to the id padding:

  tag padding = 21 + nlog10(max(weight,1)+1)
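
Or, as a quick Python sketch, assuming nlog10 counts decimal digits
(hypothetical helper, not the actual script code):

  import math

  def tag_padding(weight):
      # 21 columns plus the decimal digits of the largest possible
      # weight/size, i.e. nlog10(max(weight,1)+1)
      return 21 + math.ceil(math.log10(max(weight, 1) + 1))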

---

Also dropped extra information with the -x/--device flag. It hasn't
really been useful and was implemented inconsistently. Maybe -x/--device
should just be dropped completely...
2023-10-14 01:25:14 -05:00
Christopher Haster
c8b60f173e Extended dbglfs.py to show file data structures
You can now pass -s/--structs to dbglfs.py to show any file data
structures:

  $ ./scripts/dbglfs.py disk -B4096 -f -s -t
  littlefs v2.0 0x{0,1}.9cf, rev 3, weight 0.256
  {0000,0001}:  -1.1 hello  reg 128, trunk 0x0.993 128
    0000.0993:           .->    0-15 shrubinlined w16 16     6b 75 72 65 65 67 73 63  kureegsc
                       .-+->   16-31 shrubinlined w16 16     6b 65 6a 79 68 78 6f 77  kejyhxow
                       | .->   32-47 shrubinlined w16 16     65 6f 66 75 76 61 6a 73  eofuvajs
                     .-+-+->   48-63 shrubinlined w16 16     6e 74 73 66 67 61 74 6a  ntsfgatj
                     |   .->   64-79 shrubinlined w16 16     70 63 76 79 6c 6e 72 66  pcvylnrf
                     | .-+->   80-95 shrubinlined w16 16     70 69 73 64 76 70 6c 6f  pisdvplo
                     | | .->  96-111 shrubinlined w16 16     74 73 65 69 76 7a 69 6c  tseivzil
                     +-+-+-> 112-127 shrubinlined w16 16     7a 79 70 61 77 72 79 79  zypawryy

This supports the same -b/-t/-i options found in dbgbtree.py, with the
one exception being -z/--struct-depth which is lowercase to avoid
conflict with the -Z/--depth used to indicate the filesystem tree depth.

I think this is a surprisingly reasonable way to show the inner
structure of files without clobbering the user's console with file
contents.

Don't worry, if clobbering is desired, -T/--no-truncate still dumps all
of the file content.

Though it's still up to the user to manually apply the sprout/shrub
overlay. That step is still complex enough that it isn't implemented in
this tool yet.

2023-10-14 01:25:08 -05:00
Christopher Haster
ef691d4cfe Tweaked rbyd lookup/append to use 0 lower rid bias
Previously our lower/upper bounds were initialized to -1..weight. This
made a lot of the math unintuitive and confusing, and it's not really
necessary to support -1 rids (-1 rids arise naturally in order-statistic
trees that can have weight=0).
The tweak here is to use lower/upper bounds initialized to 0..weight,
which makes the math behave as expected. -1 rids naturally arise from
rid = upper-1.
2023-10-14 00:52:00 -05:00
Christopher Haster
3fb4350ce7 Updated dbg scripts to support shrub trees
- Added shrub tags to tagrepr
- Modified dbgrbyd.py to use last non-shrub trunk by default
- Tweaked dbgrbyd's log mode to find maximum seen weight for id padding
2023-10-13 23:35:03 -05:00
Christopher Haster
2f38822820 Still missing quite a bit, but rudimentary inlined-trees are now working
And by working, I mean you can create inlined trees, just don't
compact/split/move/etc anything. But this does outline the path files
take when writing buffers into inlined trees.

"Inlined trees" in littlefs are entire small rbyd trees embedded as
secondary trees in an mdir's main rbyd tree. When fetching, we can
indicate if a given trunk belongs to the main tree or secondary tree by
setting one of the unused mode bits in the trunk's tag, now called the
"deferred" bit. This bit doesn't need to be included in the alt's "key"
field, so there's no issue with it conflicting with the alt's mode bits.

This requires a bit of tweaking to lfsr_rbyd_fetch, since it needs to fall
back to the previous trunk if it discovers the most recent trunk belongs
to an inlined tree. But as a benefit we can leverage the full power of
rbyds in inlined files, including holes, partial updates, etc.

One downside is it looks like these inlined trees may involve more work
in maintaining their state correctly, since they need to be sort of
"brought along" when mdirs are compacted, even if they don't actually
have a reference in the mdir yet. But the sheer amount of flexibility
this gives inlined files may make this overhead worth it.
2023-10-13 23:11:35 -05:00
Christopher Haster
dd6a4e6496 Dropped the header from dbg scripts
I had never noticed xxd has no header until comparing its output against
dbgblock.py. Turns out these headers aren't really all that useful, and
were even sometimes wrong in dbglfs.py.
2023-09-15 17:45:17 -05:00
Christopher Haster
c1fe64314c Reworked how filesystem-level config is stored
Now, instead of storing a single contiguous block of config data, config
is stored as tagged metadata like any other attribute.

This allows more flexibility towards adding/removing config in the
future, without cluttering up the config with deprecated entries (see
ATA's "IDENTIFY DEVICE" response).

Most of the config entries are single leb128 limits on various integer
types, with the exception of the magic string and version (major/minor
pair).

---

Note this also includes some semantic changes to the config:

- Limits are stored as size-1. This avoids issues with integer overflow
  at extreme ranges (see the sketch after this list).

  This was also adopted for block size (block limit) and block count
  (disk limit). This deviation between on-disk config and user-facing
  config risks confusion, but allows the potential for the full 2^31 range
  for these values.

- The default cksum type, crc32c, has been changed to 0.

  Originally this was 2 to allow the type to map to the crc width for
  crc8, crc16, crc32c, crc64, etc. But dropping this idea and numbering
  checksums as they are implemented simplifies things.

  May come back to this.

- Storing these configs as attributes opens up the option of on-disk
  defaults when configs are missing.

  I'm being a bit conservative with this one, as it's not clear to me if
  we should prefer default configs (less code/storage, risk of untested
  config parsing) or prefer explicit on-disk configs.

  Currently the following have defaults since they seem the most obvious
  to me:

  - cksum type  => defaults to crc32c
  - redund type => defaults to parity (TODO, should this default to
    no redund?)
  - utag_limit  => defaults to 0x7f (no special tag decoding)
  - uattr_limit => defaults to block_limit (implicit)
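
As a minimal sketch of the size-1 limit encoding (Python; the leb128
helpers are illustrative, not the actual source):

  def to_leb128(x):
      out = bytearray()
      while True:
          b, x = x & 0x7f, x >> 7
          out.append(b | 0x80 if x else b)
          if not x:
              return bytes(out)

  def encode_limit(size):
      # stored as size-1, so a full 2^31 limit still fits in 31 bits
      return to_leb128(size - 1)

  def decode_limit(data):
      value, shift = 0, 0
      for b in data:
          value |= (b & 0x7f) << shift
          shift += 7
      return value + 1

  assert decode_limit(encode_limit(4096)) == 4096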
2023-09-15 14:51:25 -05:00
Christopher Haster
f7900edc1c Updated dbg scripts with changes, adopted mbid.mrid in debug prints
This format for mids is a compromise between readability and
debuggability.

For example, if our mbid weight is 256 (4KiB blocks), the 19th entry
in the second mdir would be the raw integer 275. With this mid format,
we would print it as 256.19.

The idea is to make it easy to see it's the 19th entry in the mdir while
still making it relatively easy to see that 256.19 and 275 are
equivalent when debugging.
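
Roughly, in Python (hypothetical helper, not the actual script code):

  def mid_repr(mid, mbid_weight=256):
      # raw mid 275 with an mbid weight of 256 renders as "256.19"
      mbid = mid - (mid % mbid_weight)
      mrid = mid % mbid_weight
      return '%d.%d' % (mbid, mrid)

  assert mid_repr(275) == '256.19'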

---

The scripts also took some tweaking due to the mid change. Tried to keep
the names consistent, but I don't think it's worthwhile to change too
much of the scripts while they are working.
2023-09-05 10:10:30 -05:00
Christopher Haster
256430213d Dropped separate BTREE/BRANCH encodings
There is a bit of redundancy here, as we already know the weights of a
btree's inner branches from their parents. But in theory, sharing the
same encoding for both the top-level btree reference and inner branches
should offer more chance for deduplication and hopefully less code.

This also moves some members around in the btree encoding so that the
redund blocks are at the beginning. This _might_ simplify decoding of
the variable-length redund blocks at some point.

Current btree encoding:

  .----+----+----+----.
  |       blocks    ...  redund leb128s (1-20 bytes)
  :                   :
  |----+----+----+----|
  |       trunk     ...  1 leb128 (1-5 bytes)
  |----+----+----+----|
  |       weight    ...  1 leb128 (1-5 bytes)
  |----+----+----+----|
  |       cksum       |  1 le32 (4 bytes)
  '----+----+----+----'
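
A rough sketch of encoding this struct (Python; the to_leb128 helper is
illustrative, and the cksum is assumed to be a 32-bit le32):

  import struct

  def to_leb128(x):
      out = bytearray()
      while True:
          b, x = x & 0x7f, x >> 7
          out.append(b | 0x80 if x else b)
          if not x:
              return bytes(out)

  def encode_btree(blocks, trunk, weight, cksum):
      # redund block leb128s first, then trunk, weight, and le32 cksum
      out = bytearray()
      for block in blocks:
          out += to_leb128(block)
      out += to_leb128(trunk)
      out += to_leb128(weight)
      out += struct.pack('<I', cksum)
      return bytes(out)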

This also partially reverts some tag name changes:

- BNAME -> BRANCH
- DMARK -> BOOKMARK
2023-08-22 13:20:37 -05:00
Christopher Haster
314c832588 Adopted new struct encoding scheme with redund tag bits
Struct tags, in littlefs, generally encode pointers to different on-disk
data structures. At this point, they've gotten a bit complex, with the
btree struct, for example, containing 1. a block address, 2. the trunk
offset, 3. the weight of the trunk, and 4. a checksum.

Also some future plans:

1. Block redundancy will make it so these pointers may have a variable
   number of block addresses to contend with.

2. Different checksum types may make the checksum field itself variable
   length, at least on larger builds of littlefs.

   This may also happen if we support truncated checksums in littlefs
   for storage saving reasons.

Having two variable-sized fields becomes a bit of a pain. We can use the
encoded tag size to figure out the size of one of these fields, but not
both.

The change here makes it so the tag size now determines the checksum
size, requiring the redundancy amount to go somewhere else. This makes
it so checksums can be variably sized, and the explicit redundancy
amount avoids the need to parse the leb128s fully to know how many
blocks we're expecting.

But where to put the redundancy amount?

This commit carves out 2-bits from the struct tag to store the amount of
redundancy to allow up to 3 blocks of redundancy:

  v0000011 0TTTTTrr
  ^--^---^-^----^-^- valid bit
     '---|-|----|-|- 3-bit mode (0x0 for structs)
         '-|----|-|- 4-bit suptype (0x3 for structs)
           '----|-|- 0 bit (reserved for leb128)
                '-|- 5-bit subtype
                  '- 2-bit redund

3 blocks may sound extremely limiting, but it's a common limit for
filesystems, 1. because you have to keep in mind each redundant block
adds that much more writing/reading overhead and 2. the fact
that 2^(2^n)-1 is always divisible by 3 makes >3 parity blocks much more
complicated mathematically.

Worst case, if we ever have >3 redundant blocks, we can create new
struct subtypes. Maybe adding extended struct types that prefix the
block addresses with a leb128 encoding the redundancy amount.
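
A sketch of how the redund bits might drive decoding (Python; this
assumes the redund field counts blocks beyond the first, which may not
match the actual source):

  def from_leb128(data, off):
      value, shift = 0, 0
      while True:
          b = data[off]; off += 1
          value |= (b & 0x7f) << shift
          shift += 7
          if not (b & 0x80):
              return value, off

  def decode_struct_blocks(tag, data):
      # the 2-bit redund tells us how many block leb128s to parse,
      # without scanning ahead through the leb128s themselves
      redund = tag & 0x3
      blocks, off = [], 0
      for _ in range(redund + 1):
          block, off = from_leb128(data, off)
          blocks.append(block)
      return blocks, off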

---

As a part of this, reorganized the on-disk btree and ecksum encodings to
put the checksum last.

Also split out the btree and inner btree branches as separate struct
types. The btree includes the weight, whereas the weight is implicit in
inner btree branches. This came about after realizing context-specific
prefixes are relatively easy to add thanks to the composability of our
parsers.

This led to some name collisions though:

- BRANCH   -> BNAME
- BOOKMARK -> DMARK
2023-08-11 12:55:48 -05:00
Christopher Haster
d2f2b53262 Renamed fcksum -> ecksum
This checksum is used to keep track of whether we have erased, and not yet
touched, the unused bytes trailing our current commit in the rbyd.

The working theory is that if any prog attempt is made, it will, most
likely, change the checksum of the contents, allowing littlefs to
determine if trailing erased-state is safe to use, even under powerloss.
littlefs can also perturb future data by a single bit, to force this
checksum to always be invalidated during normal operation.

The original name, "forward erased-state checksums (fcksum)", came from the
idea that the checksum "looks forward" into the next commit.

But after using them for a bit, I think the name is unnecessarily
confusing. It, uh, also looks a lot like a swear word. I think
shortening the name to just "erased-state checksums (ecksum)", even
though the previous name is already in use in a release, is reasonable.

---

It's probably hard to believe but the name change from fcrc -> ecrc
really was unrelated to the crc -> cksum change. But boy is it
convenient for avoiding an awkward name. A lot of these name changes
involved sed scripts, so I didn't notice how awkward fcksum would be to
use until writing this commit message.
2023-08-07 14:34:47 -05:00
Christopher Haster
7031d6e1b3 Changed most references to crc/csum -> cksum
The reason for this is to move away from the idea that littlefs is
strictly bound to CRCs and make the code more welcoming to other
checksum types, such as SHA256, etc.

Of course, changing the name doesn't really do anything. littlefs
actually _is_ strictly bound to CRCs in a couple ways that other
filesystems aren't. These would need to have workarounds for other
checksum types:

- We leverage the parity-preserving nature of (some) CRCs to not have
  to also calculate the parity of metadata in rbyd commits.

- We leverage the linearity of CRCs to retroactively flip the
  perturb bit in the cksum tag without needing to recalculate the
  checksum. Though the fact we need to do this is because of how we
  use parity above, so this may just not be needed for non-CRC
  checksums (see the sketch after this list).

- The plans for global-CRCs (not yet implemented) rely heavily on the
  mathematical properties of CRC polynomials. This doesn't mean
  global-CRCs can't work with other checksums, you would just need to
  find a different type of polynomial.
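
The linearity point above is easy to demonstrate: for equal-length
messages, cksum(a^b) = cksum(a) ^ cksum(b) ^ cksum(zeros), so flipping
the perturb bit changes the checksum by a delta that depends only on
the bit's position. A quick check with Python's plain crc32 (littlefs
uses crc32c, but the property is the same):

  import binascii, os

  a = bytearray(os.urandom(16))
  z = bytes(len(a))

  # the delta from flipping one bit depends only on the bit's position
  e = bytearray(z); e[7] ^= 0x01
  delta = binascii.crc32(bytes(e)) ^ binascii.crc32(z)

  before = binascii.crc32(bytes(a))
  a[7] ^= 0x01
  after = binascii.crc32(bytes(a))
  assert after == before ^ delta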
2023-08-07 14:18:37 -05:00
Christopher Haster
d77a173d5c Changed source to consistently use rid for rbyd ids
Originally it made sense to name the rbyd ids, well, ids, at least in
the internals of the rbyd functions. But this doesn't work well outside
of the rbyd code, where littlefs has to juggle several different id
types with different purposes:

- rid => rbyd-id, 31-bit index into an rbyd
- bid => btree-id, 31-bit index into a btree
- mid => mdir-id, 15-bit+15-bit index into the mtree
- did => directory-id, 31-bit unique identifier for directories

Even though context makes it clear which type of id is meant inside the
rbyd internals, updating the name to rid makes it clearer that these are
the same type of id when looking at code both inside and outside the
rbyd functions.
2023-08-07 14:10:09 -05:00
Christopher Haster
64a1b46ea2 Renamed a couple directory related things
- dstart -> bookmark
- *dnamelookup -> *namelookup
2023-08-07 14:00:44 -05:00
Christopher Haster
c928ed131f Changed all dir tests to be reentrant
To help with this, added TEST_PL, which is set to true when powerloss
testing. This way tests can check for stronger conditions (no EEXIST)
when not powerloss testing.

With TEST_PL, there's really no reason every test in t5_dirs shouldn't
be reentrant, and this gives us a huge improvement of test coverage very
cheaply.

---

The increased test coverage caught a bug: gstate wasn't being consumed
properly when uninlining the mtree. Humorously, this went unnoticed
because the most common form of mtree uninlining, mdir splitting, ended
up incorrectly consuming the gstate twice, which canceled itself out
since the consume operation is basically just xor.
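
Since consume is just xor, consuming the same gdelta twice is a no-op,
which is exactly why the double-consume canceled out. A minimal sketch
(hypothetical helper, not the actual source):

  def gstate_consume(gstate, gdelta):
      # consume is xor, so consuming twice cancels itself out
      return bytes(a ^ b for a, b in zip(gstate, gdelta))

  gstate = b'\x01\x03\x24'
  gdelta = b'\x00\x2f\x1b'
  assert gstate_consume(gstate_consume(gstate, gdelta), gdelta) == gstate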

Also added support for printing dstarts to dbglfs.py, to help debugging.
2023-07-18 21:40:43 -05:00
Christopher Haster
b98ac119c7 Added scripts/dbglfs.py for debugging the filesystem tree
Currently this can show:

- The filesystem tree:

    $ ./scripts/dbglfs.py disk -B4096
    littlefs v2.0 0x{0,1}.bd4, rev 1, weight 41
    mdir         ids      name                        type
    {00ce,00cf}:      0.1 dir0000                     dir 0x1070c73
    {0090,0091}:     2.30 |-> child0000               dir 0x8ec7fb2
    {0042,0043}:    24.35 |   |-> grandchild0000      dir 0x32d990b
    {0009,000a}:     25.0 |   |-> grandchild0001      dir 0x1461a08
                     25.1 |   |-> grandchild0002      dir 0x216e9fc
                     25.2 |   |-> grandchild0003      dir 0x7d6aff
                     25.3 |   |-> grandchild0004      dir 0x4b70e14
                     25.4 |   |-> grandchild0005      dir 0x6dc8d17
                     25.5 |   |-> grandchild0006      dir 0x58c7ee3
                     25.6 |   '-> grandchild0007      dir 0x7e7fde0
    {0090,0091}:     2.31 |-> child0001               dir 0xa87fcb1
    {0077,0078}:     29.1 |   |-> grandchild0000      dir 0x12194f5
                     29.2 |   |-> grandchild0001      dir 0x34a17f6
    ...

- The on-disk filesystem config:

    $ ./scripts/dbglfs.py disk -B4096 -c
    littlefs v2.0 0x{0,1}.bd4, rev 1, weight 41
    mdir         ids      tag                     data (truncated)
         config: major_version 2                  02                       .
                 minor_version 0                  00                       .
                 csum_type 2                      02                       .
                 flags 0                          00                       .
                 block_size 4096                  80 20                    .
                 block_count 256                  80 02                    ..
    ...

- Any global-state on-disk:

    $ ./scripts/dbglfs.py disk -B4096 -g -d
    littlefs v2.0 0x{0,1}.bd4, rev 1, weight 41
    mdir         ids      tag                     data (truncated)
         gstate: grm none                         00 00 00 cc 05 57 ff 7f .....W..
    {0000,0001}:       -1 grm 8                   01 03 24 cc 05 57 ff 7f ..$..W..
    {00ce,00cf}:        0 grm 3                   00 2f 1b                ./.
    {00d0,00d1}:        1 grm 3                   01 04 01                ...

  Note this already reveals a bug, since grm none should be all zeros.

Also made some other minor tweaks to dbg scripts for consistency.
2023-07-18 21:40:41 -05:00
Christopher Haster
3959545a9f Some dbg script work
- Changed how names are rendered in dbgbtree.py/dbgmtree.py to be
  consistent with non-names. The special rendering isn't really worth it
  now that names aren't just ascii/utf8.

- Changed the ordering of raw/device/human rendering of btree entries to
  be more consistent with rendering of other entries (don't attempt to
  group btree entries).

- Changed dbgmtree.py header to show information about the mtree.
2023-07-18 21:40:38 -05:00
Christopher Haster
c2d9f1b047 Implemented, but untested, global-removes
This implementation is in theory correct, but of course, being untested,
who knows?

Though this does come with remounting added to all of the directory
tests. This effectively tests that all of the directory creation tests
we have so far maintain grm=0 after each unmount-mount cycle. Which is
valuable.
2023-07-18 21:40:36 -05:00
Christopher Haster
cc0ac25b5e Implemented infrastructure necessary for global-removes
This has, in theory, global-removes (grm) being written out as a part
of directory creation, but they aren't used in any form and so may not
be being written correctly.

But it did require quite a bit of problem solving to get to this point
(the interactions between mtree splits and grms are really annoying), so
it's worth a commit.
2023-07-18 21:40:30 -05:00
Christopher Haster
da810aca26 Implemented mtree path/dname lookup, rudimentary lfsr_mkdir/lfsr_dir_read
This makes it now possible to create directories in the new system.

The new system now uses a single global "mtree" to store all metadata
entries in the filesystem. In this system, a directory is simply a range
of metadata entries. This has a number of benefits, but does come with
its own problems:

1. We need to indicate which directory each file belongs to. To do this
   the file's name entry has been changed to a tuple of leb128-encoded
   directory-id + actual file name:

     01 66 69 6c 65 2e 74 78 74  .file.txt
      ^ '----------+----------'
      '------------|------------ leb128 directory-id
                   '------------ ascii/utf8 name

   If we include the directory-id as part of filename comparison, files
   should naturally be next to other files in the same directory.

2. We need a way to allocate directory-ids for new directories. This
   turns out to be a bit more tricky than I expected.

   We can't use any mid/bid/rid inherent to the mtree, because these
   change on any file creation/deletion. And since we commit the did
   into the tree, that's not acceptable.

   Initially I thought you could just find the largest did and
   increment, but this gives you no way to reclaim deleted dids. And
   sure, deleted dids have no storage consumption, but eventually you
   will overflow the did integer. Since this can suddenly happen in a
   filesystem that's been in a steady-state for years, that's pretty
   unacceptable.

   One solution is to do a simple linear search over the mtree for an
   unused did. But with a runtime of O(n^2 log(n)), this raises
   performance concerns.

   Sidenote: It's interesting to note that the Linux kernel's allocation
   of process-ids, a very similar problem, is surprisingly complex and
   relies on a radix-tree of bitmaps (struct idr). This suggests I'm not
   missing an obvious solution somewhere.

   The solution I settled on here is to instead treat the set of dids
   as a sort of hash table (see the sketch after this list):

   1. Hash the full directory path into a did.
   2. Perform a linear search until we have no collision.

     leb128(truncate28(crc32c("dir")))
          .--------'
          v
     9e cd c8 30 66 69 6c 65 2e 74 78 74  ...0file.txt
     '----+----' '----------+----------'
          '-----------------|------------ leb128 directory-id
                            '------------ ascii/utf8 name

   Worst case, this can still exhibit the worst case O(n^2 log(n))
   performance when we are close to full dids. However that seems
   unlikely to happen in practice, since we don't truncate our hashes,
   unlike normal hash tables. An additional 32-bit word for each file
   is a small price to pay for a low-chance of collisions.

   In the current implementation, I do truncate the hash to 28-bits.
   Since we encode the hash with leb128, and hashes are statistically
   random, this gives us better usage of the leb128 encoding. However
   it does limit a 32-bit littlefs to 256 Mi directories.

   Maybe this should be a configurable limit in the future.

   But that highlights another benefit of this scheme. It's easy to
   change in the future without disk changes.

3. We need a way to know if a directory-id is allocated, even if the
   directory is empty.

   For this we just introduce a new tag: LFSR_TAG_DSTART, which
   is an empty file entry that indicates the directory at the given did
   in the mtree is allocated.

   To create/delete these atomically with the reference in our parent
   directory, we can use the GRM system for atomic renames.

   Note this isn't implemented yet.
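
A rough sketch of the did allocation scheme from 2. (Python; bitwise
crc32c for illustration, with the linear probing on collision omitted):

  def crc32c(data, crc=0):
      # bitwise crc32c (Castagnoli), reflected polynomial 0x82f63b78
      crc ^= 0xffffffff
      for b in data:
          crc ^= b
          for _ in range(8):
              crc = (crc >> 1) ^ (0x82f63b78 if crc & 1 else 0)
      return crc ^ 0xffffffff

  def to_leb128(x):
      out = bytearray()
      while True:
          b, x = x & 0x7f, x >> 7
          out.append(b | 0x80 if x else b)
          if not x:
              return bytes(out)

  def name_entry(dir_path, name):
      # hash the full directory path into a 28-bit did, then prefix
      # the file name with the leb128-encoded did
      did = crc32c(dir_path.encode()) & 0x0fffffff
      return to_leb128(did) + name.encode()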

This is also the first time we finally get around to testing all of the
dname lookup functions, so this did find a few bugs, mostly around
reporting the root correctly.
2023-07-05 13:41:21 -05:00
Christopher Haster
cf588ac3fa Dropped alt-always as an rbyd trunk terminator
Now that tree rebalancing is implemented and needed a null terminator
anyways, I think it's clear that alt-always pointers as trunk
terminators have pretty limited value.

Now a null or other tag is needed for every trunk, which simplifies
checks for end-of-trunk.

Alt-always tags are still emitted for deletes, etc, but there, their
behavior is implicit, so no special checks are needed. Alt-always tags
are naturally cleaned up as a part of rbyd pruning.
2023-06-27 00:49:31 -05:00
Christopher Haster
43dc3a5c8d Implemented tree rebalancing during rbyd compaction
This isn't actually for performance reasons, but to reduce storage
overhead of the rbyd metadata tree, which was showing signs of being
problematic for small block sizes.

Originally, the plan for compaction was to rely on the self-balancing
rbyd append algorithm and simply append each tag to a new tree.
Unfortunately, since each append requires a rewrite of the trunk
(current search path), this introduces ~n*log(n) alts but only uses ~n alts
for the final tree. This really starts to put pressure on small blocks,
where the exponential-ness of the log doesn't kick in and overhead
limits are already tight.

Measuring lfsr_mdir_commit code size, this shows a ~556 byte cost on
thumb: 16416 -> 16972 (+3.4%). Though there are still some optimizations
on the table, this implementation needs a cleanup pass.

               alt overhead  code cost
  rebalance:        <= 28*n      16972
  append:    <= 24*n*log(n)      16416

Note these all assume worst case alt overhead, but we _need_ to assume
worst case for our rbyd estimations, or else the filesystem can get
stuck in unrecoverable compaction states.

Because of the code cost I'm not sure if rebalancing will stay, be
optional, or replace append-compaction completely yet.

Some implementation notes:

- Most tree balancing algorithms rely on true recursion. I suspect
  recursion may be a hard requirement in general, but it's hard to find
  bounded-ram algorithms.

  This solution gets around the ram requirement by leveraging the fact
  that our tags exist in a log to build up each layer in the tree
  tail-recursively. It's interesting to note that this is a special
  case of having little ram but lots of storage.

- Humorously this shouldn't result in a performance improvement. Rbyd
  trees result in a worst case 2*log(n) height, and rebalancing gives us
  a perfect worst case log(n) height, but, since we need an additional
  alt pointer for each node in our tree, things bump back up to 2*log(n).

- Originally the plan was to terminate each node with an alt-always tag,
  but during implementation I realized there was no easy way to get the
  key that splits the children without awkward tree lookups. As a
  workaround each node is terminated with an altle tag that contains the
  key followed by an unreachable null tag. This is redundant information,
  but makes the algorithm easier to implement.

  Fortunately null tags use the smallest tag encoding, which isn't that
  small, but that means this wastes at most 4*n bytes.

- Note this preserves the first-tag-always-ends-up-at-off=0x4 rule, which
  is necessary for the littlefs magic to end up in a consistent place.

- I've dropped dropping vestigial names for now, which means vestigial
  names can remain in btrees indefinitely. Need to revisit this.
2023-06-25 15:23:46 -05:00
Christopher Haster
799e7cfc81 Flipped btree encoding to allow variable redund blocks
This should have been done as a part of the earlier tag reencoding work,
since having the block at the end was what allowed us to move the
redund-count out of the tag encoding.

New encoding:

  [-- 32-bit csum   --]
  [-- leb128 weight --]
  [-- leb128 trunk  --]
  [-- leb128 block  --]

Note that since our tags have an explicit size, we can store a variable
number of blocks. The plan is to use this to eventually store redundant
copies for error correction:

  [-- 32-bit csum   --]
  [-- leb128 weight --]
  [-- leb128 trunk  --]
  [-- leb128 block  --] -.
  [-- leb128 block  --]  +- n redundant blocks
  [-- leb128 block  --]  |
           ...          -'

This does have a significant tradeoff: we need to know the checksum size
to access the btree structure. This doesn't seem like a big deal, but
with the possibility of different checksum types it may become an
annoying issue.

Note that FCRC was also flipped for consistency.
2023-06-19 16:08:50 -05:00
Christopher Haster
e79c15b026 Implemented wide tags for both rbyd commit and lookup
Wide tags are a happy accident that fell out of the realization that we
can view all subtypes of a given tag suptype as a range in our rbyd.
Combining this with how natural it is to operate on ranges in an rbyd
allows us to perform operations on an entire range of subtypes as though
it were a single tag.

- lookup wide tag => find the smallest tag with this tag's suptype, O(log(n))
- remove wide tag => remove all tags with this tag's suptype, O(log(n))
- append wide tag => remove all tags with this tag's suptype, and then
  append our tag, O(log(n))
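
A toy model of these semantics, using a dict in place of an rbyd (the
0x0f00 suptype mask is an assumption based on the encoding below; the
real operations are O(log(n)) tree operations, not scans):

  SUPTYPE = 0x0f00  # assumed suptype mask

  def lookup_wide(attrs, tag):
      # find the smallest tag sharing this tag's suptype
      return min((t for t in attrs if t & SUPTYPE == tag & SUPTYPE),
                 default=None)

  def append_wide(attrs, tag, data):
      # remove every tag sharing this tag's suptype, then append ours
      for t in [t for t in attrs if t & SUPTYPE == tag & SUPTYPE]:
          del attrs[t]
      attrs[tag] = data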

This is very useful for littlefs, where we've already been using tag
subtypes to hold extra type info, and have had to rely on awkward
alternatives such as deleting existing subtypes before writing our new
subtype.

For example, when committing file metadata (not yet implemented), we can
append a wide struct tag to update the metadata while also clearing out any
lingering struct tags from previous commits, all in one rbyd append
operation.

This uses another mode bit in-device to change the behavior of
lfsr_rbyd_commit, of which we have a couple:

  vwgrtttt 0TTTTTTT
  ^^^^---^--------^- valid bit (currently unused, maybe errors?)
   '||---|--------|- wide bit, ignores subtype (in-device)
    '|---|--------|- grow bit, don't create new id (in-device)
     '---|--------|- rm bit, remove this tag (in-device)
         '--------|- 4-bit suptype
                  '- leb128 subtype
2023-06-19 16:08:43 -05:00
Christopher Haster
2467d2e486 Added a separate tag encoding for the mtree
This helps with debugging and can avoid weird issues if a file btree
ever accidentally ends up attached to id -1 (due to fs bug).

Though a separate encoding isn't strictly necessary, maybe this should
be reverted at some point.
2023-06-18 15:12:43 -05:00
Christopher Haster
7180b70c9c Allowed "alta" (altbgt 0) to terminate rbyd trunks, dropped rm bit
This replaces unr with null on disk, though note both the rm bit and unr
are used in-device still, they just don't get written to disk.

This removes the need for the rm bit on disk. Since we no longer need to
figure out what's been removed during fetch, we can save this bit for both
internal and future on-disk use.

Special handling of alta allows us to avoid emitting an unr tag (now null) if
the current trunk is truly unreachable. This is minor now, but important
for a theoretical rbyd rebalance operation (planned), which brings the
rbyd overhead down from ~3x to ~2x.

These changes give us two ways to terminate trunks without a tag:

1. With an alta, if the current trunk is unreachable:

     altbgt 0x403 w0 0x7b
     altbgt 0x402 w0 0x29
     alta w0 0x4

2. With a null, if the current trunk is reachable, either for
   code convenience or because emitting an alta is impossible (an empty
   rbyd for example):

     altbgt 0x403 w0 0x7b
     altbgt 0x402 w0 0x29
     altbgt 0x401 w0 0x4
     null
2023-06-17 18:11:45 -05:00
Christopher Haster
2113d877d6 Moved bits around in tag encoding to allow leb128 custom attributes
Yet another tag encoding, but hopefully narrowing in on a good long term
design. This change trades a subtype bit for the ability to extend
subtypes indefinitely via leb128 in the future.

The immediate benefit is ~unlimited custom attributes, though I'm not
sure how to make this configurable yet. Extended custom attributes may
have a significant impact on alt tag sizes, so it may be worth
defaulting to only 8-bit custom attributes still.

Tag encoding:

   vmmmtttt 0TTTTTTT 0wwwwwww 0sssssss
   ^--^---^--------^--------^--------^- valid bit
      '---|--------|--------|--------|- 3-bit mode
          '--------|--------|--------|- 4-bit suptype
                   '--------|--------|- leb128 subtype
                            '--------|- leb128 weight
                                     '- leb128 size/jump

This limits subtypes to 7-bits, but this seems very reasonable at the
moment.

This also seems to limit custom attributes to 7-bits, but we can use two
separate suptypes to bring this back up to 8-bits. I was planning to do
this anyways to have separate "user-attributes" and "system-attributes",
so this actually fits in really well.
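
For illustration, decoding this layout might look like the following
sketch (Python; names are illustrative, not the actual source):

  def from_leb128(data, off):
      value, shift = 0, 0
      while True:
          b = data[off]; off += 1
          value |= (b & 0x7f) << shift
          shift += 7
          if not (b & 0x80):
              return value, off

  def decode_tag(data):
      # vmmmtttt, then leb128 subtype/weight/size per the diagram above
      valid   = bool(data[0] & 0x80)
      mode    = (data[0] >> 4) & 0x7
      suptype = data[0] & 0xf
      subtype, off = from_leb128(data, 1)
      weight,  off = from_leb128(data, off)
      size,    off = from_leb128(data, off)
      return valid, mode, suptype, subtype, weight, size, off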
2023-06-16 01:51:33 -05:00
Christopher Haster
2339e9865f Tweaked dbgmtree.py -Z flag to include mroots as depth
This helps debug a corrupted mtree with cycles, which has been a problem
in the past.

Also fixed a small rendering issue with dbgmtree.py not connecting inner
tree edges to mdir roots correctly during rendering.
2023-06-01 13:52:14 -05:00
Christopher Haster
c60fa69ce1 Optimized dbg*.py tree generation/rendering by deduplicating edges
Optimizing a script? This might sound premature, but the tree rendering
was, uh, quite slow for any decently sized (>1024) btree.

The main reason is that tree generation is quite hacky in places, repeatedly
spitting out multiple copies of the inner node's rbyd trees for example.

Rather than rewrite the tree generation implementation to be smarter,
this just changes all edge representations to namedtuples (which may
reduce memory pressure a bit), and collects them into a Python set.

This has the effect of deduplicating generated edges efficiently, and
improves the rendering performance significantly.

---

I also considered memoizing the rbyd trees, but dropped the idea since
the current renderer performs well enough.
2023-05-30 18:17:51 -05:00
Christopher Haster
af0c3967b4 Adopted new tree renderer in dbgmtree, implemented mtree rendering
In addition to plugging in the rbyd and btree renderers in dbgbtree.py,
this required wiring in rbyd trees in the mdirs and mroots.

A bit tricky, but with a more-or-less straightforward implementation thanks
to the common edge description used for the tree renderer.

For example, a relatively small mtree:

  $ ./scripts/dbgmtree.py disk -B4096 -t -i
  mroot 0x{0,1}.45, rev 1, weight 0
  mdir                     ids   tag                     ...
  {0000,0001}: .--------->    -1 magic 8                 ...
               | .------->       config 21               ...
               +-+-+             btree 7                 ...
    0006.000a:     | .-+       0 mdir w1 2               ...
  {0002,0003}:     | | '->   0.0 inlined w1 1024         ...
    0006.000a:     '-+-+       1 mdir w1 2               ...
  {0004,0005}:         '->   1.0 inlined w1 1024         ...
2023-05-30 18:10:32 -05:00
Christopher Haster
9b803f9625 Reimplemented tree rendering in dbg*.py scripts
The goal here was to add the option to show the combined rbyd trees in
dbgbtree.py/dbgmtree.py.

This was quite tricky (and not really helped by the hackiness of these
scripts), but was made a bit easier by adding a general-purpose tree
renderer that can render a precomputed set of branches into the tag
output.

For example, a 2-deep rendering of a simple btree with a block size of
1KiB, where you can see a bit of the emergent data-structure:

  $ ./scripts/dbgbtree.py disk -B1024 0x223 -t -Z2 -i
  btree 0x223.90, rev 46, weight 1024
  rbyd                       ids       tag                     ...
  0223.0090:     .-+             0-199 btree w200 9            ...
  00cb.0048:     | |     .->      0-39 btree w40 7             ...
                 | | .---+->     40-79 btree w40 7             ...
                 | | | .--->    80-119 btree w40 7             ...
                 | | | | .->   120-159 btree w40 7             ...
                 | '-+-+-+->   160-199 btree w40 7             ...
  0223.0090: .---+-+           200-399 btree w200 9            ...
  013e.004b: |     |     .->   200-239 btree w40 7             ...
             |     | .---+->   240-279 btree w40 8             ...
             |     | | .--->   280-319 btree w40 8             ...
             |     | | | .->   320-359 btree w40 8             ...
             |     '-+-+-+->   360-399 btree w40 8             ...
  0223.0090: | .---+           400-599 btree w200 9            ...
  01a7.004c: | |   |     .->   400-439 btree w40 8             ...
             | |   | .---+->   440-479 btree w40 8             ...
             | |   | | .--->   480-519 btree w40 8             ...
             | |   | | | .->   520-559 btree w40 8             ...
             | |   '-+-+-+->   560-599 btree w40 8             ...
  0223.0090: | | .-+           600-799 btree w200 9            ...
  021e.004c: | | | |     .->   600-639 btree w40 8             ...
             | | | | .---+->   640-679 btree w40 8             ...
             | | | | | .--->   680-719 btree w40 8             ...
             | | | | | | .->   720-759 btree w40 8             ...
             | | | '-+-+-+->   760-799 btree w40 8             ...
  0223.0090: +-+-+-+          800-1023 btree w224 10           ...
  021f.0298:       |     .->   800-839 btree w40 8             ...
                   |   .-+->   840-879 btree w40 8             ...
                   |   | .->   880-919 btree w40 8             ...
                   '---+-+->  920-1023 btree w104 9            ...

This tree renderer also replaces the ad hoc tree renderer in dbgrbyd.py
for consistency.
2023-05-30 18:04:54 -05:00
Christopher Haster
b67fcb0ee5 Added dbgmtree.py for debugging the littlefs metadata-tree
This builds on dbgrbyd.py and dbgbtree.py by allowing for quick
debugging of the littlefs mtree, which is a btree of rbyd pairs with a
few bells and whistles.

This also comes with a number of tweaks to dbgrbyd.py and dbgbtree.py,
mostly changing rbyd addresses to support some more mdir friendly
formats.

The syntax for rbyd addresses is starting to converge into a couple
common patterns, which is nice for quickly determining, at a glance,
what type of address you are looking at:

- 0x12         => An rbyd at block 0x12
- 0x12.34      => An rbyd at block 0x12 with trunk 0x34
- 0x{12,34}    => An rbyd at either block 0x12 or block 0x34 (an mdir)
- 0x{12,34}.56 => An rbyd at either block 0x12 or block 0x34 with trunk 0x56

These scripts have also been updated to support any number of blocks in
an rbyd address, for example 0x{12,34,56,78}. This is a bit of future
proofing. >2 blocks in mdirs may be explored in the future for the
increased redundancy.
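
A sketch of parsing these addresses (Python; the regex is illustrative,
not lifted from the scripts):

  import re

  def parse_rbyd_addr(addr):
      # accepts 0x12, 0x12.34, 0x{12,34}, 0x{12,34}.56, etc
      m = re.match(
          r'0x(?:\{([0-9a-f,]+)\}|([0-9a-f]+))(?:\.([0-9a-f]+))?$',
          addr, re.IGNORECASE)
      if not m:
          raise ValueError('bad rbyd address: %r' % addr)
      blocks = [int(b, 16)
                for b in (m.group(1) or m.group(2)).split(',')]
      trunk = int(m.group(3), 16) if m.group(3) else None
      return blocks, trunk

  assert parse_rbyd_addr('0x{12,34}.56') == ([0x12, 0x34], 0x56)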
2023-05-30 18:04:54 -05:00
Christopher Haster
975a98b099 Renamed a few superblock-related things
- supermdir -> mroot
- supermagic -> magic
- superconfig -> config
2023-05-30 14:46:56 -05:00
Christopher Haster
738eb52159 Tweaked tag encoding/naming for btrees/branches
LFSR_TAG_BNAME => LFSR_TAG_BRANCH
LFSR_TAG_BRANCH => LFSR_TAG_BTREE

Maybe this will be a problem in the future if our branch structure is
not the same as a standalone btree, but I don't really see that
happening.
2023-05-30 13:41:28 -05:00
Christopher Haster
4e3dca0b81 Partial implementation of a rudimentary mtree
This became surprisingly tricky.

The main issue is knowing when to split mdirs, and how to determine
this without wasting erase cycles.

Unlike splitting btree nodes, we can't salvage failed compacts here. As
soon as the salvage commit is written to disk, the commit becomes
immediately visible to the filesystem because it still exists in the
mtree. This is a problem if we lose power.

We're likely going to need to implement rbyd estimates. This is
something I hoped to avoid because it brings in quite a bit of
complexity and might lead to an annoying amount of storage waste since
our estimates will need to be conservative to avoid unrecoverable
situations.

---

Also changed the on-disk btree/branch struct to store a copy of the weight.

This was already required for the root of the btree; requiring the
weight to be stored in every btree pointer allows better code
deduplication at the cost of some redundancy on btree branches, where
the weight is already implied by the rbyd structure.

This weight is usually a single byte for most branches anyways.

This may be worth revisiting at some point to see if there's any other
unexpected tradeoffs.
2023-05-30 13:28:35 -05:00
Christopher Haster
70a3a2b16e Rough implementation of lfsr_format/mount/unmount
This work already indicates we need more data-related helper
functions. We shouldn't need this many function calls to do "simple"
operations such as fetching the superconfig if it exists.
2023-05-30 13:16:03 -05:00
Christopher Haster
a511696bad Added ability to bypass rbyd fetch during B-tree lookups
This is an absurd optimization that stems from the observation that the
branch encoding for the inner-rbyds in a B-tree is enough information to
jump directly to the trunk of the rbyd without needing an lfsr_rbyd_fetch.

This results in a pretty ridiculous performance jump from O(m log_m(n/m))
to O(log(m) log_m(n/m)).

If the complexity analysis isn't impressive enough, look at some rough
benchmarking of read operations for 4KiB-block, 1K-entry B-trees:

   12KiB ^     ::  :. :: .: .: :. : .: :. : : .. : : . : .: : : :
         |    .:: .::.::.:: ::.::::::::::::.::::::::.::::::::::::.
         |    : :::':: ::'::'::':: :' :':: :'::::::::': ::::::': :
before   |  ::: ::' :' :' :: :' '' '  ' '' : : : '' ' ' '
         | :::            ''
         |:
      0B :'------------------------------------------------------>

  .17KiB ^               ............:::::::::::::::::::::::::::::
         |   .   .....:::::'''''''''  '         '          '
         |  .::::::::::::
after    |  :':''
         |.::
         .:'
      0B :------------------------------------------------------->
         0                                                      1K

In order for this to work, the branch encoding did need to be tweaked
slightly. Before it stored block+off, now it stores block+trunk where
"trunk" is the offset of the entry point into the rbyd tree. Both off
and trunk are enough info to know when to stop fetching, if necessary,
but trunk allows lookups to jump directly into the branches rbyd tree
without a fetch.

With the change to trunk, lfsr_rbyd_fetch has also been extended to allow
fetching of any internal trunks, not just the last trunk in the commit.
This is very useful for dbgrbyd.py, but doesn't currently have a use in
littlefs itself. But it's at least valuable to have the feature available
in case it does become useful.

Note that two cases still require the slower O(m log_m(n/m)) lookup
with lfsr_rbyd_fetch:

1. Name lookups, since we currently use a linear-search O(m) to find names.

2. Validating B-tree rbyd's, which requires a linear fetch O(m) to
   validate the checksums. We will need to do this at least once
   after mount.

It's also worth mentioning this will likely have a large impact on B-tree
traversal speed. Which is huge as I am expecting B-tree traversal to be
the main bottleneck once garbage-collection (or its replacement) is
involved.
2023-04-14 00:51:34 -05:00
Christopher Haster
7eb0c4763a Reversed LFSR_ATTR id/tag argument order
I've been wanting to make this change for a while now (tag,id => id,tag).
The id,tag order matches the common lexicographic order used for sorting
tuples. Sorting tag,id tuples by their id first is less common.

The reason for this order in the codebase is that all attrs on disk
start with their tag first, since its decoding determines the purpose of
the id field (keep in mind this includes other non-tree tags such as
crcs, alts, etc). But with the move to storing weights instead of tags
on disk, this gives us a clear point to switch from tag,w to id,tag
ordering.

I may be thinking too much about this, but it does affect a significant
amount of the codebase.
2023-04-14 00:43:33 -05:00
Christopher Haster
2142b4a09d Reworked dbgrbyd.py's tree renderer to make more sense
While the previous renderer was "technically correct", the attempt to
map rotated alts to their nearest neighbor just made the resulting tree
an unreadable mess.

Now the renderer prunes alts with unreachable edges (like they would be
during lfsr_rbyd_append) and aligns all alts with their destination
trunk. This results in a much more readable, if slightly less accurate,
rendering of the tree.

Example:

  $ ./scripts/dbgrbyd.py -B4096 disk 0 -t
  rbyd 0x0, rev 1, size 1508, weight 40
  off                     ids   tag                     data (truncated)
  0000032a:         .-+->     0 reg w1 1                73                       s
  00000026:         | '->   1-5 reg w5 1                62                       b
  00000259: .-------+--->  6-11 reg w6 1                6f                       o
  00000224: |     .-+-+-> 12-17 reg w6 1                6e                       n
  0000028e: |     | | '->    18 reg w1 1                70                       p
  00000076: |     | '---> 19-20 reg w2 1                64                       d
  0000038f: |     |   .-> 21-22 reg w2 1                75                       u
  0000041d: | .---+---+->    23 reg w1 1                78                       x
  000001f3: | |       .-> 24-27 reg w4 1                6d                       m
  00000486: | | .-----+-> 28-29 reg w2 1                7a                       z
  000004f3: | | | .-----> 30-31 reg w2 1                62                       b
  000004ba: | | | | .---> 32-35 reg w4 1                61                       a
  0000058d: | | | | | .-> 36-37 reg w2 1                65                       e
  000005c6: +-+-+-+-+-+-> 38-39 reg w2 1                66                       f
2023-04-14 00:41:55 -05:00
Christopher Haster
0ccf283321 Changed in-tree tags to store their weights
Storing weights instead of ids just has a number of benefits, suggesting
this is a better design:

- Calculating the id and delta of each rbyd trunk is surprisingly
  easy - id is now just lower+w-1, and no extra conditions are
  needed for unr tags, which just have a weight of zero (see the
  sketch after this list).

- Removes ambiguity around which id unr tags should be assigned to,
  especially unrs that delete ids.

- No more +-1 weirdness when encoding/decoding tag ids - the weight
  can be written as-is and -1 ids are inferred from their weight and
  position in the tree (lower+w-1 = 0+0-1 = -1).

- Weights compress better under leb128 encoding, since they are usually
  quite small.
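
A tiny sketch of the id math: scanning a trunk's weights in order, each
entry's id is just lower+w-1. The weights 1, 5, 6 from the dbgrbyd.py
rendering above give ids 0, 5, 11, i.e. the ranges 0, 1-5, 6-11:

  def ids_from_weights(weights):
      # each entry's id is lower+w-1; a weight of 0 naturally yields
      # an id one less than its position (including -1 at the start)
      lower, ids = 0, []
      for w in weights:
          ids.append(lower + w - 1)
          lower += w
      return ids

  assert ids_from_weights([1, 5, 6]) == [0, 5, 11]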
2023-04-14 00:32:05 -05:00
Christopher Haster
5a1c36f210 Attempting to add weight changes to every rbyd append
This does not work as is due to ambiguity with grows and insertions.

Before, these were disambiguated by separate grow and attr tags. You
effectively grew the neighboring id before claiming its weight
as yours. But now that the attr itself creates the grow/insertion,
it's ambiguous which one is intended.
2023-04-13 18:58:56 -05:00
Christopher Haster
e5cd2904ee Tweaked always-follow alts to follow even for 0 tags
Changed always-follow alts that we use to terminate grow/shrink/remove
operations to use `altle 0xfff0` instead of `altgt 0`.

`altgt 0` gets the job done as long as you make sure tag 0 never ends up
in an rbyd query. But this kept showing up as a problem, and recent
debugging revealed some erroneous 0 tag lookups created vestigial alt
pointers (not necessarily a problem, but space-wasting).

Since we moved to a strict 16-bit tag, making these `altle 0xfff0`
doesn't really have a downside, and means we can expect rbyd lookups
around 0 to behave how one would normally expect.

As a (very minor) plus, the value zero usually has special encodings in
instruction sets, so being able to use it for rbyd_lookups offers a
(very minor) code size saving.

---

Sidenote: The reasons altle/altgt are the way they are (asymmetric):

1. Flipping these alts is a single bit-flip, which only happens if they
   are asymmetric (only one includes the equal case).

2. Our branches are biased to prefer the larger tag. This makes
   traversal trivial. It might be possible to make this still work with
   altlt/altge, but would require some increments/decrements, which
   might cause problems with boundary conditions around the 16-bit tag
   limit.
2023-03-27 02:32:08 -05:00
Christopher Haster
8f26b68af2 Derived grows/shrinks from rbyd trunk, no longer needing explicit tags
I only recently noticed there is enough information in each rbyd trunk
to infer the effective grow/shrinks. This has a number of benefits:

- Cleans up the tag encoding a bit, no longer expecting tag size to
  sometimes contain a weight (though this could've been fixed other
  ways).

  0x6 in the lower nibble is now reserved exclusively for in-device
  tags.

- grow/shrinks can be implicit to any tag. Will attempt to leverage this
  in the future.

- The weight of an rbyd can no longer go out-of-sync with itself. While
  this _shouldn't_ happen normally, if it does I imagine it'd be very
  hard to debug.

  Now, there is only one source of knowledge about the weight of the
  rbyd: The most recent set of alt-pointers.

Note that remove/unreachable tags now behave _very_ differently when it
comes to weight calculation: remove tags require the tree to make the
tag unreachable. This is a tradeoff for the above.
2023-03-27 01:45:34 -05:00
Christopher Haster
546fff77fb Adopted full le16 tags instead of 14-bit leb128 tags
The main motivation for this was issues fitting a good tag encoding into
14-bits. The extra 2-bits (though really only 1 bit was needed) from
making this not a leb encoding opens up the space from 3 suptypes to
15 suptypes, which is nothing to shake a stick at.

The main downsides:
1. We can't rely on leb encoding for effectively-infinite extensions.
2. We can't shorten small tags (crcs, grows, shrinks) to one byte.

For 1., extending the leb encoding beyond 14-bits is already
unpalatable, because it would increase RAM costs in the tag
encoder/decoder, which must assume a worst-case tag size, and would
likely add storage cost to every alt pointer (more on this below).

The current encoding is quite generous, so I think it is unlikely we
will exceed the 16-bit encoding space. But even if we do, it's possible
to use a spare bit for an "extended" set of tags in the future.

As for 2., the lack of compression is a downside, but I've realized the
only tags that really matter storage-wise are the alt pointers. In any
rbyd there will be roughly O(m log m) alt pointers, but at most O(m) of
any other tags. What this means is that the encoding of any other tag is
in the noise of the encoding of our alt pointers.

Our alt pointers are already pretty densely packed. But because the
sparse key part of alt-pointers are stored as-is, the worst-case
encoding of in-tree tags likely ends up as the encoding of our
alt-pointers. So going up to 3-byte tags adds a surprisingly large
storage cost.

As a minor plus, le16s should be slightly cheaper to encode/decode. It
should also be slightly easier to debug tags on-disk.

  tag encoding:
                     TTTTtttt ttttTTTv
                        ^--------^--^^- 4+3-bit suptype
                                 '---|- 8-bit subtype
                                     '- valid bit
  iiii iiiiiii iiiiiii iiiiiii iiiiiii
                                     ^- m-bit id/weight
  llll lllllll lllllll lllllll lllllll
                                     ^- m-bit length/jump

Also renamed the "mk" tags, since they no longer have special behavior
outside of providing names for entries:
- LFSR_TAG_MK       => LFSR_TAG_NAME
- LFSR_TAG_MKBRANCH => LFSR_TAG_BNAME
- LFSR_TAG_MKREG    => LFSR_TAG_REG
- LFSR_TAG_MKDIR    => LFSR_TAG_DIR
2023-03-25 14:36:29 -05:00
Christopher Haster
f4e2a1a9b4 Tweaked dbgbtree.py to show higher names when -i is not specified
This avoids showing vestigial names in a case where they could be really
confusing.
2023-03-21 13:29:55 -05:00