This implements a common B-tree using rbyds as inner nodes.
Since our rbyds actually map to sorted arrays, this fits together quite
well.
The main caveat/concern is that we can't rely on strict knowledge of
the on-disk size of these things. This first shows up with B-tree
insertion: we can't split in preparation for an insert as we descend
down the tree.
Normally, this means our B-tree would require recursion in order to keep
track of each parent as we descend down our tree. However, we can
avoid this by not storing our parent, but by looking it up again on each
step of the splitting operation.
This brute-force-ish approach makes our algorithm tail-recursive, so
bounded RAM, but raises our runtime from O(logB(n)) to O(logB(n)^2).
That being said, O(logB(n)^2) is still sublinear, and, thanks to
B-trees' extremely high branching factor, may be insignificant.
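To get a feel for the numbers, here's a tiny standalone sketch plugging
a plausible (hypothetical, not measured) branching factor into both
bounds:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        // hypothetical branching factor and entry count, just to
        // compare O(logB(n)) against O(logB(n)^2)
        double B = 100;  // entries per inner rbyd node
        double n = 1e9;  // total entries in the B-tree
        double h = log(n)/log(B);
        printf("logB(n):   %.1f\n", h);    // ~4.5 node visits
        printf("logB(n)^2: %.1f\n", h*h);  // ~20.2 node visits, still small
        return 0;
    }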
The way sparse ids interact with our flat id+attr tree is a bit wonky.
Normally, with weighted trees, one entry is associated with one weight.
But since our rbyd trees use id+attr pairs as keys, in theory each set of
id+attr pairs should share a single weight.
+-+-+-+-> id0,attr0  -.
| | | '-> id0,attr1   +- weight 5
| | '-+-> id0,attr2  -'
| |   |
| |   '-> id5,attr0  -.
| '-+-+-> id5,attr1   +- weight 5
|   | '-> id5,attr2  -'
|   |
|   '-+-> id10,attr0 -.
|     '-> id10,attr1  +- weight 5
'-------> id10,attr2 -'
To make this representable, we could give a single id+attr pair the
weight, and make the other attrs have a weight of zero. In our current
scheme, attr0 (actually LFSR_TAG_MK) is the only attr required for every
id, and it has the benefit of being the first attr found during
traversal. So it is the obvious choice for storing the id's effective weight.
But there's still some trickiness. Keep in mind our ids are derived from
the weights in the rbyd tree. So if we follow intuition and implement
this naively:
+-+-+-+-> id0,attr0  weight 5
| | | '-> id5,attr1  weight 0
| | '-+-> id5,attr2  weight 0
| |   |
| |   '-> id5,attr0  weight 5
| '-+-+-> id10,attr1 weight 0
|   | '-> id10,attr2 weight 0
|   |
|   '-+-> id10,attr0 weight 5
|     '-> id15,attr1 weight 0
'-------> id15,attr2 weight 0
Suddenly the ids in the attr sets don't match!
It may be possible to work around this with special cases for attr0, but
this would complicate the code and make the presence of attr0 a strict
requirement.
Instead, if we associate each attr set not with the smallest id in the
weight but with the largest id in the weight, so id' = id+(weight-1),
then our requirements work out while still keeping each attr set on the
same low-level id:
+-+-+-+-> id4,attr0  weight 5
| | | '-> id4,attr1  weight 0
| | '-+-> id4,attr2  weight 0
| |   |
| |   '-> id9,attr0  weight 5
| '-+-+-> id9,attr1  weight 0
|   | '-> id9,attr2  weight 0
|   |
|   '-+-> id14,attr0 weight 5
|     '-> id14,attr1 weight 0
'-------> id14,attr2 weight 0
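As a minimal sketch, the mapping itself is just arithmetic
(lfsr_effective_id is a hypothetical name for illustration, not an
actual littlefs function):

    #include <stdint.h>
    #include <stdio.h>

    // map the smallest id covered by a weighted entry to the largest,
    // which is the id the attr set is keyed on in this scheme
    static uint32_t lfsr_effective_id(uint32_t id, uint32_t weight) {
        return id + (weight-1);
    }

    int main(void) {
        // the three entries from the diagram above, each with weight 5
        for (uint32_t id = 0; id <= 10; id += 5) {
            printf("id%u -> id%u\n",
                    (unsigned)id, (unsigned)lfsr_effective_id(id, 5));
        }
        return 0;
    }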
To be blunt, this is unintuitive, and I'm worried it may be its own
source of complexity/bugs. But this representation does solve the problem
at hand, so I'm just going to see how it works out.
- Fixed off-by-one id for unknown tags.
- Allowed block_size and block to go unspecified, assuming the block
  device is one big block in that case.
- Added --buffer and --ignore-errors to watch.py, making it a bit better
  for watching slow and sometimes-erroring scripts, such as dbgrbyd.py
  when watching a block device under test.
Well, not really fixed, more just added an assert to make sure
lfsr_rbyd_lookup is not called with tag 0. Because our alt tags only
encode less-than-or-equal and greater-than, which can be flipped
trivially, it's not possible to encode removal of tag 0 during deletes.
Fortunately, this tag should already not exist for other pragmatic
reasons; it was just used as the initial value for traversals, which is
where it could cause this bug.
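A rough sketch of where the guard sits (simplified for illustration; the
real lfsr_rbyd_lookup takes an rbyd, id, and more):

    #include <assert.h>
    #include <stdint.h>

    static int lfsr_rbyd_lookup(uint32_t tag) {
        // alt tags only encode le/gt bounds, so removal of tag 0 is
        // unrepresentable; forbid lookups of tag 0 outright
        assert(tag != 0);
        // actual lookup elided
        (void)tag;
        return 0;
    }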
If we combine rbyd ids and B-tree weights, we need 32-bit ids since this
will eventually need to cover the full range of a file. This simply
doesn't fit into a single word anymore, unless littlefs uses 64-bit tags.
Generally not a great idea for a filesystem targeting even 8-bit
microcontrollers.
So here is a tag encoding that uses 3 leb128 words. This will likely
have more code cost and slightly more disk usage (we can no longer fit
tags into 2 bytes), though with most tags being alt pointers (O(m log m)
vs O(m)), this may not be that significant.
Note that we try to keep tags limited to 14 bits to avoid an extra
leb128 byte, which would likely affect all alt pointers. To pull this
off we do away with the subtype/suptype distinction, limiting in-tree
tag types to 10 bits encoded on a per-suptype basis:
in-tree tags:
ttttttt ttt00rv
^---------^------ 10-bit type
             ^--- removed bit
              ^-- valid bit
iiii iiiiiii iiiiiii iiiiiii iiiiiii
^- n-bit id
lllllll lllllll lllllll lllllll
^- m-bit length
out-of-tree tags:
ttttttt ttt010v
^---------^------ 10-bit type
              ^-- valid bit
0000000
lllllll lllllll lllllll lllllll
^- m-bit length
alt tags:
kkkkkkk kkk1dcv
^---------^------ 10-bit key
            ^---- direction bit
             ^--- color bit
              ^-- valid bit
wwww wwwwwww wwwwwww wwwwwww wwwwwww
^- n-bit weight
jjjjjjj jjjjjjj jjjjjjj jjjjjjj
^- m-bit jump
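To make the cost concrete, here's a standalone sketch that
leb128-encodes one such tag (the type value is made up; only the bit
layout comes from the diagrams above):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    // minimal leb128 writer
    static size_t toleb128(uint32_t word, uint8_t *buf) {
        size_t i = 0;
        do {
            buf[i] = word & 0x7f;
            word >>= 7;
            if (word) {
                buf[i] |= 0x80;
            }
            i += 1;
        } while (word);
        return i;
    }

    int main(void) {
        // hypothetical in-tree tag: 10-bit type 0x021, not removed, valid
        uint32_t tag = (0x021 << 4) | (0 << 1) | 1;
        uint8_t buf[5];
        // 14-bit tags stay within 2 leb128 bytes; the id and length
        // words follow as separate leb128 words
        printf("tag encodes to %zu bytes\n", toleb128(tag, buf));
        return 0;
    }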
The real pain is that with separate integers for id and tag, it no
longer makes sense to combine these into one big weight field. This
requires a significant rewrite.
I'm still not sure this is the best decision, since it may add some
complexity to tag parsing, but making most crc tags one byte may be
valuable since these exist in every single commit.
This gives tags three high-level encodings:
in-tree tags:
iiii iiiiiii iiiiitt ttTTTTT TTT00rv
^----------------^-------------------- 16-bit id
                  ^---^--------------- 4-bit suptype
                       ^-------^------ 8-bit subtype
                                  ^--- removed bit
                                   ^-- valid bit
lllllll lllllll lllllll lllllll
^- n-bit length
out-of-tree tags:
---- ------- -----TT TTTTTTt ttt01pv
                  ^-------^----------- 8-bit subtype
                           ^---^------ 4-bit suptype
                                  ^--- perturb bit
                                   ^-- valid bit
lllllll lllllll lllllll lllllll
^- n-bit length
alt tags:
wwww wwwwwww wwwwwww wwwwwww www1dcv
^------------------------------^------ 28-bit weight
                                 ^---- direction bit
                                  ^--- color bit
                                   ^-- valid bit
jjjjjjj jjjjjjj jjjjjjj jjjjjjj
^- n-bit jump
Having the location of the subtype flipped for crc tags vs tree tags is
unintuitive, but it makes more crc tags fit in a single byte, while
preserving expected tag ordering for tree tags.
The only case where crc tags don't fit in a single byte is if non-crc
checksums (sha256?) are added, at which point I expect the subtype to
indicate which checksum algorithm is in use.
$ ./scripts/dbgrbyd.py disk 4096 0 -t
mdir 0x0, rev 1, size 121
off tag data (truncated)
0000005e: +-+-+--> uattr 0x01 4 aa aa aa aa ....
0000000f: | | '--> uattr 0x02 4 aa aa aa aa ....
0000001d: | '----> uattr 0x03 4 aa aa aa aa ....
0000002d: | .----> uattr 0x04 4 aa aa aa aa ....
0000003d: | | .--> uattr 0x05 4 aa aa aa aa ....
0000004f: '-+-+-+> uattr 0x06 4 aa aa aa aa ....
00000004:       '> uattr 0x07 4 aa aa aa aa ....
Unfortunately this tree can end up a bit confusing when alt pointers
live in unrelated search paths...
Toying around with the idea that, since rbyd trees have strict height
guarantees after compaction (2*log2(n)+1), we can proactively calculate
the maximum on-disk space required for a worst-case tree+leb128
encoding.
This would _greatly_ simplify things such as metadata compaction and
splitting, and allow unstorable file metadata (too many custom
attributes) to error early.
One issue is that this calculated worst case will likely be ~4-5x worse
than the actual encoding due to leb128 compression. Though this may be
an acceptable tradeoff for the simplification and more reliable
behavior.
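A standalone sketch of what such a bound might look like; the sizes here
are hypothetical worst-case leb128 widths, and the bound deliberately
over-approximates as described:

    #include <stdint.h>
    #include <stdio.h>

    // height bound after compaction: 2*log2(n)+1
    static uint32_t rbyd_height(uint32_t n) {
        uint32_t log2n = 0;
        while (((uint32_t)1 << log2n) < n) {
            log2n += 1;  // ceil(log2(n))
        }
        return 2*log2n + 1;
    }

    int main(void) {
        uint32_t n = 100;               // tags in the rbyd
        uint32_t max_word = 5;          // a 32-bit leb128 word is <= 5 bytes
        uint32_t max_tag = 3*max_word;  // 3 leb128 words per tag/alt
        // every tag preceded by at most height alts, all maximally sized
        uint32_t worst = n * (rbyd_height(n) + 1) * max_tag;
        printf("worst case: %u bytes\n", (unsigned)worst);
        return 0;
    }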
Previously the subtype was encoded above the suptype. This was an issue
if you wanted to, say, traverse all tags in a given suptype.
I'm not sure yet if this sort of functionality is needed; it may be
useful for cleaning up/replacing classes of tags, such as file struct
tags. At the very least it avoids unintuitive tag ordering in the tree,
which could potentially cause problems for creates/deletes.
New encoding:
tags:
iiii iiiiiii iiiiitt ttTTTTT TTT0trv
^----------------^-------------------- 16-bit id
                  ^---^          ^---- 5-bit suptype (split)
                       ^-------^------ 8-bit subtype
                                  ^--- perturb/remove bit
                                   ^-- valid bit
lllllll lllllll lllllll lllllll
^- n-bit length
alts:
wwww wwwwwww wwwwwww wwwwwww www1dcv
^------------------------------^------ 28-bit weight
                                 ^---- direction bit
                                  ^--- color bit
                                   ^-- valid bit
jjjjjjj jjjjjjj jjjjjjj jjjjjjj
^- n-bit jump
Also a large number of renames and other cleanup.
Tree deletion is such a pain. It always seems like an easy addition to
the core algorithm but always comes with problems.
The initial plan for deletes was to iterate through all tags, tombstone,
and then adjust weights as needed. This accomplishes deletes with little
change to the rbyd algorithm, but adds a complex traversal inside the
commit logic. Doable in one commit, but complex. It also risks weird
unintuitive corner cases since the cost of deletion grows with the number
of tags being deleted (O(m log n)).
But this rbyd data structure is a tree, so in theory it's possible to
delete a whole range of tags in a single O(log n) operation.
---
This is a proof-of-concept range deletion algorithm for rbyd trees.
Note, this does not preserve rbyd's balancing properties! But it is no
worse than tombstoning. This is acceptable for littlefs as any
unbalanced trees will be rebalanced during compaction.
The idea is to follow the same underlying dhara algorithm, where we
follow a search path and save any alt pointers not taken, except that we
follow both search paths that form the outside of the range, and keep
only the outside edges.
For example, a tree:
        .-------o-------.
        |               |
    .---o---.       .---o---.
    |       |       |       |
  .-o-.   .-o-.   .-o-.   .-o-.
  |   |   |   |   |   |   |   |
  a   b   c   d   e   f   g   h
To delete the range d-e, we would search for d, and search for e:
        ********o********
        *               *
    .---*****       *****---.
    |       *       *       |
  .-o-.   .-***   ***-.   .-o-.
  |   |   |   *   *   |   |   |
  a   b   c   d   e   f   g   h
And keep the outside edges:
    .---                 ---.
    |                       |
  .-o-.   .-         -.   .-o-.
  |   |   |           |   |   |
  a   b   c           f   g   h
But how do we combine the outside edges? The simpler option is to do
both searches separately, one after the other. This would end up with a
tree like this:
    .---------o
    |         |
  .-o-.   .---o
  |   |   |   |
  a   b   c   o---------.
              |         |
              o---.   .-o-.
              |   |   |   |
              _   f   g   h
But this horribly throws off the balance of our tree! It's worse than
tombstoning, and gets worse with more tags.
An alternative strategy, which is used here, is to alternate edges as we
descend down the tree. This unfortunately is more complex, and requires
~2x the RAM, but better preserves the balance of our tree. It isn't
perfect, because we lose color information, but we can leave that up to
compaction:
    .---------o
    |         |
  .-o-.       o---------.
  |   |       |         |
  a   b   .---o       .-o-.
          |   |       |   |
          c   o---.   g   h
              |   |
              _   f
I also hope this can be merged into lfs_rbyd_append, deduplicating the
entire core rbyd append algorithm.
Considered adding --ignore-errors to watch.py, but it doesn't really
make sense with watch.py's implementation. watch.py would need to not update
in realtime, which conflicts with other use cases.
It's quite lucky a spare bit is free in the tag encoding; this means we
don't need a reserved length value as originally planned. We end up
using all of the bits that overlap the alt pointer encoding, which is
nice and unexpected.
It turns out statefulness works quite well with this algorithm. (The
prototype was in Haskell, which created some artificial problems. I
think it may have just been too high-level a language for this
near-instruction-level algorithm.)
This bias makes it so that tag lookups always find the first tag >= the
requested tag, unless we are at the end of the tree.
This makes tree traversal trivial, which is quite nice.
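For example, with a >=-biased lookup, a full traversal is just a loop
that asks for tag+1. A toy standalone demo, with a sorted array standing
in for the rbyd:

    #include <stdint.h>
    #include <stdio.h>

    // stand-in for an rbyd: lookup returns the smallest tag >= the
    // requested tag, or -1 at the end of the tree
    static const uint32_t tags[] = {0x11, 0x21, 0x31};

    static int32_t lookup_ge(uint32_t tag) {
        for (unsigned i = 0; i < sizeof(tags)/sizeof(tags[0]); i++) {
            if (tags[i] >= tag) {
                return (int32_t)tags[i];
            }
        }
        return -1;
    }

    int main(void) {
        // start past tag 0 (which must not be looked up) and repeatedly
        // ask for the next strictly-greater tag
        for (int32_t tag = lookup_ge(1);
                tag >= 0;
                tag = lookup_ge((uint32_t)tag + 1)) {
            printf("0x%02x\n", (unsigned)tag);
        }
        return 0;
    }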
Need to remove ntag now; it's no longer needed.
- Moved alt encoding 0x1 => 0x4, which can lead to slightly better
  lookup tables; the perturb bit takes the same place as the color bit,
  which means both can be ignored in readonly operations.
- Dropped lfs_rbyd_fetchmatch; asking each lfs_rbyd_fetch call to pass
  NULL isn't that bad.
New encoding:
tags:
iiii iiiiiii iiiiiTT TTTTTTt ttt0tpv
^----------------^-------------------- 16-bit id
                  ^-------^----------- 8-bit type2
                           ^---^ ^---- 5-bit type1 (split)
                                  ^--- perturb bit
                                   ^-- valid bit
llll lllllll lllllll lllllll lllllll
^- n-bit length
alts:
wwww wwwwwww wwwwwww wwwwwww www1dcv
^------------------------------^------ 28-bit weight
                                 ^---- direction bit
                                  ^--- color bit
                                   ^-- valid bit
jjjj jjjjjjj jjjjjjj jjjjjjj jjjjjjj
^- n-bit jump
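As a sanity check of the layout, here's a standalone sketch that packs
and unpacks these fields with plain shifts (the values are arbitrary;
only the bit positions come from the diagram above):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        // pack a hypothetical tag: id=3, type2=0x12, type1=0x0b (split
        // across the zero bit that distinguishes tags from alts)
        uint32_t type1 = 0x0b;
        uint32_t tag = ((uint32_t)3 << 16)   // 16-bit id
                | (0x12u << 8)               // 8-bit type2
                | ((type1 & 0x1e) << 3)      // high 4 bits of type1
                | ((type1 & 0x1) << 2)       // low bit of type1
                | (0u << 1)                  // perturb bit
                | 1u;                        // valid bit
        // bit 3 is zero for tags, one for alts
        if (!(tag & 0x8)) {
            uint32_t id    = tag >> 16;
            uint32_t type2 = (tag >> 8) & 0xff;
            uint32_t t1    = ((tag >> 3) & 0x1e) | ((tag >> 2) & 0x1);
            printf("id %u, type1 0x%02x, type2 0x%02x\n",
                    (unsigned)id, (unsigned)t1, (unsigned)type2);
        }
        return 0;
    }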