This is the start of (yet another) rework of rbyd range removals, this
time in an effort to preserve the rby structure that maps to a balanced
2-3-4 tree. Specifically, the property that all search paths have the
same number of black edges (2-3-4 nodes).
This is currently incomplete, as you can probably tell from the mess,
but this commit at least gets a working altn/alta encoding in place
necessary for representing empty 2-3-4 nodes. More on that below.
---
First the problem:
My assumption, when implementing the previous range removal algorithms,
was that we only needed to maintain the existing height of the tree.
The existing rbyd operations limit the height to strictly log n. And
while we can't _reduce_ the height to maintain perfect balance, we can
at least avoid _increasing_ the height, which means the resulting tree
should have a height <= log n. Since our rbyds are bounded by the
block_size b, this means worst case our rbyd can never exceed a height
<= log b, right?
Well, not quite.
This is true the instant after the remove operation. But there is an
implicit assumption that future rbyd operations will still be able to
maintain height <= log n after the remove operation. This turns out to
not be true.
The problem is that our rbyd appends only maintain height <= log n if
our rby structure is preserved. If the rby structure is broken, rbyd
append assumes an rby structure that doesn't exist, which can lead to an
increasingly unbalanced tree.
Consider this happily balanced tree:
.-------o-------. .--------o
.---o---. .---o---. .---o---. |
.-o-. .-o-. .-o-. .-o-. .-o-. .-o-. |
.o. .o. .o. .o. .o. .o. .o. .o. .o. .o. .o. .o. |
a b c d e f g h i j k l m n o p => a b c d e f g h i
'------+------'
remove
After a range removal it looks pretty bad, but note the height is still
<= log n (old n not the new n). We are still <= log b.
But note what happens if we start to insert attrs into the short half of
the tree:
.--------o
.---o---. |
.-o-. .-o-. |
.o. .o. .o. .o. |
a b c d e f g h i
.-----o
.--------o .-+-r
.---o---. | | | |
.-o-. .-o-. | | | |
.o. .o. .o. .o. | | | |
a b c d e f g h i j'k'l'
.-------------o
.---o .---+-----r
.--------o .-o .-o .-o .-+-r
.---o---. | | | | | | | | | |
.-o-. .-o-. | | | | | | | | | |
.o. .o. .o. .o. | | | | | | | | | |
a b c d e f g h i j'k'l'm'n'o'p'q'r'
Our right side is generating a perfectly balanced tree as expected, but
the left side is suddenly twice as far from the root! height(r')=3,
height(a)=6!
The problem is when we append l', we don't really know how tall the tree
is. We only know l' has one black edge, which assuming rby structure is
preserved, means all other attrs must have one black edge, so creating a
new root is justified.
In reality this just makes the tree grow increasingly unbalanced,
increasing the height of the tree by worst case log n every range
removal.
---
It's interesting to note this was discovered while debugging
test_fwrite_overwrite, specifically:
test_fwrite_overwrite:1181h1g2i1gg2l15o10p11r1gg8s10
It turns out the append fragments -> delete fragments -> append/carve
block + becksum loop contains the perfect sequence of attrs necessary to
turn this tree imbalance into a linked-list!
.-> 0 data w1 1
.-b-> 1 data w1 1
| .-> 2 data w1 1
.-b-b-> 3 data w1 1
| .-> 4 data w1 1
| .-b-> 5 data w1 1
| | .-> 6 data w1 1
.---b-b-b-> 7 data w1 1
| .-> 8 data w1 1
| .-b-> 9 data w1 1
| | .-> 10 data w1 1
| .-b-b-> 11 data w1 1
| .-b-----> 12 data w1 1
.-y-y-------> 13 data w1 1
| .-> 14 data w1 1
.-y---------y-> 15 data w1 1
| .-> 16 data w1 1
.-y-----------y-> 17 data w1 1
| .-> 18 data w1 1
.-y-------------y-> 19 data w1 1
| .-> 20 data w1 1
.-y---------------y-> 21 data w1 1
| .-> 22 data w1 1
.-y-----------------y-> 23 data w1 1
| .-> 24 data w1 1
.-y-------------------y-> 25 data w1 1
| .---> 26 data w1 1
| | .-> 27-2047 block w2021 10
b-------------------r-b-> becksum 5
Note, to reproduce this you need to step through with a breakpoint on
lfsr_bshrub_commit. This only shows up in the file's intermediary btree,
which at the time of writing ends up at block 0xb8:
$ ./scripts/test.py \
test_fwrite_overwrite:1181h1g2i1gg2l15o10p11r1gg8s10 \
-ddisk --gdb -f
$ ./scripts/watch.py -Kdisk -b \
./scripts/dbgrbyd.py -b4096 disk 0xb8 -t
(then b lfsr_bshrub_commit and continue a bunch)
---
So, we need to preserve the rby structure.
Note pruning red/yellow alts is not an issue. These aren't black, so we
aren't changing the number of black edges in the tree. We've just
effectively reduced a 3/4 node into a 2/3 node:
.-> a
.---b-> b .-> a <- 2 black
| .---> c .-b-> b
| | .-> d | .-> c
b-r-b-> e <- rm => b-b-> d <- 2 black
The tricky bit is pruning black alts. Naively this changes the number of
black edges/2-3-4 nodes in the tree, which is bad:
.-> a
.-b-> b .-> a <- 2 black
| .-> c .-b-> b
b-b-> d <- rm => b---> c <- 1 black
It's tempting to just make the alt red at this point, effectively
merging the sibling 2-3-4 node. This maintains balance in the subtree,
but still removes a black edge, causing problems for our parent:
.-> a
.-b-> b .-> a <- 3 black
| .-> c .-b-> b
.-b-b-> d | .-> c
| .-> e .-b-b-> d
| .-b-> f | .---> e
| | .-> g | | .-> f
b-b-b-> h <- rm => b-r-b-> g <- 2 black
In theory you could propagate this all the way up to the root, and this
_would_ probably give you a perfect self-balancing range removal
algorithm... but it's recursive... and littlefs can't be recursive...
.-> s
.-b-> t .-> s
| .-> u .-----b-> t
.-b-b-> v | .-> u
| .-> w | .---b-> v
| .-b-> x | | .---> w
| | | | .-> y | | | | | | | .-> x
b-b- ... b-b-b-> z <- rm => r-b-r-b- ... r-b-r-b-> y
So instead, an alternative solution. What if we allowed black alts that
point nowhere? A sort of noop 2-3-4 node that serves only to maintain
the rby structure?
.-> a
.-b-> b .-> a <- 2 black
| .-> c .-b-> b
b-b-> d <- rm => b-b-> c <- 2 black
I guess that would technically make this a 1-2-3-4 tree.
This does add extra overhead for writing noop alts, which are otherwise
useless, but it seems to solve most of our problems: 1. does not
increase the height of the tree, 2. maintains the rby structure, 3.
tail-recursive.
And, thanks to the preserved rby structure, we can say that in the worst
case our rbyds will never exceed height <= log b again, even with range
removals.
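To make the black-edge argument concrete, here is a minimal sketch in Python. The tuple model is purely hypothetical (it is not the rbyd on-disk format): a tree is a leaf string, None for an unreachable branch, or an alt written as (color, taken, skipped). It just counts black alts along every reachable search path:

```python
# Hypothetical model, not the rbyd format: a tree is a leaf (str), None
# (an unreachable branch), or an alt written as (color, taken, skipped).

def black_counts(tree, blacks=0):
    """Return the set of black-alt counts over all reachable leaves."""
    if tree is None:
        return set()            # noop branch, nothing reachable here
    if isinstance(tree, str):
        return {blacks}         # reached a leaf
    color, taken, skipped = tree
    if color == 'b':
        blacks += 1
    return black_counts(taken, blacks) | black_counts(skipped, blacks)

# a b c d, every path sees 2 black alts
before = ('b', ('b', 'a', 'b'), ('b', 'c', 'd'))
# remove d by pruning its black alt, c is now 1 black alt from the root
pruned = ('b', ('b', 'a', 'b'), 'c')
# remove d but keep a noop black alt that points nowhere
noop = ('b', ('b', 'a', 'b'), ('b', None, 'c'))

print(black_counts(before), black_counts(pruned), black_counts(noop))
```

A single-element set means all search paths agree on the number of black edges, which is the property append relies on.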
If we apply this strategy to our original example, you can see how the
preserved rby structure sort of "absorbs" new red alts, preventing
further unbalancing:
.-------o-------. .--------o
.---o---. .---o---. .---o---. o
.-o-. .-o-. .-o-. .-o-. .-o-. .-o-. o
.o. .o. .o. .o. .o. .o. .o. .o. .o. .o. .o. .o. o
a b c d e f g h i j k l m n o p => a b c d e f g h i
'------+------'
remove
Reinserting:
.--------o
.---o---. o
.-o-. .-o-. o
.o. .o. .o. .o. o
a b c d e f g h i
.----------------o
.---o---. o
.-o-. .-o-. .------o
.o. .o. .o. .o. .o. .-+-r
a b c d e f g h i j'k'l'm'
.----------------------------o
.---o---. .-------------o
.-o-. .-o-. .---o .---+-----r
.o. .o. .o. .o. .-o .-o .-o .-o .-+-r
a b c d e f g h i j'k'l'm'n'o'p'q'r's'
Much better!
---
This commit makes some big steps towards this solution, mainly codifying
a now-special alt-never/alt-always (altn/alta) encoding to represent
these noop 1 nodes.
Technically, since null (0) tags are not allowed, these already exist as
altle 0/altgt 0 and don't need any extra carve-out encoding-wise:
LFSR_TAG_ALT 0x4kkk v1dc kkkk -kkk kkkk
LFSR_TAG_ALTN 0x4000 v10c 0000 -000 0000
LFSR_TAG_ALTA 0x6000 v11c 0000 -000 0000
We actually already used altas to terminate unreachable tags during
range removals, but this behavior was implicit. Now, altns have very
special treatment as a part of determining bounds during appendattr
(both unreachable gt/le alts are represented as altns). For this reason
I think the new names are warranted.
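As a rough sketch of how a script might tell these apart, assuming the bit letters in the table above mean v=valid, d=direction (0=le, 1=gt), c=color, and k=key (these meanings are my reading of the pattern, and the constant names are made up for illustration):

```python
# Assumed bit meanings, inferred from the "v1dc kkkk -kkk kkkk" pattern.
TAG_ALT   = 0x4000   # the fixed 1 bit, marks an alt
TAG_GT    = 0x2000   # d: assumed 0=altle, 1=altgt
TAG_COLOR = 0x1000   # c: assumed color bit, ignored here
TAG_KEY   = 0x0fff   # k: the key the alt compares against

def classify_alt(tag):
    """Name an alt tag, treating a zero key as altn/alta."""
    if not tag & TAG_ALT:
        return 'not-alt'
    if tag & TAG_KEY:
        return 'altgt' if tag & TAG_GT else 'altle'
    # altle 0 can never be followed, altgt 0 is always followed
    return 'alta' if tag & TAG_GT else 'altn'

print(classify_alt(0x4000), classify_alt(0x6000), classify_alt(0x4123))
```

The valid and color bits are masked out, so a colored altn (0x5000) still classifies as altn.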
I've also added these encodings to the dbg*.py scripts for, well,
debuggability, and added a special case to dbgrbyd.py -j to avoid
unnecessary altn jump noise.
As a part of debugging, I've also extended dbgrbyd.py's tree renderer to
show trivial prunable alts. Unsure about keeping this. On one hand it's
useful to visualize the exact alt structure, on the other hand it likely
adds quite a bit of noise to the more complex dbg scripts.
The current state of things is a mess, but at least tests are passing!
Though we aren't actually reclaiming any altns yet... We're definitely
_not_ preserving the rby structure at the moment, and if you look at the
output from the tests, the resulting tree structure is hilariously bad.
But at least the path forward is clear.
This was throwing off tree rendering in dbglfs.py, where we attempt to
look up the null tag because we just want the first tag in the tree to
stitch things together.
Null tag reachability is tricky! You only notice if the tree happens to
create a hole, which isn't that common. I think all lookup
implementations should have this max(tag, 1) pattern from now on to
avoid this.
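A small sketch of why the clamp matters, assuming an altle is followed when tag <= key and an altgt when tag > key (the helper names here are hypothetical, not the lfs.c or dbg script functions):

```python
def alt_follows(is_gt, key, tag):
    """Assumed alt semantics: altle follows on tag <= key, altgt on tag > key."""
    return tag > key if is_gt else tag <= key

def lookup_tag(tag):
    """Clamp lookups so the null tag can never match an altle 0."""
    return max(tag, 1)

# an unreachable alt is encoded as altle 0, but a raw lookup of the
# null tag would follow it into the hole
print(alt_follows(False, 0, 0))              # follows the unreachable alt
print(alt_follows(False, 0, lookup_tag(0)))  # clamped, never follows
```

With the clamp, an altle 0 is never followed and an altgt 0 is always followed, regardless of what tag the caller asked for.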
Note that most dbg scripts wouldn't run into this because we usually use
the traversal tag+1 pattern. Still, the inconsistency in impl between
the dbg scripts and lfs.c is bad.
This saves a bit of rbyd overhead, since these almost always come
together.
Perhaps more interesting, it carves out space for storing mroot-anchor
redundancy information. This uses the lowest two bits of the GEOMETRY
tag to indicate how many redundant blocks belong to the mroot-anchor:
LFSR_TAG_GEOMETRY 0x0008 v--- ---- ---- 1-rr
This solves a bit of a hole in our redundancy encoding. The plan is for
this info to be stored in the lowest two bits of every pointer, but the
mroot-anchor doesn't really have a pointer.
Though this is just future plans. Right now the redundancy information
is unused. Current implementations should use the GEOMETRY tag 0x0009,
which you may notice implies redundancy level-1. This matches our
current 2-block per mdir default.
Geometry attr encoding:
.---+---+---+---. tag (0x0008+r): 1 be16 2 bytes
|x0008+r| 0 |siz| weight (0): 1 leb128 1 byte
+---+---+---+---+ size: 1 leb128 1 byte
| block_size | block_size: 1 leb128 <=4 bytes
+---+- -+- -+- -+- -.
| block_count | block_count: 1 leb128 <=5 bytes
'---+- -+- -+- -+- -' total: <=13 bytes
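For reference, a minimal Python sketch of this layout, following the diagram literally (be16 tag, then leb128 weight, size, block_size, block_count). The function names are made up, and this is not taken from lfs.c:

```python
def leb128(n):
    """Encode a non-negative integer as unsigned leb128."""
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_geometry(block_size, block_count, redund=1):
    """Encode a GEOMETRY attr, redund goes in the tag's lowest two bits."""
    if not 0 <= redund <= 3:
        raise ValueError('redund must fit in two bits')
    payload = leb128(block_size) + leb128(block_count)
    tag = (0x0008 + redund).to_bytes(2, 'big')
    return tag + leb128(0) + leb128(len(payload)) + payload

print(encode_geometry(4096, 256).hex())
```

A 28-bit block_size and a 32-bit block_count give the 4+5 byte worst case, for 13 bytes in total.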
Code changes:
         code          stack
before: 34092           2880
after:  34040 (-0.2%)   2880 (+0.0%)
Now with a bit more granularity for possibly-future-optional on-disk
data structures:
LFSR_RCOMPAT_NONSTANDARD 0x0001 ---- ---- ---- ---1 (reserved)
LFSR_RCOMPAT_MLEAF 0x0002 ---- ---- ---- --1-
LFSR_RCOMPAT_MSHRUB 0x0004 ---- ---- ---- -1-- (reserved)
LFSR_RCOMPAT_MTREE 0x0008 ---- ---- ---- 1---
LFSR_RCOMPAT_BSPROUT 0x0010 ---- ---- ---1 ----
LFSR_RCOMPAT_BLEAF 0x0020 ---- ---- --1- ----
LFSR_RCOMPAT_BSHRUB 0x0040 ---- ---- -1-- ----
LFSR_RCOMPAT_BTREE 0x0080 ---- ---- 1--- ----
LFSR_RCOMPAT_GRM 0x0100 ---- ---1 ---- ----
LFSR_WCOMPAT_NONSTANDARD 0x0001 ---- ---- ---- ---1 (reserved)
LFSR_OCOMPAT_NONSTANDARD 0x0001 ---- ---- ---- ---1 (reserved)
This adds a couple reserved flags:
- LFSR_*COMPAT_NONSTANDARD - This flag will never be set by a standard
version of littlefs. The idea is to allow implementations with
non-standard extensions a way to signal potential compatibility issues
without worrying about future compat flag conflicts.
This is limited to a single bit, but hey, it's not like it's possible
to predict all future extensions.
If a non-standard extension needs more granularity, reservations of
standard compat flags can always be requested, even if they don't end
up implemented in standard littlefs. (Though such reservations will
need a strong motivation, it's not like these flags are free).
- LFSR_RCOMPAT_MSHRUB - In theory littlefs supports a shrubbed mtree,
where the root is inlined into the mroot. But in practice this turned
out to be more complicated than it was worth. Still, a future
implementation may find an mshrub useful, so preserving a compat flag
for such a case makes sense.
That being said, I have no plans to add support for mshrubs even in
the dbg scripts.
I would like the expected feature-set for debug tools to be
well-defined, but also conservative. This gets a bit tricky with
theoretical features like the mshrubs, but until mshrubs are actually
implemented in littlefs, I would like to consider them non-standard.
The implication of this is that, while LFSR_RCOMPAT_MSHRUB is
currently "reserved", it may be repurposed for some other meaning in
the future.
These changes also rename *COMPATFLAGS -> *COMPAT, and reorder the tags
by decreasing importance. This ordering seems more valuable than the
original intention of making rcompat/wcompat a single bit flip.
Implementation-wise, it's interesting to note the internal-only
LFSR_*COMPAT_OVERFLOW flag. This gets set when out-of-range bits are set
on-disk, and allows us to detect unrepresentable compat flags without
too much extra complexity.
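Roughly, the overflow idea can be sketched like this. The flag values come from the table above, but the OVERFLOW bit position, the 15-bit window, and the set of "supported" flags are all assumptions for illustration:

```python
RCOMPAT_NONSTANDARD = 0x0001
RCOMPAT_MLEAF       = 0x0002
RCOMPAT_MSHRUB      = 0x0004
RCOMPAT_MTREE       = 0x0008
RCOMPAT_BSPROUT     = 0x0010
RCOMPAT_BLEAF       = 0x0020
RCOMPAT_BSHRUB      = 0x0040
RCOMPAT_BTREE       = 0x0080
RCOMPAT_GRM         = 0x0100
RCOMPAT_OVERFLOW    = 0x8000   # hypothetical internal-only bit

# assumed: a standard implementation understands everything not reserved
SUPPORTED = (RCOMPAT_MLEAF | RCOMPAT_MTREE | RCOMPAT_BSPROUT
    | RCOMPAT_BLEAF | RCOMPAT_BSHRUB | RCOMPAT_BTREE | RCOMPAT_GRM)

def decode_rcompat(raw):
    """Squash an arbitrarily large on-disk value into a 16-bit word."""
    flags = raw & 0x7fff
    if raw >> 15:
        flags |= RCOMPAT_OVERFLOW   # something we can't even represent
    return flags

def can_read(flags):
    """Readable only if every set flag is one we understand."""
    return flags & ~SUPPORTED == 0

print(can_read(decode_rcompat(0x01fa)), can_read(decode_rcompat(0x101fa)))
```

Since OVERFLOW is never in the supported set, out-of-range bits fail the same check as any other unknown flag.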
The extra encoding/decoding overhead does add a bit of cost though:
         code          stack
before: 33944           2880
after:  34124 (+0.5%)   2880 (+0.0%)
Now that we're assuming a perfect compaction algorithm, and an
infinitely compatible mleaf-bits, there really shouldn't be any reason
to support non-standard mleaf-bits in our scripts, right?
If a configurable mleaf-bits becomes necessary, we can always add this
back in the future.
As defined previously, mleaf-bits depended on the attr estimate, which
depended on the details of our compaction algorithm:
    block_size
m = ----------
       a_0
Assuming t=4, the _minimum_ tag encoding:
    block_size   block_size
m = ---------- = ----------
     3*4 + 4         16
However, with our new compaction algorithm, our attr estimate changes:
    block_size   block_size    block_size
m = ---------- = ----------- = ----------
       a_1       (5/2)*4 + 2       12
But tying our mleaf-bits to our attr estimate is a bit fragile. Unlike
attr estimate, the calculated mleaf-bits MUST be the same across all
littlefs implementations, or else the filesystem may not be mountable.
We _could_ store mleaf-bits as an fs attr in the mroot, like we do with
name-limit, size-limit, block-size, etc, but I'd prefer to not add fs
attrs unless strictly required. Each fs attr adds complexity to mounting,
which has a non-zero cost and headache.
Instead, we can assume our compaction algorithm is perfect:
    block_size   block_size   block_size
m = ---------- = ---------- = ----------
      a_inf         2*4           8
This isn't actually achievable without unbounded RAM. But just because
our current implementation is limited to bounded RAM, does not prevent
some other implementation from pushing things further with unbounded
RAM.
In theory, since this is a perfect compaction algorithm, and builds
perfect rbyd trunks, this should be the maximum possible mleaf-bits
achievable in littlefs's current design, and should be compatible with
any future implementation.
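The arithmetic above, as a quick Python check. The function names are made up, the integer a_1 is exact only for even t, and rounding m down is my assumption (the formulas don't say how a fractional m is handled):

```python
def attr_estimates(t=4):
    """Per-attr size estimates for tag size t, from the formulas above."""
    return {
        'a_0':   3*t + 4,          # previous estimate
        'a_1':   (5*t)//2 + 2,     # new compaction algorithm (exact for even t)
        'a_inf': 2*t,              # hypothetical perfect compaction
    }

def m_estimate(block_size, a):
    """m = block_size / a, rounded down (assumed)."""
    return block_size // a

for name, a in attr_estimates().items():
    print(name, a, m_estimate(4096, a))
```

For block_size=4096 the perfect-compaction estimate gives m=512, twice the original m=256.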
---
Worst case, we can always add mleaf-bits as an fs attr retroactively
without breaking backwards compatibility. You would just need to assume
the above block_size-dependent value if the hypothetical mleaf-bits attr
is missing.
This is one nice thing about our fs attr system, it's very flexible.
This is a simplification of the rbyd/btree layers, but implies
behavioral changes to the mtree/mdir layers.
Instead of ordering by leb128 did + name:
82 02 61 61 61 < 81 04 62 62 62
(0x102, "aaa") (0x201, "bbb")
We now order by the raw encoding, lexicographically:
82 02 61 61 61 > 81 04 62 62 62
(0x102, "aaa") (0x201, "bbb")
This may be unintuitive, but note:
1. Files _within_ a directory are still ordered, since they share a did
prefix.
2. We don't really care about the relative ordering of dids, just
that they are unique. Changing the ordering at this level does not
interfere with any of our did-related functions.
3. The only thing we may care about is that the root, did=0, is the
first mtree entry. This is still true. No leb128 encoding is < 0x00
even after encoding.
The motivation for this change is to allow for other named-btrees in the
system that may use non-did-prefixed names. At least one of these makes
sense for a sort of "content-tree" (cksum -> data block mapping).
As a plus, this change makes it possible to compare names and do btree
namelookups without needing to decode the leb128 prefix. Although I'm
struggling a bit to figure out exactly where this is useful...
One downside, this ordering only works if dids are always stored in
their canonical encoding, that is, the smallest leb128 encoding possible
for a given did. I think this is a reasonable requirement for just our
dids.
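The example above can be reproduced with a few lines of Python. This is only a sketch, and name_key is a made-up helper, not something from the dbg scripts:

```python
def leb128(n):
    """Canonical (shortest) unsigned leb128 encoding."""
    out = bytearray()
    while True:
        byte = n & 0x7f
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def name_key(did, name):
    """Raw on-disk name: leb128 did prefix followed by the name bytes."""
    return leb128(did) + name.encode()

a = name_key(0x102, 'aaa')
b = name_key(0x201, 'bbb')
print(a.hex(), b.hex())
# old ordering compares the decoded (did, name), new ordering the raw bytes
print((0x102, 'aaa') < (0x201, 'bbb'), a < b)
```

Note the low 7 bits of the did come first in leb128, which is why 0x102 sorts after 0x201 in the raw encoding.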
Another downside is this did add a decent chunk of code.
I did try limiting the changes to lfsr_data_namecmp, but it didn't have
much impact. I guess most of the cost comes from the reworked
lfsr_data_cmp function, which, to be fair, is quite a bit more
complicated now (it now supports limited data<=>data comparisons):
          code          stack
before:  34148           2896
namecmp: 34324 (+0.5%)   2896 (+0.0%)
after:   34340 (+0.6%)   2896 (+0.0%)
Previously, the intention of upper case -Z was to match -W/--width and
-H/--height, which are uppercase to avoid conflicts with -h/--help.
But -z/--depth isn't _really_ related to -W/-H.
This avoids a conflict with -Z/--lebesgue, but may conflict with
-z/--cat. Fortunately we don't currently have any conflicts with the
latter. Since -z/--depth and -Z/--lebesgue are both disk-layout related,
the risk of conflicts is probably much higher there.
So now these should be invoked like so:
$ ./scripts/dbglfs.py -b4096x256 disk
The motivation for this change is to better match other filesystem
tooling. Some prior art:
- mkfs.btrfs
- -n/--nodesize => node size in bytes, power of 2 >= sector
- -s/--sectorsize => sector size in bytes, power of 2
- zfs create
- -b => block size in bytes
- mkfs.xfs
- -b => block size in bytes, power of 2 >= sector
- -s => sector size in bytes, power of 2 >= 512
- mkfs.ext[234]
- -b => block size in bytes, power of 2 >= 1024
- mkfs.ntfs
- -c/--cluster-size => cluster size in bytes, power of 2 >= sector
- -s/--sector-size => sector size in bytes, power of 2 >= 256
- mkfs.fat
- -s => cluster size in sectors, power of 2
- -S => sector size in bytes, power of 2 >= 512
Why care so much about the flag naming for internal scripts? The
intention is for external tooling to eventually use the same set of
flags. And maybe even create publicly consumable versions of the dbg
scripts. It's important that if/when this happens flags stay consistent.
Everyone familiar with the ssh -p/scp -P situation knows how annoying
this can be.
It's especially important for littlefs's -b/--block-size flag, since
this will likely end up used everywhere. Unlike other filesystems,
littlefs can't mount without knowing the block-size, so any tool that
mounts littlefs is going to need the -b/--block-size flag.
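For tools that want to accept the same -b4096x256 form, a minimal parsing sketch. This is not the scripts' actual parser, just an assumption of how a size-by-count argument could be split, and it only handles decimal numbers:

```python
import argparse

def parse_geometry(arg):
    """Parse '<block_size>' or '<block_size>x<block_count>' (decimal only)."""
    size, sep, count = arg.partition('x')
    return int(size), (int(count) if sep else None)

parser = argparse.ArgumentParser()
parser.add_argument('disk')
parser.add_argument('-b', '--block-size', type=parse_geometry)

args = parser.parse_args(['-b4096x256', 'disk'])
print(args.block_size)
```

Leaving off the block count yields None, so a tool can fall back to whatever it would otherwise infer.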
---
The original motivation for -B was to avoid conflicts with the -b/--by
flag that was already in use in all of the measurement scripts. But
these are internal, and not really littlefs-related, so I don't think
that's a good reason any more. Worst case we can just make the --by flag
-B, or just not have a short form (--by is only 4 letters after all).
Somehow we ended up with no scripts needing both -b/--block-size and
-b/--by so far.
Some other conflict/inconsistency tweaks were needed, here are all
the flag changes:
- -B/--block-size -> -b/--block-size
- -M/--mleaf-weight -> -m/--mleaf-weight
- -b/--btree -> -B/--btree
- -C/--block-cycles -> -c/--block-cycles (in tracebd.py)
- -c/--coalesce -> -S/--coalesce (in tracebd.py)
- -m/--mdirs -> -M/--mdirs (in dbgbmap.py)
- -b/--btrees -> -B/--btrees (in dbgbmap.py)
- -d/--datas -> -D/--datas (in dbgbmap.py)
Shrubness should have always been a property of lfsr_rbyd_t.
You know you've made a good design decision when things just sort of
fall into place and the code somehow becomes cleaner.
The downside of this change is accessing rbyd trunks requires a mask,
which is annoying, but the upside is we don't need to signal shrubness
via extra booleans in internal functions anymore.
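The mask tradeoff looks something like this in sketch form. The bit position and helper names are hypothetical, the real field layout in lfsr_rbyd_t may differ:

```python
SHRUB_BIT = 0x80000000   # hypothetical: high bit of a 32-bit trunk field

def make_trunk(trunk, shrub=False):
    """Pack a trunk offset and the shrub flag into one field."""
    return trunk | (SHRUB_BIT if shrub else 0)

def trunk_of(field):
    """Every trunk access now needs this mask."""
    return field & ~SHRUB_BIT

def is_shrub(field):
    """But shrubness travels with the rbyd, no extra boolean needed."""
    return bool(field & SHRUB_BIT)

field = make_trunk(0x1234, shrub=True)
print(hex(trunk_of(field)), is_shrub(field))
```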
The funny thing is, the actual motivation for this change was just to
free up a bit in our tag encoding. Simplifying some of the internal
functions was just a nice side effect.
         code          stack
before: 33940           2928
after:  33928 (-0.0%)   2912 (-0.5%)
I was originally avoiding naming these orphans, as they're _technically_
not orphans. They do exist in the mtree. But the name orphan just
describes this type's purpose too well.
This does lead to some confusing terms, such as the fact that orphan
files can be non-orphaned if there are any in-device references. But I
think this makes sense?
- LFSR_TAG_SCRATCH -> LFSR_TAG_ORPHAN
- LFSR_F_UNCREAT -> LFSR_F_ORPHAN
- test_fscratch.toml -> test_forphan.toml
"Scratch files" are a new file type added to solve the zero-sized
file problem. Though they have a few other uses that may be quite
valuable.
The "zero-sized file problem" is a common surprise for users, where what
seems like a simple file create+write operation:
lfs_file_open(&lfs, &file, "hi",
LFS_O_WRONLY | LFS_O_CREAT | LFS_O_EXCL);
lfs_file_write(&lfs, &file, "hello!", strlen("hello!"));
lfs_file_close(&lfs, &file);
Can end up creating a zero-sized file under powerloss, breaking user
assumptions and their code.
The tricky thing is that this is actually correct behavior as defined by
POSIX. `open` with O_CREAT creates a file entry immediately, which is
initially zero-sized. And the fact that power can be lost between `open`
and `close` isn't really avoidable.
But this is a common enough footgun that it's probably worth deviating
from POSIX here.
But how to avoid zero-sized files exactly? First thought: Delay the file
creation until sync/close, tracking uncreated files in-device until
then. This solves the problem and avoids any intermediary state if we
lose power, but came with a number of headaches:
1. Since we delay file creation, we don't immediately write the filename
to disk on open. This implies we need to keep the filename allocated
in RAM until the first sync/close call.
The requirement to keep the filename allocated for new files until
first sync/close could be added to open, and with the option to call
sync immediately to save the filename (and accept the risk of
zero-sized files), I don't think it would be _that_ bad of an API.
But it would still be pretty bad. Extra bad because 1. there's no
way to warn on misuse at compile-time, 2. use-after-free bugs have a
tendency to go unnoticed annoyingly often, 3. it's a regression from
the previous API, and 4. who the heck reads the more-or-less same
`open` documentation for every filesystem they adopt.
2. Without an allocated mid, tracking files internally gets a lot
harder. The best option I could think of was to keep the opened-file
linked-list sorted by mid + (in-device) file name.
This did not feel like a great solution and was going to add more
code cost.
3. Handling mdir splits containing uncreated files adds another
headache. This complicates lfsr_mdir_estimate further, as it needs to
decide in which mdir the uncreated files will end up, and potentially
split on a filename that isn't even created yet.
4. Since the number of uncreated files can be potentially unbounded, you
can't prevent an mdir from filling up with only uncreated files. On
disk this ends up looking like an "empty" mdir, which needs special
handling in littlefs to reclaim after powerloss.
Support for empty mdirs -- the orphaned mdir scan -- was already
added earlier. We already scan each mdir to build gstate, so it
doesn't really add much cost.
Notice that last bullet point? We already scan each mdir during mount.
Why not, instead of scanning for orphaned mdirs, scan for orphaned
files?
So this leads to the idea of "scratch files". Instead of actually
delaying file creation, fake it. Create a scratch file during open, and
on the first sync/close, convert it to a regular file. If we lose power,
scan for scratch files during mount, and remove them on first write.
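A toy model of that lifecycle, just to show the states involved. The "disk" is a plain dict and every name here is made up, nothing below reflects the real on-disk tags or the lfs.c API:

```python
# Toy model: the "disk" maps filename -> (type, data).

def file_open_creat(disk, name):
    """Open creates a scratch entry immediately, so the name is on disk."""
    disk[name] = ('scratch', b'')

def file_sync(disk, name, data):
    """The first sync/close converts the scratch file to a regular file."""
    disk[name] = ('reg', data)

def mount_scan(disk):
    """Mount only finds scratch files, removal waits for the first write."""
    return {name for name, (type_, _) in disk.items() if type_ == 'scratch'}

def first_write_cleanup(disk, orphans):
    """Lazily remove scratch files found during mount."""
    for name in orphans:
        del disk[name]

# power lost between open and close: no zero-sized file survives
disk = {}
file_open_creat(disk, 'hi')
orphans = mount_scan(disk)
first_write_cleanup(disk, orphans)
lost = dict(disk)

# no power loss: the file appears with its contents in one step
file_open_creat(disk, 'hi')
file_sync(disk, 'hi', b'hello!')
print(lost, disk)
```

Either the file never existed, or it exists with its data. The zero-sized intermediate is never visible as a regular file.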
Some tradeoffs:
1. The orphan scan for scratch files is a bit more expensive than for
mdirs on storage with large block sizes. We need to look at each file
entry vs just each mdir, which pushes the runtime up to O(BlogB) vs
O(B).
Though if you also consider large mtrees, the worst case is still
O(nlogn).
2. Creating intermediate scratch files adds another commit to file
creation.
This is probably not a big issue for flash, but may be more of a
concern on devices with large prog sizes.
3. Scratch files complicate unrelated mkdir/rename/etc code a bit, since
we need to consider what happens when the dest is a scratch file.
But the end result is simple. And simple is good. Both for
implementation headaches, and code size. Even if the on-disk state is
conceptually more complicated.
You may have noticed these scratch files are basically isomorphic to
just setting an "uncreated" flag on the file, and that's true. There may
have been a simpler route to end up with the design, but hey, as long as
it works.
As a plus, scratch files present a solution for a couple other things:
1. Removing an open file can become a scratch file until closed.
2. Scratch files can be used as temporary files. Open a file with
O_DESYNC and never call sync and you have yourself a temporary file.
Maybe in the future we should add O_TMPFILE to avoid the need for
unique filenames, but that is low priority.
Much like the erased-state checksums in our rbyds (ecksums), these
block-level erased-state checksums (becksums) allow us to detect failed
progs to erased parts of a block and are key to achieving efficient
incremental write performance with large blocks and frequent power
cycles/open-close cycles.
These are also key to achieving _reasonable_ write performance for
simple writes (linear, non-overwriting), since littlefs now relies
solely on becksums to efficiently append to blocks.
Though I suppose the previous block staging logic used with the CTZ
skip-list could be brought back to make becksums optional and avoid
btree lookups during simple writes (we do a _lot_ of btree
lookups)... I'll leave this open as a future optimization...
Unlike in-rbyd ecksums, becksums need to be stored out-of-band so our
data blocks only contain raw data. Since they are optional, an
additional tag in the file's btree makes sense.
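The idea, sketched in Python. The (cksize, cksum) pair and the use of crc32 are assumptions for illustration, not the actual becksum encoding:

```python
import zlib

def make_becksum(block, off, cksize):
    """Checksum the still-erased region starting at off."""
    return (cksize, zlib.crc32(block[off:off+cksize]))

def erased_ok(block, off, becksum):
    """True if the region after off still looks like it did when erased."""
    cksize, cksum = becksum
    return zlib.crc32(block[off:off+cksize]) == cksum

# 4 bytes of data followed by erased (0xff) flash
block = bytearray(b'data' + b'\xff'*12)
becksum = make_becksum(block, 4, 8)
intact = erased_ok(block, 4, becksum)

# a prog interrupted by powerloss leaves garbage in the erased region
block[5] = 0x00
after_failed_prog = erased_ok(block, 4, becksum)
print(intact, after_failed_prog)
```

If the check passes we can append in place, otherwise the data needs to go to a fresh block. Because the checksum lives in the btree, the block itself stays raw data.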
Becksums are relatively simple, but they bring some challenges:
1. Adding becksums to file btrees is the first case we have for multiple
struct tags per btree id.
This isn't too complicated a problem, but requires some new internal
btree APIs.
Looking forward, which I probably shouldn't be doing this often,
multiple struct tags will also be useful for parity and content ids
as a part of data redundancy and data deduplication, though I think
it's uncontroversial to consider both of these heavier-weight features...
2. Becksums only work if unfilled blocks are aligned to the prog_size.
This is the whole point of crystal_size -- to provide temporary
storage for unaligned writes -- but actually aligning the block
during writes turns out to be a bit tricky without a bunch of
unnecessary btree lookups (we already do too many btree lookups!).
The current implementation here discards the pcache to force
alignment, taking advantage of the requirement that
cache_size >= prog_size, but this is corrupting our block checksums.
Code cost:
         code          stack
before: 31248           2792
after:  32060 (+2.5%)   2864 (+2.5%)
Also lfsr_ftree_flush needs work. I'm usually open to gotos in C when
they improve internal logic, but even for me, the multiple goto jumps
from every left-neighbor lookup into the block writing loop is a bit
much...
Instead of writing every possible config that has the potential to be
useful in the future, stick to just writing the configs that we know are
useful, and error if we see any configs we don't understand.
This prevents unnecessary config bloat, while still allowing configs to
be introduced in a backwards compatible way in the future.
Currently unknown configs are treated as a mount error, but in theory
you could still try to read the filesystem, just with potentially
corrupted data. Maybe this could be behind some sort of "FORCE" mount
flag. littlefs must never write to the filesystem if it finds unknown
configs.
---
This also creates a curious case for the hole in our tag encoding
previously taken up by the OCOMPATFLAGS config. We can query for any
config > SIZELIMIT with lookupnext, but the OCOMPATFLAGS flag would need
an extra lookup which just isn't worth it.
Instead I'm just adding OCOMPATFLAGS back in. To support OCOMPATFLAGS
littlefs has to do literally nothing, so this is really more of a
documentation change. And who knows, maybe OCOMPATFLAGS will have some
weird use case in the future...
Also:
- Renamed GSTATE -> GDELTA for gdelta tags. GSTATE tags added as
separate in-device flags. The GSTATE tags were already serving
this dual purpose.
- Renamed BSHRUB* -> SHRUB when the tag is not necessarily operating
on a file bshrub.
- Renamed TRUNK -> BSHRUB
The tag encoding space now has a couple funky holes:
- 0x0005 - Hole for aligning config tags.
I guess this could be used for OCOMPATFLAGS in the future?
- 0x0203 - Hole so that ORPHAN can be a 1-bit difference from REG. This
could be after BOOKMARK, but having a bit to differentiate littlefs
specific file types (BOOKMARK, ORPHAN) from normal file types (REG,
DIR) is nice.
I guess this could be used for SYMLINK if we ever want symlinks in the
future?
- 0x0314-0x0318 - Hole so that the mdir related tags (MROOT, MDIR,
MTREE) are nicely aligned.
This is probably a good place for file-related tags to go in the
future (BECKSUM, CID, COMPR), but we only have two slots, so will
probably run out pretty quickly.
- 0x3028 - Hole so that all btree related tags (BTREE, BRANCH, MTREE)
share a common lower bit-pattern.
I guess this could be used for MSHRUB if we ever want mshrubs in the
future?
I'm just not seeing a use case for optional compat flags (ocompat), so
dropping for now. It seems their *nix equivalent, feature_compat, is
used to inform fsck of things, but this doesn't really make sense in
littlefs since there is no fsck. Or from a different perspective,
littlefs is always running fsck.
Ocompat flags can always be added later (since they do nothing).
Unfortunately this really ruins the alignment of the tag encoding. For
whatever reason config limits tend to come in pairs. For now the best
solution is just leave tag 0x0006 unused. I guess you can consider it
reserved for hypothetical ocompat flags in the future.
---
This adds an rcompat flag for the grm, since in theory a filesystem
doesn't need to support grms if it never renames files (or creates
directories?). But if a filesystem doesn't support grms and a grm gets
written into the filesystem, this can lead to corruption.
I think every piece of gstate will end up with its own compat flag for
this reason.
---
Also renamed r/w/oflags -> r/w/ocompatflags to make their purpose
clearer.
---
The code impact of adding the grm rcompat flag is minimal, and will
probably be less for additional rcompat flags:
         code          stack
before: 31528           2752
after:  31584 (+0.2%)   2752 (+0.0%)
It turned out by implicitly handling root allocation in
lfsr_btree_commit_, we were never allowing lfsr_bshrub_commit to
intercept new roots as new bshrubs. Fixing this required moving the
root allocation logic up into lfsr_btree_commit.
This resulted in quite a bit of small bug fixing because it turns out if
you can never create non-inlined bshrubs you never test non-inlined
bshrubs:
- Our previous rbyd.weight == btree.weight check for if we've reached
the root no longer works, changed to an explicit check that the blocks
match. Fortunately, now that new roots set trunk=0 new roots are no
longer a problematic case.
- We need to only evict when we calculate an accurate estimate, the
previous code had a bug where eviction occurred early based only on the
progged-since-last-estimate.
- We need to manually set bshrub.block=mdir.block on new bshrubs,
otherwise the lfsr_bshrub_isbshrub check fails in mdir commit staging.
Also updated btree/bshrub following code in the dbg scripts, which
mostly meant making them accept both BRANCH and SHRUBBRANCH tags as
btree/bshrub branches. Conveniently very little code needs to change
to extend btree read operations to support bshrubs.
Note this is intentionally different from how lfsr_rbyd_fetch behaves
in lfs.c. We only call lfsr_rbyd_fetch when we need validated checksums,
otherwise we just don't fetch.
The dbg scripts, on the other hand, always go through fetch, but it is
useful to be able to inspect the state of incomplete trunks when
debugging.
This used to be how the dbg scripts behaved, but they broke because of
some recent script work.
dbgbmap.py parses littlefs's mtree/btrees and displays the status of
every block in use:
$ ./scripts/dbgbmap.py disk -B4096x256 -Z -H8 -W64
bd 4096x256, 7.8% mdir, 10.2% btree, 78.1% data
mmddbbddddddmmddddmmdd--bbbbddddddddddddddbbdddd--ddddddmmdddddd
mmddddbbddbbddddddddddddddddbbddddbbddddddmmddbbdddddddddddddddd
bbdddddddddddd--ddddddddddddddddbbddddmmmmddddddddddddmmmmdddddd
ddddddddddbbdddddddddd--ddddddddddddddmmddddddddddddddddddddmmdd
ddddddbbddddddddbb--ddddddddddddddddddddbb--mmmmddbbdddddddddddd
ddddddddddddddddddddbbddbbdddddddddddddddddddddddddddddddddddddd
dddddddddd--ddddbbddddddddmmbbdd--ddddddddddddddbbmmddddbbdddddd
ddmmddddddddddmmddddddddmmddddbbbbdddddddd--ddbbddddddmmdd--ddbb
(ok, it looks a bit better with colors)
dbgbmap.py matches the layout and has the same options as tracebd.py,
allowing the combination of both to provide valuable insight into what
exactly littlefs is doing.
This required a bit of tweaking of tracebd.py to get right, mostly
around conflicting order-based arguments. This also reworks the internal
Bmap class to be more resilient to out-of-window ops, and adds an
optional informative header.
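The core of the rendering is just scaling blocks onto a character grid. A minimal sketch, assuming a simple linear layout (no -Z/--lebesgue curve, no colors, no header), with a made-up function name rather than the script's Bmap class:

```python
def render_bmap(blocks, width, height):
    """Map each character cell onto a block, '-' for unused blocks."""
    cells = width*height
    chars = [
        blocks[i*len(blocks)//cells] or '-'
        for i in range(cells)]
    return [''.join(chars[y*width:(y+1)*width]) for y in range(height)]

# m=mdir, d=data, b=btree, None=unused
for line in render_bmap(['m', 'd', 'b', None], 8, 1):
    print(line)
```

With 256 blocks on a 64x8 grid each block covers two cells, which is why the characters in the output above come in pairs.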