The main motivation for this was the difficulty of fitting a good tag
encoding into 14-bits. The extra 2 bits (though really only 1 bit was
needed) from making this not a leb encoding open up the space from 3
suptypes to 15 suptypes, which is nothing to shake a stick at.
The main downsides:
1. We can't rely on leb encoding for effectively-infinite extensions.
2. We can't shorten small tags (crcs, grows, shrinks) to one byte.
For 1., extending the leb encoding beyond 14-bits is already
unpalatable, because it would increase RAM costs in the tag
encoder/decoder, which must assume a worst-case tag size, and would likely
add storage cost to every alt pointer; more on this in the next section.
The current encoding is quite generous, so I think it is unlikely we
will exceed the 16-bit encoding space. But even if we do, it's possible
to use a spare bit for an "extended" set of tags in the future.
As for 2., the lack of compression is a downside, but I've realized the
only tags that really matter storage-wise are the alt pointers. In any
rbyd there will be roughly O(m log m) alt pointers, but at most O(m) of
any other tags. What this means is that the encoding of any other tag is
in the noise of the encoding of our alt pointers.
Our alt pointers are already pretty densely packed. But because the
sparse key part of alt-pointers is stored as-is, the worst-case
encoding of in-tree tags likely ends up as the encoding of our
alt-pointers. So going up to 3-byte tags adds a surprisingly large
storage cost.
As a minor plus, le16s should be slightly cheaper to encode/decode. It
should also be slightly easier to debug tags on-disk.
tag encoding:
TTTTtttt ttttTTTv
^--------^--^^- 4+3-bit suptype
'---|- 8-bit subtype
'- valid bit
iiii iiiiiii iiiiiii iiiiiii iiiiiii
^- m-bit id/weight
llll lllllll lllllll lllllll lllllll
^- m-bit length/jump
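For concreteness, here is a rough sketch of packing and unpacking this
layout; the exact bit positions are just one reading of the diagram above
(msb-first), not necessarily the final encoding:

#include <stdint.h>
#include <stdbool.h>

// a sketch of the le16 tag layout, assuming suptype sits in bits 15..12 and
// 3..1, subtype in bits 11..4, and the valid bit in bit 0 (an assumption
// based on the diagram, not the final layout)
static inline uint16_t tag_make(uint8_t suptype, uint8_t subtype, bool valid) {
    return ((uint16_t)(suptype & 0x78) << 9)  // high 4 bits of 7-bit suptype
         | ((uint16_t)subtype << 4)           // 8-bit subtype
         | ((uint16_t)(suptype & 0x7) << 1)   // low 3 bits of 7-bit suptype
         | valid;                             // valid bit
}

static inline uint8_t tag_suptype(uint16_t tag) {
    return ((tag >> 9) & 0x78) | ((tag >> 1) & 0x7);
}

static inline uint8_t tag_subtype(uint16_t tag) {
    return (tag >> 4) & 0xff;
}

static inline bool tag_isvalid(uint16_t tag) {
    return tag & 1;
}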
Also renamed the "mk" tags, since they no longer have special behavior
outside of providing names for entries:
- LFSR_TAG_MK => LFSR_TAG_NAME
- LFSR_TAG_MKBRANCH => LFSR_TAG_BNAME
- LFSR_TAG_MKREG => LFSR_TAG_REG
- LFSR_TAG_MKDIR => LFSR_TAG_DIR
This really just required care around calculating the expected B-tree id
and rbyd id (which are different!).
B-tree append, aka B-tree push with id=weight, is actually the outlier.
We need a B-tree id that can identify the rbyd we're appending to, but
this id itself doesn't exist in the tree yet, which can be a bit tricky.
If we combine rbyd ids and B-tree weights, we need 32-bit ids since this
will eventually need to cover the full range of a file. This simply
doesn't fit into a single word anymore, unless littlefs uses 64-bit tags.
Generally not a great idea for a filesystem targeting even 8-bit
microcontrollers.
So here is a tag encoding that uses 3 leb128 words. This will likely
have more code cost and slightly more disk usage (we can no longer fit
tags into 2 bytes), though with most tags being alt pointers (O(m log m)
vs O(m)), this may not be that significant.
Note that we try to keep tags limited to 14-bits to avoid an extra leb128 byte,
which would likely affect all alt pointers. To pull this off we do away
with the subtype/suptype distinction, limiting in-tree tag types to
10-bits encoded on a per-suptype basis:
in-tree tags:
ttttttt ttt00rv
^--^^- 10-bit type
'|- removed bit
'- valid bit
iiii iiiiiii iiiiiii iiiiiii iiiiiii
^- n-bit id
lllllll lllllll lllllll lllllll
^- m-bit length
out-of-tree tags:
ttttttt ttt010v
^---^- 10-bit type
'- valid bit
0000000
lllllll lllllll lllllll lllllll
^- m-bit length
alt tags:
kkkkkkk kkk1dcv
^-^^^- 10-bit key
'||- direction bit
'|- color bit
'- valid bit
wwww wwwwwww wwwwwww wwwwwww wwwwwww
^- n-bit weight
jjjjjjj jjjjjjj jjjjjjj jjjjjjj
^- m-bit jump
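To make the size claim concrete, here is a minimal unsigned leb128 writer
sketch (not littlefs's actual encoder); a tag word kept to 14 bits always
fits in two bytes, while anything larger spills into a third:

#include <stdint.h>
#include <stddef.h>

// minimal unsigned leb128 writer: 7 payload bits per byte, continuation bit
// in the msb of each byte
static size_t leb128_write(uint8_t *buf, uint32_t word) {
    size_t len = 0;
    do {
        uint8_t b = word & 0x7f;
        word >>= 7;
        buf[len++] = b | (word ? 0x80 : 0);  // more payload follows?
    } while (word);
    return len;
}

// for example, an alt tag word "kkkkkkk kkk1dcv" packed as
// (key << 4) | (1 << 3) | (d << 2) | (c << 1) | v is at most 14 bits for any
// 10-bit key, so leb128_write emits exactly 2 bytes for it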
The real pain is that with separate integers for id and tag, it no
longer makes sense to combine these into one big weight field. This
requires a significant rewrite.
- Unless there is a bug, rbyd trees should be strictly <= (2*log2(n)+1)
in height. The extra +1 over traditional red-black trees is due to the
introduced to-be-pruned alt, but since we clean those up as soon as we
can, only one will ever exist in any search path.
This also holds true with range deletion; however, the definition of n
changes to the number of tree operations. The same is true for tombstoning.
- Delete-all recovery is a bit tricky because we have no tree at that
point, which is weird for an append-only data-structure.
- primitive lfs_rbyd_fetch
- primitive lfs_rbyd_commit
- tag reading/progging and encoding machinery
The tag encoding scheme here uses pairs of leb128s, encoding either
a normal tag:
iiii iiiiiii iiiiiTT TTTTTTt ttttt0v
^--------^------^-^- 16-bit id
'------|-|- 8-bit type2
'-|- 6-bit type1
'- valid bit
llll lllllll lllllll lllllll lllllll
^- n-bit length
Or an alt pointer:
wwww wwwwwww wwwwwww wwwwwww wwwcd1v
^^^-^- 28-bit weight
'|-|- color bit
'-|- direction bit
'- valid bit
jjjj jjjjjjj jjjjjjj jjjjjjj jjjjjjj
^- n-bit jump
Note that two bits overlap the alt pointer dir/color encoding. This is
actually not a problem at all, since some tags (crcs/fcrcs) don't
participate in the rbyd tree and can use these bits.
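A rough sketch of how a reader might pull apart the first (already
leb128-decoded) word under this scheme, just following the diagrams above:

#include <stdint.h>
#include <stdbool.h>

// bit 0 is the valid bit, bit 1 selects between a normal tag (0) and an
// alt pointer (1); field extraction follows the diagrams above
static void tag_classify(uint32_t word) {
    bool valid = word & 0x1;
    if (word & 0x2) {
        // alt pointer: direction/color sit just above the marker bit,
        // the remaining 28 bits are the weight
        bool dir        = word & 0x4;
        bool color      = word & 0x8;
        uint32_t weight = word >> 4;
        (void)valid; (void)dir; (void)color; (void)weight;
    } else {
        // normal tag: 6-bit type1, 8-bit type2, 16-bit id
        uint32_t type1 = (word >>  2) & 0x3f;
        uint32_t type2 = (word >>  8) & 0xff;
        uint32_t id    = (word >> 16) & 0xffff;
        (void)valid; (void)type1; (void)type2; (void)id;
    }
}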
There are a number of benefits to using leb128s, which should probably
be written about; most notable is the abstraction of the device's
word-size. The "n-bits" above can be whatever word size works on the
device, trading off code-size for storage capabilities without breaking
compatibility with other devices. This will eventually be negotiated via
the superblock.
crc32c, with the polynomial 0x11edc6f41, is generally numerically superior
to the more common crc32 standard, catching more bit errors across
a wider range of message sizes (except 2-bit errors) without any changes
to the underlying algorithm.
Philip Koopman has a large body of work exploring optimal polynomials here:
http://users.ece.cmu.edu/~koopman/crc/crc32.html
And from his experiments we know the maximum message size where we can
still detect a given number of bit errors for each polynomial:
bit errors:           1  2           3           4     5     6    7    8   9   10  11  12  13  14  15  16  17
crc32  0x104c11db7 =  ∞  4294967263  91607       2974  268   171  91   57  34  21  12  10  10  10  -   -   -
crc32c 0x11edc6f41 =  ∞  2147483615  2147483615  5243  5243  177  177  47  47  20  20  8   8   6   6   1   1
So really crc32c should be preferred where possible. Koopman also has
alternative polynomials with slightly different properties, but crc32c
is already popular enough to have a decent amount of hardware support.
---
Another nice feature of crc32c is that its polynomial has even parity.
It turns out that even-parity polynomials give us the nifty property
parity(crc(m)) == parity(m).
A quick proof:
crc(m)    = m(x) x^(|P|-1) mod P(x)
parity(m) = m(x) x mod (x+1)
though note: x mod (x+1) = 1, by hand
so:
parity(m) = m(x)*1 mod (x+1)
          = m(x) mod (x+1)
solving for parity(crc(m)):
parity(crc(m)) = (m(x) x^(|P|-1) mod P(x)) mod (x+1)
note: (a mod b) mod c = a mod c, if c divides b,
aka   (a mod b) mod c = a mod c, if b mod c = 0
so if P(x) mod (x+1) = 0,
aka if parity(P) = 0:
parity(crc(m)) = m(x) x^(|P|-1) mod (x+1)
but, like before: x^(|P|-1) mod (x+1) = 1, by hand
so:
parity(crc(m)) = m(x)*1 mod (x+1)
               = m(x) mod (x+1)
               = parity(m)
so if parity(P) = 0:
parity(crc(m)) = parity(m)
This has the potential to replace the 1-bit counter in the metadata tags
with a more general solution that doesn't require extra state.
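This is easy to sanity-check with a small standalone program. Note this is
not littlefs's lfs_crc; it is a plain msb-first CRC with no init or final
xor, so it matches the math above directly:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// plain polynomial crc: crc(m) = m(x) x^32 mod P(x), no init, no final xor
static uint32_t crc32c_plain(const uint8_t *buf, size_t len) {
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)buf[i] << 24;
        for (int b = 0; b < 8; b++) {
            crc = (crc & 0x80000000) ? (crc << 1) ^ 0x1edc6f41 : (crc << 1);
        }
    }
    return crc;
}

// parity = xor of all bits
static int parity32(uint32_t x) {
    x ^= x >> 16; x ^= x >> 8; x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return x & 1;
}

int main(void) {
    const char *m = "hello world";
    uint8_t p = 0;
    for (size_t i = 0; i < strlen(m); i++) {
        p ^= (uint8_t)m[i];
    }
    // both lines print the same value, since 0x11edc6f41 has even parity
    printf("parity(m)      = %d\n", parity32(p));
    printf("parity(crc(m)) = %d\n", parity32(crc32c_plain((const uint8_t*)m, strlen(m))));
    return 0;
}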
The logic for endianness conversion was wrong when LFS_NO_INTRINSICS was
set, since on an endianness match the check of that macro would prevent the
unchanged value from being returned.
Signed-off-by: Carles Cufi <carles.cufi@nordicsemi.no>
The main changes here from the previous test framework design are:
1. Powerloss testing remains in-process, speeding up testing.
2. The state of a test, including all powerlosses, is encoded in the
test id + leb16 encoded powerloss string. This means exhaustive
testing can be run in CI, but then easily reproduced locally with
full debugger support.
For example:
./scripts/test.py test_dirs#reentrant_many_dir#10#1248g1g2 --gdb
Will run the test suite test_dirs, case reentrant_many_dir, permutation
#10, with powerlosses at 1, 2, 4, 8, 16, and 32 cycles, dropping into gdb
if an assert fails.
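For the curious, here is one plausible reading of the leb16 powerloss
string, reverse-engineered purely from the example above, so the exact
alphabet and continuation rule are assumptions: each character carries a
nibble, little-endian, with '0'-'f' terminating a value and 'g'-'v'
continuing it:

#include <stdio.h>

int main(void) {
    const char *s = "1248g1g2";
    unsigned v = 0, shift = 0;
    for (const char *p = s; *p; p++) {
        unsigned nibble;
        int cont;
        if (*p >= '0' && *p <= '9')      { nibble = *p - '0';      cont = 0; }
        else if (*p >= 'a' && *p <= 'f') { nibble = *p - 'a' + 10; cont = 0; }
        else                             { nibble = *p - 'g';      cont = 1; }
        v |= nibble << shift;
        shift += 4;
        if (!cont) {
            printf("%u ", v);  // prints: 1 2 4 8 16 32
            v = 0;
            shift = 0;
        }
    }
    printf("\n");
    return 0;
}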
The changes to the block-device are a work-in-progress for a
lazily-allocated/copy-on-write block device that I'm hoping will keep
exhaustive testing relatively low-cost.
This is useful for testing the new erroring assert behavior in CI.
Asserts do not error by default, so this macro needs to be overridden.
It is possible to test this behavior using the existing option of
overriding lfs_util.h with a custom file, by using a small sed
one-line script. But this is much simpler.
This does raise the question of whether more of the configuration options
in lfs_util.h should be opened up for function-like macro overrides.
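Assuming the macro in question is LFS_ASSERT, a function-like override
might look roughly like the following sketch, where test_assert_hook is a
hypothetical function that records the failure and lets the caller return
an error:

// sketch of a function-like LFS_ASSERT override; test_assert_hook is
// hypothetical and would report the failure instead of aborting
#define LFS_ASSERT(test) do { \
    if (!(test)) { \
        test_assert_hook(__FILE__, __LINE__, #test); \
    } \
} while (0)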
__VA_ARGS__ are frustrating in C. Even for their main purpose (printf),
they fall short in that they don't have a _portable_ way to have zero
arguments after the format string in a printf call.
Even if we detect compilers and use ##__VA_ARGS__ where available, GCC
emits a warning with -pedantic that is _impossible_ to explicitly
disable.
This commit contains the best solution we can think of: a bit of
indirection that adds a hidden "%s" format specifier and a "" argument to
the end of the format string. This solution does not work everywhere as it
has a runtime cost, but it is hopefully ok for debug statements.
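Roughly, the indirection looks like this, with printf standing in for
whatever output function is actually configured:

#include <stdio.h>

// the extra "" argument means __VA_ARGS__ is never empty, and the trailing
// "%s" in the format string consumes it at runtime
#define LFS_TRACE_(fmt, ...) \
    printf("%s:%d:trace: " fmt "%s\n", __FILE__, __LINE__, __VA_ARGS__)
#define LFS_TRACE(...) LFS_TRACE_(__VA_ARGS__, "")

// both of these now compile portably:
// LFS_TRACE("hello");
// LFS_TRACE("hello %d", 42);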
This is the start of reworking littlefs's testing framework based on
lessons learned from the initial testing framework.
1. The testing framework needs to be _flexible_. It was hacky, which by
itself isn't a downside, but it wasn't _flexible_. This limited what
could be done with the tests and there ended up being many
workarounds just to reproduce bugs.
The idea behind this revamped framework is to separate the
description of tests (tests/test_dirs.toml) and the running of tests
(scripts/test.py).
Now, with the logic moved entirely to python, it's possible to run
the tests under varying environments. In addition to the "just don't
assert" run, I'm also looking to run the tests in valgrind for memory
checking, and an environment with simulated power-loss.
The test description can also contain abstract attributes that help
control how tests can be run, such as "leaky" to identify tests where
memory leaks are expected. This keeps test limitations at a minimum
without limiting how the tests can be run.
2. Multi-stage-process tests didn't really add value and limited what
the testing environment could do.
Unmounting + mounting can be done in a single process to test the
same logic. It would be really difficult to make this fail only
when memory is zeroed, though that can still be caught by
power-resilient tests.
Requiring every test to be a single process adds several options
for test execution, such as using a RAM-backed block device for
speed, or even running the tests on a device.
3. Added fancy assert interception. This wasn't really a requirement,
but something I've been wanting to experiment with for a while.
During testing, scripts/explode_asserts.py is added to the build
process. This is a custom C-preprocessor that parses out assert
statements and replaces them with _very_ verbose asserts that
wouldn't normally be possible with just C macros.
It even goes as far as to report the arguments to strcmp, since the
lack of visibility here was very annoying.
tests_/test_dirs.toml:186:assert: assert failed with "..", expected eq "..."
assert(strcmp(info.name, "...") == 0);
One downside is that simply parsing C in python is slower than the
entire rest of the compilation, but fortunately this can be
alleviated by parallelizing the test builds through make.
Other neat bits:
- All generated files are a suffix of the test description, which helps
cleanup and means it's (theoretically) possible to parallelize the
tests.
- The generated test.c is shoved base64-encoded into an ad-hoc Makefile, which
means it doesn't force a rebuild of tests all the time.
- Test parameterizing is now easier.
- Hopefully this framework can be repurposed also for benchmarks in the
future.
To use, compile and run with LFS_YES_TRACE defined:
make CFLAGS+=-DLFS_YES_TRACE=1 test_format
The name LFS_YES_TRACE was chosen to match the LFS_NO_DEBUG and
LFS_NO_WARN defines for the similar levels of output. The YES is
necessary to avoid a conflict with the actual LFS_TRACE macro that
gets emitted. LFS_TRACE can also be defined directly to provide
a custom trace formatter.
Hopefully having trace statements at the littlefs C API helps
debugging and reproducing issues.
In v2, the lookahead_buffer was changed from requiring 4-byte alignment
to requiring 8-byte alignment. This was not documented as well as it
could be, and as FabianInostroza noted, this also implies that
lfs_malloc must provide 8-byte alignment.
To protect against this, I've also added an assert on the alignment of
both the lookahead_size and lookahead_buffer.
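For statically allocated buffers, satisfying this looks something like the
following sketch (the 64-byte size is arbitrary):

#include <stdint.h>
#include "lfs.h"

// lookahead buffer with the required 8-byte alignment; lookahead_size must
// also be a multiple of 8
static uint64_t lookahead_buffer[64/sizeof(uint64_t)];

static const struct lfs_config cfg = {
    // ... block device callbacks and geometry ...
    .lookahead_size   = 64,
    .lookahead_buffer = lookahead_buffer,
};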
found by FabianInostroza and amitv87
The main difference here is a change away from encoding "hasorphans" and
"hasmove" bits in the tag itself. This worked with the old format, but
in the new format the space these bits take up must be consistent for
each tag type. The tradeoff is that the new tag format allows for up to
256 different global states which may be useful in the future (for
example, a global free list).
The new format encodes this info in the data blob, using an additional
word of storage. This word is actually formatted the same as though it
was a tag, which simplified internal handling and may allow other tag
types in the future.
Format for global state:
[---- 96 bits ----]
[1|- 11 -|- 10 -|- 10 -|--- 64 ---]
^ ^ ^ ^ ^- move dir pair
| | | \-------------------------- unused, must be 0s
| | \--------------------------------- move id
| \---------------------------------------- type, 0xfff for move
\--------------------------------------------- has orphans
This also included another iteration over globals (renamed to gstate)
with some simplifications to how globals are handled.
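In RAM this maps to something like the following sketch (field names here
are illustrative):

#include <stdint.h>

typedef uint32_t lfs_block_t;

// 96-bit global state: a 32-bit word formatted like a tag, plus the 64-bit
// move dir pair
typedef struct lfs_gstate {
    uint32_t tag;        // [1|has orphans][11|type][10|move id][10|zeros]
    lfs_block_t pair[2]; // metadata pair the pending move points at
} lfs_gstate_t;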
There was an interesting subtlety with the existing layout of tags that
could become a problem in the future. Basically, littlefs avoids writing to
any region of storage it is not absolutely sure has been erased
beforehand. This is a part of limiting the number of assumptions about
storage. It's possible a storage technology can't support writes without
erases in a way that is undetectable at write time (Maybe changing a bit
without an erase decreases the longevity of the information stored on
the bit).
But the existing layout had a very tiny corner case where this wasn't
true. Consider the location of the valid bit in the tag struct:
[1|--- 31 ---]
^--- valid bit
The responsibility of this bit is to indicate if an attempt has been
made to write the following commit. If it is not set (the specific value
is dependent on a previous read and identified by the preceding commit),
the assumption is that it is safe to write to the next region because it
has been erased previously. If it is set, we check if the next commit is
valid; if it isn't (because of CRC failure, likely due to power-loss), we
discard the commit. But because an attempt has been made to write to
that storage, we must then do a compaction to move to the other block in
the metadata-pair.
This plan looks good on paper, but what does it look like on storage?
The problem is that words in littlefs are little-endian. So on
storage the tag actually looks like this:
[- 8 -|- 8 -|- 8 -|1|- 7 -]
^-- valid bit
This means that we don't actually set the valid bit before writing the
tag! We write the lower bytes first. If we lose power, we may have
written 3 bytes without this fact being detectable.
We could restructure the tag to store the valid bit lower; however,
because none of the fields are 7 bits, this would make the extraction
more costly, and we would then lose the ability to check this valid bit
with a sign comparison.
The simple solution is to just store the tag in big-endian. A small
benefit is that this will actually have a negative code cost on
big-endian machines.
This mixture of endiannesses is frustrating, however it is a pragmatic
solution with only a 20-byte code size cost.
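In code, the fix boils down to a byte swap around each tag, which also
keeps the cheap sign check on the valid bit; a sketch using the
lfs_tobe32/lfs_frombe32 endianness helpers:

#include <stdint.h>
#include <stdbool.h>
#include "lfs_util.h"

// tags go to disk big-endian so the byte holding the valid bit (the msb) is
// the first byte written
static uint32_t tag_to_disk(uint32_t tag) {
    return lfs_tobe32(tag);
}

// in RAM the valid bit is still checkable with a sign comparison
static bool tag_attempted(uint32_t disk_tag) {
    uint32_t tag = lfs_frombe32(disk_tag);
    return (int32_t)tag < 0;
}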
Found while testing big-endian support. Basically, if littlefs is really
really unlucky, the block allocator could kick in while committing a
file's CTZ reference. If this happens, the block allocator will need to
traverse all CTZ skip-lists in memory, including the skip-list we're
committing. This means we can't convert the CTZ's endianness in place,
and need to make a copy on big-endian systems.
We rely on dead-code elimination from the compiler to make the
conditional behaviour for big-endian vs little-endian systems a noop
determined by the lfs_tole32 intrinsic.
In looking at the common CRC APIs out there, this seemed the most
common. At least more common than the current modified-in-place pointer
API. It also seems to have a slightly better code footprint. I'm blaming
pointer optimization issues.
One downside is that lfs_crc can't report errors, however it was already
assumed that lfs_crc cannot error.
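For reference, the adopted shape is roughly:

#include <stddef.h>
#include <stdint.h>

// seed in, buffer followed by size, new crc returned, no error path
uint32_t lfs_crc(uint32_t crc, const void *buffer, size_t size);

// vs the previous modified-in-place pointer style, roughly:
// void lfs_crc(uint32_t *crc, const void *buffer, size_t size);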
The introduction of an explicit cache_size configuration allows
customization of the cache buffers independently from the hardware
read/write sizes.
This has been one of littlefs's main handicaps. Without a distinction
between cache units and hardware limitations, littlefs isn't able to
read or program _less_ than the cache size. This leads to the
counter-intuitive case where larger cache sizes can actually be harmful,
since larger read/prog sizes require sending more data over the bus if
we're only accessing a small set of data (for example the CTZ skip-list
traversal).
This is compounded with metadata logging, since a large program size
limits the number of commits we can write out in a single metadata
block. It really doesn't make sense to link program size + cache
size here.
With a separate cache_size configuration, we can be much smarter about
what we actually read/write from disk.
This also simplifies cache handling a bit. Before there were two
possible cache sizes, but these were rarely used. Note that the
cache_size is NOT written to the superblock and can be freely changed
without breaking backwards compatibility.
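A configuration taking advantage of this might look like the following
sketch, where read_size/prog_size reflect the hardware and cache_size is
chosen independently (the numbers are arbitrary):

#include "lfs.h"

static const struct lfs_config cfg = {
    // ... block device callbacks and remaining geometry ...
    .read_size   = 16,   // minimum read from the hardware
    .prog_size   = 16,   // minimum program to the hardware
    .cache_size  = 512,  // RAM spent per cache, independent of the above
    .block_size  = 4096,
    .block_count = 256,
};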
This is a big change stemming from the fact that resizable entries
were surprisingly complicated to implement and came in with a sizable
code cost.
The theory is that the journaling has a comparable cost to resizable
entries. Both need to handle overflowing blocks, and managing offsets is
comparable to managing attribute IDs. But by jumping all the way to full
journaling, we can statically wear-level the metadata written to
metadata pairs.
The idea of journaling littlefs's metadata has been mentioned several times in
discussions and fits well into how littlefs works. You could even view the
existing metadata pairs as a log of size 2.
The downside of this approach is that changing the metadata in this way
would break compatibility with the existing layout on disk, something
that resizable entries does not do.
That being said, adopting journaling at the metadata layer offers a big
improvement to littlefs's performance and wear-leveling, with very
little cost (maybe even none or negative after resizable entries?).
- Fixed shadowed variable warnings in lfs_dir_find.
- Fixed unused parameter warnings when LFS_NO_MALLOC is enabled.
- Added extra warning flags to CFLAGS.
- Updated tests so they don't shadow the "size" variable for -Wshadow
Suggested by sn00pster, LFS_CONFIG is an opt-in user provided
configuration file that will override the util implementation in
lfs_util.h. This is useful for allowing system-specific overrides
without needing to rely on git merges or other forms of patching
for updates.
Note: It's still expected to modify lfs_util.h when porting littlefs
to a new target/system. There's just too much room for system-specific
improvements, such as taking advantage of CRC hardware.
Rather, encouraging modification of lfs_util.h and making it easy to
modify and debug should result in better integration with the consuming
systems.
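For example, opting in looks something like this, where my_lfs_config.h is
whatever system-specific header is provided:

make CFLAGS+=-DLFS_CONFIG=my_lfs_config.h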
This just adds a bunch of quality-of-life improvements that should help
development and integration in littlefs.
- Macros that require no side-effects are all-caps
- System includes are only brought in when needed
- Malloc/free wrappers
- LFS_NO_* checks for quickly disabling things at the command line
- At least a little-bit more docs
Required to support big-endian processors, with the most notable being
the PowerPC architecture.
On little-endian architectures, these conversions can be optimized out
and have no code impact.
Initial patch provided by gmouchard
This helps significantly with supporting different compilers. Intrinsics for
different compilers can be added as they are found.
Note that for ARMCC, __builtin_ctz is not used. This was the result of a
strange issue where ARMCC only emits __builtin_ctz when passed the
--gnu flag, but __builtin_clz and __builtin_popcount are always emitted.
This isn't a big problem since the ARM instruction set doesn't have a
ctz instruction, and the npw2 based implementation is one of the most
efficient.
Also note that for littlefs's purposes, we consider ctz(0) to be
undefined. This lets us save a branch in the software lfs_ctz
implementation.
This reduces the O(n^2 log n) runtime to read a file to only O(n log n).
The extra O(n) did not touch the disk, so it isn't a problem until the
files become very large, but this solution comes with very little cost.
Long story short, you can find the block index + offset pair for a
CTZ linked-list with this series of formulas:
n'   = floor(N / (B - 2w/8))
N'   = (B - 2w/8)n' + (w/8)popcount(n')
off' = N - N'
n, off = n'-1, off'+B                  if off' < 0
         n',   off'+(w/8)(ctz(n')+1)   if off' >= 0
For the long story, you will need to see the updated DESIGN.md
Before, the littlefs relied on the underlying block device
to report corruption that occurs when writing data to disk.
This requirement is easy to miss or implement incorrectly, since
the error detection is only required when a block becomes corrupted,
which is very unlikely to happen until late in the block device's
lifetime.
The littlefs can detect corruption itself by reading back written data.
This requires a bit of care to reuse the available buffers, and may rely
on checksums to avoid additional RAM requirements.
This does have a runtime penalty with the extra read operations, but
should make the littlefs much more robust to different implementations.
Adopted buffer followed by size. The other order was originally
chosen due to some other functions with a more complicated
parameter list.
This convention is important, as the bd api is one of the main
apis facing porting efforts.
This adds caching of the most recent read/program blocks, allowing
support of devices that don't have byte-level read+writes, along
with reduced device access on devices that do support byte-level
read+writes.
Note: The current implementation is a bit eager to drop caches where
it simplifies the cache layer. This layer is already complex enough.
Note: It may be worthwhile to add a compile switch for caching to
reduce code size, not sure.
Note: This does add a dependency on malloc, which could have a porting
layer, but I'm just using the functions from stdlib for now. These can be
overwritten with noops if the user controls the system, which keeps things
simple for now.
After quite a bit of prototyping, settled on the following functions:
- lfs_dir_alloc - create a new dir
- lfs_dir_fetch - load and check a dir pair from disk
- lfs_dir_commit - save a dir pair to disk
- lfs_dir_shift - shrink a dir pair to disk
- lfs_dir_append - add a dir entry, creating dirs if needed
- lfs_dir_remove - remove a dir entry, dropping dirs if needed
Additionally, followed through with a few other tweaks