This was quite a puzzle.
The problem: How do we detect corrupt mdirs?
Seems like a simple question, but we can't just rely on mdir cksums. Our
mdirs are independently updateable logs, and logs have this annoying
tendency to "rollback" to previously valid states when corrupted.
Rollback issues aren't littlefs-specific, but what _is_ littlefs-
specific is that when one mdir rolls back, it can disagree with other
mdirs, resulting in wildly incorrect filesystem state.
To solve this, or at least protect against disagreeable mdirs, we need
to somehow include the state of all other mdirs in each mdir commit.
---
The first thought: Why not use gstate?
We already have a system for storing distributed state. If we add the
xor of all of our mdir cksums, we can rebuild it during mount and verify
that nothing changed:
 .--------.   .--------.   .--------.   .--------.
.| mdir 0 |  .| mdir 1 |  .| mdir 2 |  .| mdir 3 |
||        |  ||        |  ||        |  ||        |
|| gdelta |  || gdelta |  || gdelta |  || gdelta |
|'-----|--'  |'-----|--'  |'-----|--'  |'-----|--'
'------|-'   '------|-'   '------|-'   '------|-'
'--.------'   '--.------'   '--.------'   '--.------'
 cksum |      cksum |      cksum |      cksum |
   |   |        v   |        v   |        v   |
   '---------> xor -------> xor -------> xor -------> gcksum
       |            v            v            v         =?
       '---------> xor -------> xor -------> xor ---> gcksum
Unfortunately it's not that easy. Consider what this looks like
mathematically (g is our gcksum, c_i is an mdir cksum, d_i is a
gcksumdelta, and +/-/sum is xor):
g = sum(c_i) = sum(d_i)
If we solve for a new gcksumdelta, d_i:
d_i = g' - g
d_i = g + c_i - g
d_i = c_i
The gcksum cancels itself out! We're left with an equation that depends
only on the current mdir, which doesn't help us at all.
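A minimal C sketch of this dead end, in case the algebra is too abstract. The function name xor_gcksumdelta is hypothetical and not a littlefs API, and c here stands for the xor of an mdir's old and new cksums:

```c
#include <stdint.h>

// Hypothetical helper, not a littlefs API: the gcksumdelta an mdir would
// need to store under the naive xor-only scheme.
//
// g - current gcksum (xor of all mdir cksums)
// c - xor of this mdir's old and new cksums
uint32_t xor_gcksumdelta(uint32_t g, uint32_t c) {
    uint32_t g_ = g ^ c; // new gcksum, g' = g + c_i
    return g_ ^ g;       // d_i = g' - g, the g terms cancel
}
```

No matter what g is passed in, the result is just c, so nothing about the other mdirs ends up in the delta.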
Next thought: What if we permute the gcksum with a function t before
distributing it over our gcksumdeltas?
 .--------.   .--------.   .--------.   .--------.
.| mdir 0 |  .| mdir 1 |  .| mdir 2 |  .| mdir 3 |
||        |  ||        |  ||        |  ||        |
|| gdelta |  || gdelta |  || gdelta |  || gdelta |
|'-----|--'  |'-----|--'  |'-----|--'  |'-----|--'
'------|-'   '------|-'   '------|-'   '------|-'
'--.------'   '--.------'   '--.------'   '--.------'
 cksum |      cksum |      cksum |      cksum |
   |   |        v   |        v   |        v   |
   '---------> xor -------> xor -------> xor -------> gcksum
       |            |            |            |   .--t--'
       |            |            |            |   '-> t(gcksum)
       |            v            v            v          =?
       '---------> xor -------> xor -------> xor ---> t(gcksum)
In math terms:
t(g) = t(sum(c_i)) = sum(d_i)
In order for this to work, t needs to be non-linear. If t is linear, the
same thing happens:
d_i = t(g') - t(g)
d_i = t(g + c_i) - t(g)
d_i = t(g) + t(c_i) - t(g)
d_i = t(c_i)
This was quite funny/frustrating (funnistrating?) during development,
because it means a lot of seemingly obvious functions don't work!
- t(g) = g - Doesn't work
- t(g) = crc32c(g) - Doesn't work because crc32cs are linear
- t(g) = g^2 in GF(2^n) - g^2 is linear in GF(2^n)!?
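The g^2 case is the least obvious, so here is a small C sketch of why it fails. It assumes bit i of a word is the coefficient of x^i, and psquare is a hypothetical helper name. The reduction mod the field polynomial is left out, since that step is itself linear and doesn't change the result:

```c
#include <stdint.h>

// Carry-less (polynomial) square of a over GF(2).
//
// Every cross term a_i*a_j*x^(i+j) with i != j shows up twice and
// cancels under xor, so squaring only moves bit i to bit 2i.
uint64_t psquare(uint32_t a) {
    uint64_t x = 0;
    for (int i = 0; i < 32; i++) {
        if ((a >> i) & 1) {
            x ^= (uint64_t)a << i;
        }
    }
    return x;
}
```

Since psquare(a ^ b) == psquare(a) ^ psquare(b) for every a and b, squaring distributes over xor, and the gcksum cancels out just like before.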
Fortunately, powers coprime with 2 finally give us a non-linear function
in GF(2^n), so t(g) = g^3 works:
d_i = g'^3 - g^3
d_i = (g + c_i)^3 - g^3
d_i = (g^2 + gc_i + gc_i + c_i^2)(g + c_i) - g^3
d_i = (g^2 + c_i^2)(g + c_i) - g^3
d_i = g^3 + gc_i^2 + g^2c_i + c_i^3 - g^3
d_i = gc_i^2 + g^2c_i + c_i^3
---
Bleh, now we need to implement finite-field operations? Well, not
entirely!
Note that our algorithm never uses division. This means we don't need a
full finite-field (+, -, *, /), but can get away with a finite-ring (+,
-, *). And conveniently for us, our crc32c polynomial defines a ring
epimorphic to a 31-bit finite-field.
All we need to do is define crc32c multiplication as polynomial
multiplication mod our crc32c polynomial:
crc32cmul(a, b) = pmod(pmul(a, b), P)
And since crc32c is more-or-less just pmod(x, P), this lets us take
advantage of any crc32c hardware/tables that may be available.
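Putting the pieces together, a bitwise C sketch of crc32cmul and the g^3 delta might look like the following. This is an illustration under assumptions, not littlefs's actual code: it treats bit i of a word as the coefficient of x^i (real crc32c implementations are usually bit-reflected), uses plain loops instead of crc32c hardware/tables, and the names gcksum_cube and gcksumdelta are made up for this example:

```c
#include <stdint.h>

// crc32c (Castagnoli) polynomial with the implicit x^32 term included
// assumption: bit i is the coefficient of x^i (not bit-reflected)
#define CRC32C_P UINT64_C(0x11edc6f41)

// polynomial (carry-less) multiplication over GF(2)
static uint64_t pmul(uint32_t a, uint32_t b) {
    uint64_t x = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1) {
            x ^= (uint64_t)a << i;
        }
    }
    return x;
}

// polynomial remainder of x modulo p, where p has degree 32
static uint32_t pmod(uint64_t x, uint64_t p) {
    for (int i = 63; i >= 32; i--) {
        if ((x >> i) & 1) {
            x ^= p << (i - 32);
        }
    }
    return (uint32_t)x;
}

// multiplication in the crc32c ring
uint32_t crc32cmul(uint32_t a, uint32_t b) {
    return pmod(pmul(a, b), CRC32C_P);
}

// t(g) = g^3
uint32_t gcksum_cube(uint32_t g) {
    return crc32cmul(crc32cmul(g, g), g);
}

// d_i = g*c_i^2 + g^2*c_i + c_i^3, where c is the change in an mdir's
// cksum and g is the gcksum before the change
uint32_t gcksumdelta(uint32_t g, uint32_t c) {
    uint32_t g2 = crc32cmul(g, g);
    uint32_t c2 = crc32cmul(c, c);
    return crc32cmul(g, c2) ^ crc32cmul(g2, c) ^ crc32cmul(c2, c);
}
```

For any g and c, gcksumdelta(g, c) should equal gcksum_cube(g ^ c) ^ gcksum_cube(g), matching the expansion above, and unlike the linear attempts the formula keeps its g terms.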
---
Bunch of notes:
- Our 2^n-bit crc-ring maps to a 2^n-1-bit finite-field because our crc
polynomial is defined as P(x) = Q(x)(x + 1), where Q(x) is a 2^n-1-bit
irreducible polynomial.
This is a common crc construction as it provides optimal odd-bit/2-bit
error detection, so it shouldn't be too difficult to adapt to other
crc sizes.
- t(g) = g^3 is not the only function that works, but it turns out to be
a pretty good one:
- 3 and 2^(2^n-1)-1 are coprime, which means our function t(g) = g^3
provides a one-to-one mapping in the underlying fields of all crc
rings of size 2^(2^n).
We know 3 and 2^(2^n-1)-1 are coprime because 2^(2^n-1)-1 =
(2^(2^n)-1) - 2^(2^n-1). For n >= 1, 2^(2^n)-1 is a product of Fermat
numbers that includes F_0 = 3 (A023394), so 3 divides it, while
2^(2^n-1) is a power-of-2, which 3 does not divide.
- Our delta, when viewed as a polynomial in g: d(g) = gc^2 + g^2c +
c^3, has degree 2, which implies there are at most 2 solutions or
1-bit of information loss in the underlying field.
This is optimal since the original definition already had 2
solutions before we even chose a function:
d(g) = t(g + c) - t(g)
d(g) = t(g + c) - t((g + c) - c)
d(g) = t((g + c) + c) - t(g + c)
d(g) = d(g + c)
Though note the mapping of our crc-ring to the underlying field
already represents 1-bit of information loss.
- If you're using a cryptographic hash or other non-crc, you should
probably just use an equal sized finite-field.
Though note changing from a 2^n-1-bit field to a 2^n-bit field does
change the math a bit, with t(g) = g^7 being a better non-linear
function:
- 7 is the smallest odd-number >1 coprime with 2^(2^n)-1 (for n >= 2),
which makes t(g) = g^7 a one-to-one mapping.
3 humorously divides 2^(2^n)-1 for all n >= 1, and 5 for all n >= 2.
- Expanding delta with t(g) = g^7 gives us a 6 degree polynomial,
which implies at most 6 solutions or ~3-bits of information loss.
This isn't actually the best you can do, some exhaustive searching
over small fields (<=2^16) suggests t(g) = g^(2^(n-1)-1) _might_ be
optimal, but that's a heck of a lot more multiplications.
- Because our crc32cs preserve parity/are epimorphic to parity bits,
addition (xor) and multiplication (crc32cmul) also preserve parity,
which can be used to show our entire gcksum system preserves parity.
This is quite neat, and means we are guaranteed to detect any odd
number of bit-errors across the entire filesystem.
- Another idea was to use two different addition operations: xor and
overflowing addition (or mod a prime).
This probably would have worked, but lacks the rigor of the above
solution.
- You might think an RS-like construction would help here, where g =
sum(c_ia^i), but this suffers from the same problem:
d_i = g' - g
d_i = g + c_ia^i - g
d_i = c_ia^i
Nothing here depends on anything outside of the current mdir.
- Another question is should we be using an RS-like construction anyways
to include location information in our gcksum?
Maybe in another system, but I don't think it's necessary in littlefs.
While our mdirs are independently updateable, they aren't _entirely_
independent. The location of each mdir is stored in either the mtree
or a parent mdir, so it always gets mixed into the gcksum somewhere.
The only exception being the mrootanchor which is always at the fixed
blocks 0x{0,1}.
- This does _not_ catch "global-rollback" issues, where the most recent
commit in the entire filesystem is corrupted, revealing an older, but
still valid, filesystem state.
But as far as I am aware this is just a fundamental limitation of
powerloss-resilient filesystems, short of doing destructive
operations.
At the very least, exposing the gcksum would allow the user to store
it externally and prevent this issue.
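The parity note above can be checked with a few lines of C. This is a sketch under assumptions: ringmul and parity32 are hypothetical names, bit i of a word is taken as the coefficient of x^i, and the multiply here reduces as it goes so it only needs 32-bit operations:

```c
#include <stdint.h>

// low 32 bits of the crc32c polynomial, the x^32 term is implicit
// assumption: bit i is the coefficient of x^i (not bit-reflected)
#define CRC32C_POLY UINT32_C(0x1edc6f41)

// multiplication mod the crc32c polynomial, one bit of b at a time
uint32_t ringmul(uint32_t a, uint32_t b) {
    uint32_t x = 0;
    for (int i = 31; i >= 0; i--) {
        // x = x*x^1 mod P
        uint32_t carry = x >> 31;
        x <<= 1;
        if (carry) {
            x ^= CRC32C_POLY;
        }
        // x = x + a if bit i of b is set
        if ((b >> i) & 1) {
            x ^= a;
        }
    }
    return x;
}

// 1 if v has an odd number of set bits, 0 otherwise
uint32_t parity32(uint32_t v) {
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1;
}
```

With these, parity32(a ^ b) == parity32(a) ^ parity32(b) and parity32(ringmul(a, b)) == (parity32(a) & parity32(b)), which is the sense in which both operations preserve parity: reducing mod P leaves the value mod (x + 1) unchanged because (x + 1) divides P.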
---
Implementation details:
- Our gcksumdelta depends on the rbyd's cksum, so there's a catch-22 if
we include it in the rbyd itself.
We can avoid this by including it in the commit tags (actually the
separate canonical cksum makes this easier than it would have been
earlier), but this does mean LFSR_TAG_GCKSUMDELTA is not an
LFSR_TAG_GDELTA subtype. Unfortunate but not a dealbreaker.
- Reading/writing the gcksumdelta gets a bit annoying with it not being
in the rbyd. For now I've extended the low-level lfsr_rbyd_fetch_/
lfsr_rbyd_appendcksum_ to accept an optional gcksumdelta pointer,
which is a bit awkward, but I don't know of a better solution.
- Unlike the grm, _every_ mdir commit involves the gcksum, which means
we either need to propagate the gcksumdelta up the mroot chain
correctly, or somehow keep track of partially flushed gcksumdeltas.
To make this work I modified the low-level lfsr_mdir_commit__
functions to accept start_rid=-2 to indicate when gcksumdeltas should
be flushed.
It's a bit of a hack, but I think it might make sense to extend this
to all gdeltas eventually.
The gcksum costs both code and RAM, but I think it's well worth it for
removing an entire category of filesystem corruption:
          code          stack           ctx
before:  37796           2608           620
after:   38428 (+1.7%)   2640 (+1.2%)   644 (+3.9%)
littlefs
A little fail-safe filesystem designed for microcontrollers.
   | | |     .---._____
  .-----.   |          |
--|o    |---| littlefs |
--|     |---|          |
  '-----'   '----------'
   | | |
Power-loss resilience - littlefs is designed to handle random power failures. All file operations have strong copy-on-write guarantees and if power is lost the filesystem will fall back to the last known good state.
Dynamic wear leveling - littlefs is designed with flash in mind, and provides wear leveling over dynamic blocks. Additionally, littlefs can detect bad blocks and work around them.
Bounded RAM/ROM - littlefs is designed to work with a small amount of memory. RAM usage is strictly bounded, which means RAM consumption does not change as the filesystem grows. The filesystem contains no unbounded recursion and dynamic memory is limited to configurable buffers that can be provided statically.
Example
Here's a simple example that updates a file named boot_count every time
main runs. The program can be interrupted at any time without losing track
of how many times it has been booted and without corrupting the filesystem:
#include <inttypes.h>
#include <stdio.h>
#include "lfs.h"

// variables used by the filesystem
lfs_t lfs;
lfs_file_t file;

// configuration of the filesystem is provided by this struct
const struct lfs_config cfg = {
    // block device operations
    .read  = user_provided_block_device_read,
    .prog  = user_provided_block_device_prog,
    .erase = user_provided_block_device_erase,
    .sync  = user_provided_block_device_sync,

    // block device configuration
    .read_size = 16,
    .prog_size = 16,
    .block_size = 4096,
    .block_count = 128,
    .cache_size = 16,
    .lookahead_size = 16,
    .block_cycles = 500,
};

// entry point
int main(void) {
    // mount the filesystem
    int err = lfs_mount(&lfs, &cfg);

    // reformat if we can't mount the filesystem
    // this should only happen on the first boot
    if (err) {
        lfs_format(&lfs, &cfg);
        lfs_mount(&lfs, &cfg);
    }

    // read current count
    uint32_t boot_count = 0;
    lfs_file_open(&lfs, &file, "boot_count", LFS_O_RDWR | LFS_O_CREAT);
    lfs_file_read(&lfs, &file, &boot_count, sizeof(boot_count));

    // update boot count
    boot_count += 1;
    lfs_file_rewind(&lfs, &file);
    lfs_file_write(&lfs, &file, &boot_count, sizeof(boot_count));

    // remember the storage is not updated until the file is closed successfully
    lfs_file_close(&lfs, &file);

    // release any resources we were using
    lfs_unmount(&lfs);

    // print the boot count, PRIu32 matches uint32_t on every platform
    printf("boot_count: %" PRIu32 "\n", boot_count);
}
Usage
Detailed documentation (or at least as much detail as is currently available) can be found in the comments in lfs.h.
littlefs takes in a configuration structure that defines how the filesystem operates. The configuration struct provides the filesystem with the block device operations and dimensions, tweakable parameters that tradeoff memory usage for performance, and optional static buffers if the user wants to avoid dynamic memory.
The state of the littlefs is stored in the lfs_t type which is left up
to the user to allocate, allowing multiple filesystems to be in use
simultaneously. With the lfs_t and configuration struct, a user can
format a block device or mount the filesystem.
Once mounted, the littlefs provides a full set of POSIX-like file and directory functions, with the deviation that the allocation of filesystem structures must be provided by the user.
All POSIX operations, such as remove and rename, are atomic, even in the event of power loss. Additionally, file updates are not actually committed to the filesystem until sync or close is called on the file.
Other notes
Littlefs is written in C, and specifically should compile with any compiler
that conforms to the C99 standard.
All littlefs calls have the potential to return a negative error code. The
errors can be either one of those found in the enum lfs_error in
lfs.h, or an error returned by the user's block device operations.
In the configuration struct, the prog and erase function provided by the
user may return a LFS_ERR_CORRUPT error if the implementation already can
detect corrupt blocks. However, the wear leveling does not depend on the return
code of these functions, instead all data is read back and checked for
integrity.
If your storage caches writes, make sure that the provided sync function
flushes all the data to memory and ensures that the next read fetches the data
from memory, otherwise data integrity can not be guaranteed. If the write
function does not perform caching, and therefore each read or write call
hits the memory, the sync function can simply return 0.
Design
At a high level, littlefs is a block based filesystem that uses small logs to store metadata and larger copy-on-write (COW) structures to store file data.
In littlefs, these ingredients form a sort of two-layered cake, with the small logs (called metadata pairs) providing fast updates to metadata anywhere on storage, while the COW structures store file data compactly and without any wear amplification cost.
Both of these data structures are built out of blocks, which are fed by a common block allocator. By limiting the number of erases allowed on a block per allocation, the allocator provides dynamic wear leveling over the entire filesystem.
                    root
                   .--------.--------.
                   | A'| B'|         |
                   |   |   |->       |
                   |   |   |         |
                   '--------'--------'
                .----'   '--------------.
         A      v                B      v
       .--------.--------.     .--------.--------.
       | C'| D'|         |     | E'|new|         |
       |   |   |->       |     |   | E'|->       |
       |   |   |         |     |   |   |         |
       '--------'--------'     '--------'--------'
       .-'   '--.                |   '------------------.
       v        v              .-'                      v
.--------.  .--------.         v                    .--------.
|   C    |  |   D    |   .--------.      write      | new E  |
|        |  |        |   |   E    |       ==>       |        |
|        |  |        |   |        |                 |        |
'--------'  '--------'   |        |                 '--------'
                         '--------'                 .-'    |
                         .-'    '-.   .-------------|------'
                         v        v   v             v
                   .--------.  .--------.       .--------.
                   |   F    |  |   G    |       | new F  |
                   |        |  |        |       |        |
                   |        |  |        |       |        |
                   '--------'  '--------'       '--------'
More details on how littlefs works can be found in DESIGN.md and SPEC.md.
- DESIGN.md - A fully detailed dive into how littlefs works. I would suggest reading it as the tradeoffs at work are quite interesting.
- SPEC.md - The on-disk specification of littlefs with all the nitty-gritty details. May be useful for tooling development.
Testing
The littlefs comes with a test suite designed to run on a PC using the
emulated block device found in the bd directory.
The tests assume a Linux environment and can be started with make:
make test
License
The littlefs is provided under the BSD-3-Clause license. See LICENSE.md for more information. Contributions to this project are accepted under the same license.
Individual files contain the following tag instead of the full license text.
SPDX-License-Identifier: BSD-3-Clause
This enables machine processing of license information based on the SPDX License Identifiers that are available here: http://spdx.org/licenses/
Related projects
- littlefs-fuse - A FUSE wrapper for littlefs. The project allows you to mount littlefs directly on a Linux machine. Can be useful for debugging littlefs if you have an SD card handy.
- littlefs-js - A javascript wrapper for littlefs. I'm not sure why you would want this, but it is handy for demos. You can see it in action here.
- littlefs-python - A Python wrapper for littlefs. The project allows you to create images of the filesystem on your PC. Check if littlefs will fit your needs, create images for a later download to the target memory or inspect the content of a binary image of the target memory.
- mklfs - A command line tool built by the Lua RTOS guys for making littlefs images from a host PC. Supports Windows, Mac OS, and Linux.
- Mbed OS - The easiest way to get started with littlefs is to jump into Mbed which already has block device drivers for most forms of embedded storage. littlefs is available in Mbed OS as the LittleFileSystem class.
- SPIFFS - Another excellent embedded filesystem for NOR flash. As a more traditional logging filesystem with full static wear-leveling, SPIFFS will likely outperform littlefs on small memories such as the internal flash on microcontrollers.
- Dhara - An interesting NAND flash translation layer designed for small MCUs. It offers static wear-leveling and power-resilience with only a fixed O(|address|) pointer structure stored on each block and in RAM.