- Added support for negative numbers in the leb16 encoding with an
optional 'w' prefix.
- Changed prettyasserts.py rule to .a.c => .c, allowing other .a.c files
in the future.
- Updated .gitignore with missing generated files (tags, .csv).
- Removed suite-namespacing of test symbols, this is no longer needed.
- Changed test define overrides to have higher priority than explicit
defines encoded in test ids. So:
./runners/bench_runner bench_dir_open:0f1g12gg2b8c8dgg4e0 -DREAD_SIZE=16
Behaves as expected.
Otherwise it's not easy to experiment with known failing test cases.
- Fixed issue where the -b flag ignored explicit test/bench ids.
Driven primarily by a want to compare measurements of different runtime
complexities (it's difficult to fit O(n) and O(log n) on the same plot),
this adds the ability to nest subplots in the same .svg which try to align
as much as possible. This turned out to be surprisingly complicated.
As a part of this, adopted matplotlib's relatively recent
constrained_layout, which behaves much more consistently.
Also dropped --legend-left, no one should really be using that.
The difference between ggplot's gray and GitHub's gray was a bit jarring.
This also adds --foreground and --font-color for this sort of additional
color control without needing to add a new flag for every color scheme
out there.
- Renamed struct_.py -> structs.py again.
- Removed lfs.csv, instead preferring script specific csv files.
- Added *-diff make rules for quick comparison against a previous
result, results are now implicitly written on each run.
For example, `make code` creates lfs.code.csv and prints the summary, which
can be followed by `make code-diff` to compare changes against the saved
lfs.code.csv without overwriting.
- Added nargs=? support for -s and -S, now uses a per-result _sort
attribute to decide sort if fields are unspecified.
For long running processes (testing with >1pls) these logs can grow into
multiple gigabytes, humorously we never access more than the last n lines
as requested by --context. Piping the stdout with --stdout does not use
additional RAM.
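Keeping only the last n lines of an arbitrarily large log can be sketched
with a bounded deque. This is a minimal illustration, not the code in
test.py; the helper name tail_lines is hypothetical.

```python
import collections

def tail_lines(lines, context):
    # hypothetical helper: consume a stream of lines, remembering only
    # the last `context` of them, so memory use stays bounded
    last = collections.deque(maxlen=context)
    for line in lines:
        last.append(line)
    return list(last)
```

Because the deque drops old entries as new ones arrive, memory use
depends on the context size rather than on the size of the log.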
- Fixed added/removed count in scripts when an entry has no field in
the expected results
- Fixed a python-sort-type issue when by-field is missing in a result
This happens in rare situations where there is a failed mdir relocation,
interrupted by a power-loss, containing the destination of a directory
rename operation, where the directory being renamed preceded the
relocating mdir in the mdir tail-list. This requires at some point for a
previous directory rename to create a cycle.
If this happens, it's possible for the half-orphan to contain the only
reference to the renamed directory. Since half-orphans contain outdated
state when viewed through the mdir tail-list, the renamed directory
appears to be a full-orphan until we fix the relocating half-orphan.
This causes littlefs to incorrectly remove the renamed directory from
the mdir tail-list, causing catastrophic problems down the line.
The source of the problem is that the two different types of orphans
really operate on two different levels of abstraction: half-orphans fix
failed mdir commits, while full-orphans fix directory removes/renames.
Conflating the two leads to situations where we attempt to fix assumed
problems about the directory tree before we have fixed problems with the
mdir state.
The fix here is to separate the deorphan search into two passes: a first
pass to fix half-orphans and correct any mdir-commits, restoring the
mdirs and gstate to a known good state, then a second pass to fix failed
removes/renames.
---
This was found with the -Plinear heuristic powerloss testing, which now
runs on more geometries. The failing case was:
test_relocations_reentrant_renames:112gg261dk1e3f3:123456789abcdefg1h1i1j1k1l1m1n1o1p1q1r1s1t1u1v1g2h2i2j2k2l2m2n2o2p2q2r2s2t2
Also fixed/tweaked some parts of the test framework as a part of finding
this bug:
- Fixed off-by-one in exhaustive powerloss state encoding.
- Added --gdb-powerloss-before and --gdb-powerloss-after to help debug
state changes through a failing powerloss, maybe this should be
expanded to any arbitrary powerloss number in the future.
- Added lfs_emubd_crc and lfs_emubd_bdcrc to get block/bd crcs for quick
state comparisons while debugging.
- Fixed bd read/prog/erase counts not being copied during exhaustive
powerloss testing.
- Fixed small typo in lfs_emubd trace.
- Changed --(tool)-tool to --(tool)-path in scripts, this seems to be
a more common name for this sort of flag.
- Changed BUILDDIR to not have implicit slash, makes Makefile internals
a bit more readable.
- Fixed some outdated names hidden in less-often used ifdefs.
Added a couple flags to make the script a bit more flexible, and removed
littlefs-specific default in line with the other scripts which aren't
really littlefs-specific. (These defaults can be moved to the
littlefs-specific Makefile easily enough).
The original behavior can be reproduced like so:
./scripts/changeprefix.py lfs lfs2 --git
- Fixed prettyasserts.py parsing when '->' is in expr
- Made prettyasserts.py failures not crash (yay dynamic typing)
- Fixed the initial state of the emubd disk file to match the internal
state in RAM
- Fixed true/false getting changed to True/False in test.py/bench.py
defines
- Fixed accidental substring matching in plot.py's --by comparison
- Fixed a missed LFS_BLOCK_CYCLES in test_superblocks.toml
- Changed test.py/bench.py -v to only show commands being run
Including the test output is still possible with test.py -v -O-, making
the implicit inclusion redundant and noisy.
- Added license comments to bench_runner/test_runner
Note that plotmpl.py tries to share many arguments with plot.py,
allowing plot.py to act as a sort of draft mode for previewing plots
before creating an svg.
Based loosely on Linux's perf tool, perfbd.py uses trace output with
backtraces to aggregate and show the block device usage of all functions
in a program, propagating block device operation costs up the backtrace
for each operation.
This, combined with --trace-period and --trace-freq for
sampling/filtering trace events, allows the bench-runner to very
efficiently record the general cost of block device operations with very
little overhead.
Adopted this as the default side-effect of make bench, replacing
cycle-based performance measurements which are less important for
littlefs.
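The idea of propagating block device costs up the backtrace can be
sketched as follows. This is a simplified illustration, not perfbd.py
itself; the event format and the function names in the usage note are
made up for the example.

```python
import collections

def propagate(events):
    # events: iterable of (backtrace, cost) pairs, where backtrace is a
    # list of function names and cost is e.g. the bytes read/programmed
    totals = collections.Counter()
    for backtrace, cost in events:
        # charge the cost to every function on the stack, counting a
        # function only once per event even if it recurses
        for func in set(backtrace):
            totals[func] += cost
    return totals
```

For example, propagate([(['bd_read', 'fetch', 'main'], 16),
(['bd_prog', 'main'], 32)]) charges 48 to main and 16 to fetch.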
This adds -P/--propagate and -Z/--depth to perf.py for showing recursive
results, making it easy to narrow down on where spikes in performance
come from.
This ended up being a bit different from stack.py's recursive results,
as we end up with different (diminishing) numbers as we descend.
This provides 2 things:
1. perf integration with the bench/test runners - This is a bit tricky
with perf as it doesn't have its own way to combine perf measurements
across multiple processes. perf.py works around this by writing
everything to a zip file, using flock to synchronize. As a plus, free
compression!
2. Parsing and presentation of perf results in a format consistent with
the other CSV-based tools. This actually ran into a surprising number of
issues:
- We need to process raw events to get the information we want, this
ends up being a lot of data (~16MiB at 100Hz uncompressed), so we
parallelize the parsing of each decompressed perf file.
- perf reports raw addresses post-ASLR. It does provide sym+off which
is very useful, but to find the source of static functions we need to
reverse the ASLR by finding the delta that produces the best
symbol<->addr matches.
- This isn't related to perf, but decoding dwarf line-numbers is
really complicated. You basically need to write a tiny VM.
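One plausible way to find the ASLR delta is to let every sym+off sample
vote for a candidate delta and take the most common one. This is only a
sketch under that assumption; the source says "best symbol<->addr
matches" without spelling out the scoring, and the function name and
input shapes here are hypothetical.

```python
import collections

def find_aslr_delta(static_syms, samples):
    # static_syms: {symbol name: address in the binary on disk}
    # samples: iterable of (runtime address, symbol name, offset in symbol)
    votes = collections.Counter()
    for addr, sym, off in samples:
        if sym in static_syms:
            # runtime start of the symbol minus its static address
            votes[(addr - off) - static_syms[sym]] += 1
    if not votes:
        return 0
    delta, _ = votes.most_common(1)[0]
    return delta
```

Subtracting the resulting delta from a raw address gives an address that
can be looked up in the binary's own symbol and line tables.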
This also turns on perf measurement by default for the bench-runner, but at a
low frequency (100 Hz). This can be decreased or removed in the future
if it causes any slowdown.
The main change is requiring field names for -b/-f/-s/-S, this
is a bit more powerful, and supports hidden extra fields, but
can require a bit more typing in some cases.
- Changed multi-field flags to action=append instead of comma-separated.
- Dropped short-names for geometries/powerlosses
- Renamed -Pexponential -> -Plog
- Allowed omitting the 0 for -W0/-H0/-n0 and made -j0 consistent
- Better handling of --xlim/--ylim
Instead of trying to align to block-boundaries tracebd.py now just
aliases to whatever dimensions are provided.
Also reworked how scripts handle default sizing. Now using reasonable
defaults with 0 being a placeholder for automatic sizing. The addition
of -z/--cat makes it possible to pipe directly to stdout.
Also added support for dots/braille output which can capture more
detail, though care needs to be taken to not rely on accurate coloring.
Now both scripts also fall back to guessing what fields to use based on
what fields can be converted to integers. This is more fallible, and
doesn't work for tests/benchmarks, but in those cases explicit fields
can be used (which is what would be needed without guessing anyways).
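The guessing heuristic can be sketched like this: a column counts as a
field if every value in it parses as an integer. This is a minimal
illustration, not the scripts' actual code; guess_fields is a
hypothetical name.

```python
def guess_fields(rows):
    # rows: list of dicts, e.g. as produced by csv.DictReader
    fields = []
    for name in (rows[0].keys() if rows else []):
        try:
            for row in rows:
                int(row[name], 0)  # base 0 also accepts 0x.. literals
        except (ValueError, TypeError):
            # at least one value is not an integer, not a field
            continue
        fields.append(name)
    return fields
```

A name column that happens to contain only numbers would be misdetected,
which is the sort of fallibility the explicit field flags avoid.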
These are really just different flavors of test.py and test_runner.c
without support for power-loss testing, but with support for measuring
the cumulative number of bytes read, programmed, and erased.
Note that the existing define parameterization should work perfectly
fine for running benchmarks across various dimensions:
./scripts/bench.py \
runners/bench_runner \
bench_file_read \
-gnor \
-DSIZE='range(0,131072,1024)'
Also added a couple basic benchmarks as a starting point.
- Added the littlefs license note to the scripts.
- Adopted parse_intermixed_args everywhere for more consistent arg
handling.
- Removed argparse's implicit help text formatting as it does not
work with parse_intermixed_args and breaks sometimes.
- Used string concatenation for argparse everywhere, backslashed
line continuations only work with argparse because it strips
redundant whitespace.
- Consistent argparse formatting.
- Consistent openio mode handling.
- Consistent color argument handling.
- Adopted functools.lru_cache in tracebd.py.
- Moved unicode printing behind --subscripts in tracebd.py, making all
scripts ascii by default.
- Renamed pretty_asserts.py -> prettyasserts.py.
- Renamed struct.py -> struct_.py, the original name conflicts with
Python's built in struct module in horrible ways.
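The difference between backslashed continuations and string
concatenation can be seen in a small example (the help text itself is
made up):

```python
# a backslash inside a string literal continues the literal, so the
# next line's indentation becomes part of the string
backslashed = "Show this help \
    message and exit."

# adjacent string literals are joined as-is, with no extra whitespace
concatenated = (
    "Show this help "
    "message and exit.")
```

The first form only looks right when something downstream, such as
argparse's help formatter, collapses the extra whitespace.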
The main benefit is small test ids everywhere, though this is with the
downside of needing longer names to properly prefix and avoid
collisions. But this fits into the rest of the scripts with globally
unique names a bit better. This is a C project after all.
The other small benefit is test generators may have an easier time since
per-case symbols can expect to be unique.
With more scripts generating CSV files this moves most CSV manipulation
into summary.py, which can now handle more or less any arbitrary CSV
file with arbitrary names and fields.
This also includes a bunch of additional, probably unnecessary, tweaks:
- summary.py/coverage.py use a custom fractional type for encoding
fractions, this will also be used for test counts.
- Added a smaller diff output for size scripts with the --percent flag.
- Added line and hit info to coverage.py's CSV files.
- Added --tree flag to stack.py to show only the call tree without
other noise.
- Renamed structs.py to struct.py.
- Changed a few flags around for consistency between size/summary scripts.
- Added `make sizes` alias.
- Added `make lfs.code.csv` rules
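A fractional type for results such as coverage could look roughly like
the following. This is a sketch with assumed semantics, not the type in
summary.py: here adding two fractions merges their counts rather than
computing the mathematical sum.

```python
class Frac:
    # hypothetical fractional result: a hits out of b total
    def __init__(self, a=0, b=0):
        self.a = a
        self.b = b

    def __add__(self, other):
        # assumption: merging results sums hits and totals separately
        return Frac(self.a + other.a, self.b + other.b)

    def ratio(self):
        # an empty fraction is treated as 1.0 here, an arbitrary choice
        return self.a / self.b if self.b else 1.0

    def __str__(self):
        return '%d/%d' % (self.a, self.b)
```

With this definition Frac(3, 4) + Frac(1, 4) prints as 4/8, which keeps
both the hit count and the total visible after merging.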
This is really more work for the bench runner. With this change defines
can be manipulated at a rather high level at runtime. Which should be
useful for generating benchmarks across various dimensions.
The define grammar in the test_runner is now a bit more powerful,
accepting:
1. A single value: -DN=42
2. A list of values, which get permuted: -DN=1,2,3
3. A range: -DN=range(10)
4. Some combo: -DN=1,2,range(3,0,-1)
This is more complex in the test .toml defines, which can also be C
expressions:
1. A single value: define=42
2. A single expression: define='42*42'
3. A list: define=[1,2,3]
4. A comma separated string: define='1,2,3'
5. A range: define='42*range(10)'
6. This mess: define=[1,2,'3,4,range(2)*range(2)+3']
This is probably how the test runner should have been implemented in the
first place, but it took a few tries to get here.
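The runtime -D grammar (single values, lists, ranges, and combinations)
can be sketched as below. This is an illustration only: it handles
integer literals and range(...) terms, not the C expressions accepted in
the .toml files, and the function names are hypothetical.

```python
import itertools
import re

def expand_define(value):
    # split on commas that are not inside range(...) parentheses
    out = []
    for term in re.findall(r'[^,(]+(?:\([^)]*\))?', value):
        term = term.strip()
        m = re.fullmatch(r'range\(([^)]*)\)', term)
        if m:
            # range(stop), range(start,stop), or range(start,stop,step)
            out.extend(range(*(int(a, 0) for a in m.group(1).split(','))))
        else:
            out.append(int(term, 0))
    return out

def permutations(defines):
    # defines: {name: value string}, yields one dict per permutation
    names = list(defines)
    values = [expand_define(defines[name]) for name in names]
    for combo in itertools.product(*values):
        yield dict(zip(names, combo))
```

Under these assumptions expand_define('1,2,range(3,0,-1)') gives
[1, 2, 3, 2, 1], and two defines with 2 values each give 4 permutations.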
This makes it so the test identifier, which is a bit longer now, fully
encodes the state of the defines in the test. This removes the need for
the extra geometry field and allows reproduction of tests with custom
defines at runtime.
The test runner may have already seemed like a solved problem, but these
changes are really to enable repurposing the test runner as a bench
runner.
These are just some minor quality of life improvements:
- Added a "make build-test" alias
- Made test runner a positional arg for test.py since it is almost
always required. This shortens the command line invocation most of the
time.
- Added --context to test.py
- Renamed --output in test.py to --stdout, note this still merges
stderr. Maybe at some point these should be split, but it's not really
worth it for now.
- Reworked the test_id parsing code a bit.
- Changed the test runner --step to take a range such as -s0,12,2
- Changed tracebd.py --block and --off to take ranges
Previously didn't think this would work without making test.py aware of
the number of implicit defines, which risks being incredibly fragile.
Fortunately it turns out we can defer the actual array size calculation
until the C preprocessor. This simplifies a few things.
Also added a bitmap-based caching layer for the defines. Since the test
defines have been upgraded to callbacks, recursive defines risk spending
a decent amount of time evaluating on every lookup. Some quick testing
shows 408015154 hits to 46160 misses so that's a good sign.
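A bitmap-based cache over define callbacks can be sketched as follows.
The real cache lives in the C test runner; this Python version only
illustrates the idea, and the class and method names are hypothetical.

```python
class DefineCache:
    def __init__(self, callbacks):
        # callbacks: list of zero-argument functions, one per define
        self.callbacks = callbacks
        self.values = [0] * len(callbacks)
        self.valid = 0  # bit i set means values[i] holds a cached result
        self.hits = 0
        self.misses = 0

    def get(self, i):
        if self.valid & (1 << i):
            self.hits += 1
            return self.values[i]
        # not cached yet, evaluate the callback and remember the result
        self.misses += 1
        self.values[i] = self.callbacks[i]()
        self.valid |= 1 << i
        return self.values[i]

    def flush(self):
        # clearing the bitmap invalidates every entry at once
        self.valid = 0
```

Flushing is a single assignment, which keeps invalidation cheap whenever
the set of defines changes between permutations.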
Also changed the geometries to be their own leb16-encoded part of the
test identifier. This means any geometry can be captured and reproduced
with just the test identifier. Here are the current test geometries:
./runners/test_runner --list-geometries
geometry   read  prog  erase  count     size  leb16
d,default    16    16    512   2048  1048576  g1gg2
e,eeprom      1     1    512   2048  1048576  1gg2
E,emmc      512   512    512   2048  1048576  gg2
n,nor         1     1   4096    256  1048576  1ggg1
N,nand     4096  4096  32768     32  1048576  ggg1ggg8
Based on a handful of local hacky variations, this sort of trace
rendering is surprisingly useful for getting an understanding of how
different filesystem operations interact with the underlying
block-device.
At some point it would probably be good to reimplement this in a
compiled language. Parsing and tracking the trace output quickly
becomes a bottleneck with the amount of trace output the tests
generate.
Note also that since tracebd.py runs on trace output, it can also be
used to debug logged block-device operations post-run.
This mostly involved futzing around with some of the less intuitive
parts of Unix's named-pipes behavior.
This is a bit important since the tests can quickly generate several
gigabytes of trace output.
These have no real purpose other than slowing down the simulation
for inspection/fun.
Note this did reveal an issue in pretty_asserts.py which was clobbering
feature macros. Added explicit, and maybe a bit hacky, #undef _FEATURE_H
to avoid this.
With more features being added to test.py, the one-line status is
starting to get quite long and pass the ~80 column readability
heuristic. To make this worse this clobbers the terminal output
when the terminal is not wide enough.
Simple solution is to disable line-wrapping, potentially printing
some garbage if line-wrapping-disable is not supported, but also
printing a final status update to fix any garbage and avoid a race
condition where the script would show a non-final status.
Also added --color which disables any of this attempting-to-be-clever
stuff.
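Disabling line-wrapping around a status line can be sketched with the
standard terminal auto-wrap escape sequences. This is an illustration,
not test.py's code; write_status and its wrap_control parameter are
hypothetical.

```python
import sys

def write_status(line, file=sys.stdout, wrap_control=True):
    if wrap_control:
        # \r\x1b[K returns to column 0 and clears the old status,
        # \x1b[?7l disables auto-wrap and \x1b[?7h re-enables it
        file.write('\r\x1b[K\x1b[?7l%s\x1b[?7h' % line)
    else:
        # plain output for terminals or logs without escape support
        file.write('%s\n' % line)
    file.flush()
```

A terminal that ignores the wrap sequences may still wrap a long status,
which is why a final status update is useful to clean up afterwards.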
The main change here from the previous test framework design is:
1. Powerloss testing remains in-process, speeding up testing.
2. The state of a test, including all powerlosses, is encoded in the
test id + leb16 encoded powerloss string. This means exhaustive
testing can be run in CI, but then easily reproduced locally with
full debugger support.
For example:
./scripts/test.py test_dirs#reentrant_many_dir#10#1248g1g2 --gdb
Will run the test test_dirs, case reentrant_many_dir, permutation #10,
with powerlosses at 1, 2, 4, 8, 16, and 32 cycles, dropping into gdb
if an assert fails.
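The leb16 encoding is not spelled out here, but the example above is
consistent with a little-endian base-16 scheme where 0-9a-f end a number
and g-v carry 4 bits plus a continuation flag. The sketch below assumes
that, and also assumes 'w' is followed by the magnitude of a negative
number; the function names are hypothetical.

```python
DIGITS = '0123456789abcdefghijklmnopqrstuv'

def leb16_encode(x):
    out = ''
    if x < 0:
        # assumption: 'w' marks a negative number
        out += 'w'
        x = -x
    while True:
        # low 4 bits, plus 0x10 if more nibbles follow
        out += DIGITS[(x & 0xf) | (0x10 if x > 0xf else 0)]
        if x <= 0xf:
            return out
        x >>= 4

def leb16_decode(s):
    values = []
    i = 0
    while i < len(s):
        neg = s[i] == 'w'
        if neg:
            i += 1
        x, shift = 0, 0
        while True:
            nibble = int(s[i], 32)  # base 32 covers 0-9 and a-v
            i += 1
            x |= (nibble & 0xf) << shift
            shift += 4
            if not nibble & 0x10:
                break
        values.append(-x if neg else x)
    return values
```

Under these assumptions leb16_decode('1248g1g2') yields
[1, 2, 4, 8, 16, 32], matching the powerlosses listed above.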
The changes to the block-device are a work-in-progress for a
lazily-allocated/copy-on-write block device that I'm hoping will keep
exhaustive testing relatively low-cost.
- Renamed explode_asserts.py -> pretty_asserts.py, this name is
hopefully a bit more descriptive
- Small cleanup of the parser rules
- Added recognition of memcmp/strcmp => 0 statements, generating
the relevant memory inspecting assert messages
I attempted to fix the incorrect column numbers for the generated
asserts, but unfortunately this didn't go anywhere and I don't think
it's actually possible.
There is no column control analogous to the #line directive. I thought
you might be able to intermix #line directives to put arguments at the
right column like so:
assert(a == b);
__PRETTY_ASSERT_INT_EQ(
#line 1
a,
#line 1
b);
But this doesn't work as preprocessor directives are not allowed in
macro arguments in standard C. Unfortunately this is probably not
possible to fix without better support in the language.
On one hand this isn't very different than the source annotation in
gcov, on the other hand I find it a bit more readable after a bit of
experimentation.
These scripts can't easily share the common logic, but separating
field details from the print/merge/csv logic should make the common
part of these scripts much easier to create/modify going forward.
This also tweaked the behavior of summary.py slightly.
This also adds coverage support to the new test framework, which due to
reduction in scope, no longer needs aggregation and can be much
simpler. Really all we need to do is pass --coverage to GCC, which
builds its .gcda files during testing in a multi-process-safe manner.
The addition of branch coverage leverages information that was available
in both lcov and gcov.
This was made easier with the addition of the --json-format to gcov
in GCC 9.0, however the lax backwards compatibility for gcov's
intermediary options is a bit concerning. Hopefully --json-format
sticks around for a while.
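Summarizing line and branch coverage from gcov's JSON output can be
sketched as below. The key names ("files", "lines", "count",
"branches") reflect my understanding of the format and should be checked
against the installed GCC version; summarize is a hypothetical helper,
not the code in coverage.py.

```python
import json

def summarize(gcov_json):
    # gcov_json: JSON text as emitted by gcov --json-format
    results = {}
    for file in json.loads(gcov_json)['files']:
        lines = file['lines']
        branches = [b for line in lines for b in line.get('branches', [])]
        results[file['file']] = {
            # (hits, total) pairs for lines and branches
            'lines': (sum(1 for l in lines if l['count'] > 0), len(lines)),
            'branches': (
                sum(1 for b in branches if b['count'] > 0), len(branches)),
        }
    return results
```

Each file maps to (hits, total) pairs, which is enough to derive the
percentages shown in a coverage summary.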
GCC is a bit annoying here, it can't generate .cgi files without
generating the related .o files, though I suppose the alternative risks
duplicating a large amount of compilation work (littlefs is really
a small project).
Previously we rebuilt the .o files anytime we needed .cgi files
(callgraph info used for stack.py). This changes it so we always
build .cgi files as a side-effect of compilation. This is similar
to the .d file generation, though may be annoying if the system
cc doesn't support -fcallgraph-info.
A small mistake in test.py's control flow meant the failing test job
would successfully kill all other test jobs, but then humorously start
up a new process to continue testing.
This simplifies the interaction between code generation and the
test-runner.
In theory it also reduces compilation dependencies, but internal tests
make this difficult.
This mostly required names for each test case, declarations of
previously-implicit variables since the new test framework is more
conservative with what it declares (the small extra effort to add
declarations is well worth the simplicity and improved readability),
and tweaks to work with not-really-constant defines.
Also renamed test_ -> test, replacing the old ./scripts/test.py,
unfortunately git seems to have had a hard time with this.