Guh
This may have been more work than I expected. The goal was to allowing
passing recursive results (callgraph info, structs, etc) between
scripts, which is simply not possible with csv files.
Unfortunately, this raised a number of questions: What happens if a
script receives recursive results? -d/--diff with recursive results?
How to prevent folding of ordered results (structs, hot, etc) in piped
scripts? etc.
And ended up with a significant rewrite of most of the result scripts'
internals.
Key changes:
- Most result scripts now support -O/--output-json in addition to
-o/--json, with -O/--output-json including any recursive results in
the "children" field.
- Most result scripts now support both csv and json as input to relevant
flags: -u/--use, -d/--diff, -p/--percent. This is accomplished by
looking for a '[' as the first character to decide if an input file is
json or csv.
Technically this breaks if your json has leading whitespace, but why
would you ever keep whitespace around in json? The human-editability
of json was already ruined the moment comments were disallowed.
- csv.py requires all fields to be explicitly defined, so added
-i/--enumerate, -Z/--children, and -N/--notes. At least we can provide
some reasonable defaults so you shouldn't usually need to type out the
whole field.
- Notably, the rendering scripts (plot.py, treemapd3.py, etc) and
test/bench scripts do _not_ support json. csv.py can always convert
to/from json when needed.
- The table renderer now supports diffing recursive results, which is
nice for seeing how the hot path changed in stack.py/perf.py/etc.
- Moved the -r/--hot logic up into main, so it also affects the
outputted results. Note it is impossible for -z/--depth to _not_
affect the outputted results.
- We now sort in one pass, which is in theory more efficient.
- Renamed -t/--hot -> -r/--hot and -R/--reverse-hot, matching -s/-S.
- Fixed an issue with -S/--reverse-sort where only the short form was
actually reversed (I misunderstood what argparse passes to Action
classes).
- csv.py now supports json input/output, which is funny.
Unfortunately the import sys in the argparse block was hiding missing
sys imports.
The mistake was assuming the import sys in Python would limit the scope
to that if block, but Python's late binding strikes again...
Apparently __builtins__ is a CPython implementation detail, and behaves
differently when executed vs imported???
import builtins is the correct way to go about this.
Moved local import hack behind if __name__ == "__main__"
These scripts aren't really intended to be used as python libraries.
Still, it's useful to import them for debugging and to get access to
their juicy internals.
Instead of trying to be too clever, this just adds a bunch of small
flags to control parts of table rendering:
- --no-header - Don't show the header.
- --small-header - Don't show by field names.
- --no-total - Don't show the total.
- -Q/--small-table - Equivalent to --small-header + --no-total.
Note that -Q/--small-table replaces the previous -Y/--summary +
-c/--compare hack, while also allowing a similar table style for
non-compare results.
This reverts per-result source file mapping, and tears out of a bunch of
messy dwarf parsing code. Results from the same .o file are now mapped
to the same source file.
This was just way too much complexity for slightly better result->file
mapping, which risked losing results accidentally mapped to the wrong
file.
---
I was originally going to revert all the way back to relying strictly on
the .o name and --build-dir (490e1c4) (this is the simplest solution),
but after poking around in dwarf-info a bit, I realized we do have
access to the original source file in DW_TAG_compile_unit's
DW_AT_comp_dir + DW_AT_name.
This is much simpler/more robust than parsing objdump --dwarf=rawline,
and avoid needing --build-dir in a bunch of scripts.
---
This also reverts stack.py to rely only on the .ci files. These seem as
reliable as DW_TAG_compile_unit while simplifying things significantly.
Symbol mapping used to be a problem, but this was fixed by using the
symbol in the title field instead of the label field (which strips some
optimization suffixes?)
Now that cycle detection is always done at result collection time, we
don't need this in the table renderer itself.
This had a tendency to cause problems for non-function scripts (ctx.py,
structs.py).
God, I wish Python had an OrderedSet.
This is a fix for duplicate "cycle detected" notes when using -t/--hot.
This mix of merging both _hot_notes and _notes in the HotResult class is
tricky when the underlying container is a list.
The order is unlikely to be guaranteed anyways, when different results
with different notes are folded.
And if we ever want more control over the order of notes in result
scripts we can always change this back later.
- Error on no/insufficient files.
Instead of just returning no results. This is more useful when
debugging complicated bash scripts.
- Use elf magic to allow any file order in perfbd.py/stack.py.
This was already implemented in stack.py, now also adopted in
perfbd.py.
Elf files always start with the magic string "\x7fELF", so we can use
this to figure out the types of input files without needing to rely on
argument order.
This is just one less thing to worry about when invoking these
scripts.
- Prevented childrenof memoization from hiding the source of a
detected cycle.
- Deduplicated multiple cycle detected notes.
- Fixed note rendering when last column does not have a notes list.
Currently this only happens when entry is None (no results).
It seems like a good rule of thumb is for every XInfo class to be paired
with at least a small X class wrapper:
- Easier to extend without breaking tuple unpacking everywhere
- Better code readability
- Better memory reuse in _by_addr caches (less tuple repacking)
We have this info, might as well expose it for scripts to use.
Unfortunately this extra info did make tuple unpacking a bit of a mess,
especially in scripts that don't use this extra info, so I've added a
small Sym class similar to DwarfEntry in collect_dwarf_info.
This is useful for some ongoing stack.py rework.
There is an argument for prefering nm for code size measurements due to
portability. But I'm not sure this really holds up these days with
objdump being so prevalent.
We already depend on objdump for ctx/structs/perf and other dwarf info,
so using objdump -t to get symbol information means one less tool to
depend on/pass around when cross-compiling.
As a minor benefit this also gives us more control over which sections
to include, instead of relying on nm's predefined t/r/d/b section types.
---
Note code.py/data.py did _not_ require objdump before this. They did use
objdump to map symbols to source files, but would just guess if
objdump wasn't available.
Without this, naming a column i/children/notes in csv.py could cause
things to break. Unlikely for children/notes, but very likely for i,
especially when benchmarking.
Unfortunately namedtuple makes this tricky. I _want_ to just rename
these to _i/_children/_notes and call the problem solved, but namedtuple
reserves all underscore-prefixed fields for its own use.
As a workaround, the table renderer now looks for _i/_children/_notes at
the _class_ level, as an optional name of which namedtuple field to use.
This way Result types can stay lightweight namedtuples while including
extra table rendering info without risk of conflicts.
This also makes the HotResult type a bit more funky, but that's not a
big deal.
This extends the recursive part of the table renderer to sort children
by the optional "i" field, if available.
Note this only affects children entries. The top-level entries are
strictly ordered by the relevant "by" fields. I just haven't seen a use
case for this yet, and not sorting "i" at the top-level reduces that
number of things that can go wrong for scripts without children.
---
This also rewrites -t/--hot to take advantage of children ordering by
injecting a totally-no-hacky HotResult subclass.
Now -t/--hot should be strictly ordered by the call depth! Though note
entries that share "by" fields are still merged...
This also gives us a way to introduce the "cycle detected" note and
respect -z/--depth, so overall a big improvement for -t/--hot.
We don't really need padding for the notes on the last column of tables,
which is where row-level notes end up.
This may seem minor, but not padding here avoids quite a bit of
unnecessary line wrapping in small terminals.
- Adopted higher-level collect data structures:
- high-level DwarfEntry/DwarfInfo class
- high-level SymInfo class
- high-level LineInfo class
Note these had to be moved out of function scope due to pickling
issues in perf.py/perfbd.py. These were only function-local to
minimize scope leak so this fortunately was an easy change.
- Adopted better list-default patterns in Result types:
def __new__(..., children=None):
return Result(..., children if children is not None else [])
A classic python footgun.
- Adopted notes rendering, though this is only used by ctx.py at the
moment.
- Reverted to sorting children entries, for now.
Unfortunately there's no easy way to sort the result entries in
perf.py/perfbd.py before folding. Folding is going to make a mess
of more complicated children anyways, so another solution is
needed...
And some other shared miscellany.
It looks like the failure case in our scripts' subprocess stderr
handling was not tested well during a fix to stderr blocking (a735bcd).
This code was attempting to print stderr only if an error occured, but
with stderr=None this just results in a NoneType TypeError.
In retrospect, completely hiding stderr is kind of shitty if a
subprocess fails, but it doesn't seem possible to read from both stdin
and stderr with Python's APIs without getting stuck when the stderr's
buffer is full.
It might be possible to work around this with either multithreading,
select calls, or a temp file, but I'm not sure slightly less verbose
scripts are worth the added complexity in every single subprocess call.
For now just reverting to unconditionally forwarding stderr from the
child process. This is the simplest/most robust option.
- stack.py:collect -> collect + collect_cov
- perf.py:collect_syms_and_lines -> collect_syms + collect_dwarf_lines
- perfbd.py:collect_syms_and_lines -> collect_syms + collect_dwarf_lines
This should hopefully lead to both better readability and better code
reuse.
Note collect_dwarf_lines is a bit different than collect_dwarf_files in
code.py/data.py/etc, but the extra complexity of collect_dwarf_lines is
probably not worth sharing here.
The fact that our scripts' table renderer was slightly different for
recursive scripts (stack.py, perf.py) and non-recursive scripts
(code.py, structs.py) was a ticking time bomb, one innocent edit away
from breaking half the scripts.
The makes the table renderer consistent across all scripts, allowing for
easy copy-pasting when editing at the cost of some unused code in
scripts.
One hiccup with this though is the difference in cycle detection
behavior between scripts:
- stack.py:
lfsr_bd_sync
'-> lfsr_bd_prog
'-> lfsr_bd_sync <-- cycle!
- structs.py:
lfsr_bshrub_t
'-> u
'-> bsprout
'-> u <-- not a cycle!
To solve this the table renderer now accepts a simple detect_cycles
flag, which can be set per-script.
This makes the -p/--percent flag a bit more consistent with -d/--diff
and -c/--compare, both of which change the printing strategy based on
additional context.
This showcases the sort of high-level result printing where -c/--compare
is useful:
$ make summary-diff
code data stack structs
BEFORE 57057 0 3056 1476
AFTER 68864 (+20.7%) 0 (+0.0%) 3744 (+22.5%) 1520 (+3.0%)
There was one hiccup though: how to hide the name of the first field.
It may seem minor, but the missing field name really does help
readability when you're staring at a wall of CLI output.
It's a bit of a hack, but this can now be controlled with -Y/--summary,
which has the sole purpose of disabling the first field name if mixed
with -c/--compare.
-c/--compare is already a weird case for the summary row anyways...
Example:
$ ./scripts/csv.py lfs.code.csv \
-bfunction -fsize \
-clfsr_rbyd_appendrattr
function size
lfsr_rbyd_appendrattr 3598
lfsr_mdir_commit 5176 (+43.9%)
lfsr_btree_commit__.constprop.0 3955 (+9.9%)
lfsr_file_flush_ 2729 (-24.2%)
lfsr_file_carve 2503 (-30.4%)
lfsr_mountinited 2357 (-34.5%)
... snip ...
I don't think this is immediately useful for our code/stack/etc
measurement scripts, but it's certainly useful in csv.py for comparing
results at a high level.
And by useful I mean it replaces a 40-line long awk script that has
outgrown its original purpose...
This may make some mathematician mad, but these are informative scripts.
Returning +-inf is much more useful than erroring when dealing with
several hundred rows of results.
And hey, if it's good enough for IEEE 754, it's good enough for us :)
Also fixed a division operator mismatch in RFrac that was causing
problems.
Not sure if this is an old habit from Python 2, or just because it looks
nicer next to __mul__, __mod__, etc, but in Python 3 this should be
__truediv__ (or __floordiv__), not __div__.
I still think the 24 (23+1) char minimum is a good default for 2 column
output such as help text, especially if you don't have automatic width
detection. But our result scripts need to be a bit more flexible.
Consider:
$ make summary
code data stack structs
TOTAL 68864 0 3744 1520
Vs:
$ make summary
code data stack structs
TOTAL 68864 0 3744 1520
Up until now we were just kind of working around this with cut -c 25- in
our Makefile, but now that our result scripts automatically scale the
table widths, they should really just default to whatever is the most
useful.
- RInt/RFloat now accepts implicitly castable types (mainly
RInt(RFloat(x)) and RFloat(RInt(x))).
- RInt/RFloat/RFrac are now "truthy", implements __bool__.
- More operator support for RInt/RFloat/RFrac:
- __pos__ => +a
- __neg__ => -a
- __abs__ => abs(a)
- __div__ => a/b
- __mod__ => a%b
These work in Python, but are mainly used to implement expr eval in
csv.py.
This seems like a more fitting name now that this script has evolved
into more of a general purpose high-level CSV tool.
Unfortunately this does conflict with the standard csv module in Python,
breaking every script that imports csv (which is most of them).
Fortunately, Python is flexible enough to let us remove the current
directory before imports with a bit of an ugly hack:
# prevent local imports
__import__('sys').path.pop(0)
These scripts are intended to be standalone anyways, so this is probably
a good pattern to adopt.
This matches the style used in C, which is good for consistency:
a_really_long_function_name(
double_indent_after_first_newline(
single_indent_nested_newlines))
We were already doing this for multiline control-flow statements, simply
because I'm not sure how else you could indent this without making
things really confusing:
if a_really_long_function_name(
double_indent_after_first_newline(
single_indent_nested_newlines)):
do_the_thing()
This was the only real difference style-wise between the Python code and
C code, so now both should be following roughly the same style (80 cols,
double-indent multiline exprs, prefix multiline binary ops, etc).
Mainly to avoid conflicts with match results m, this frees up the single
letter variables m for other purposes.
Choosing a two letter alias was surprisingly difficult, but mt is nice
in that it somewhat matches it (for itertools) and ft (for functools).
So now the hot path participates in sorting, folding, etc:
$ ./scripts/stack.py ./lfs.ci ./lfs_util.ci \
-Dfunction=lfsr_mount -t -sframe
function frame limit
lfsr_mount 96 2736
|-> lfsr_mdir_commit 512 2368
|-> lfsr_btree_commit__.constprop 336 1648
|-> lfs_alloc 272 1296
|-> lfsr_btree_commit 208 1856
|-> lfsr_btree_lookupnext_ 208 720
|-> lfsr_mtree_gc 192 2560
|-> lfsr_mtree_traverse 176 1024
|-> lfsr_rbyd_lookupnext 160 448
|-> lfsr_bd_readtag.constprop 128 288
|-> lfsr_mtree_lookup 128 848
|-> lfsr_bd_read 80 160
|-> lfsr_bd_read__ 80 80
|-> lfsr_fs_gc 80 2640
|-> lfsr_rbyd_sublookup 64 512
'-> lfsr_rbyd_alloc 16 1312
TOTAL 96 2736
This risks some rather unintuitive behavior now that the hot path
rendering no longer matches the call stack, but in theory the extra
sorting features are more useful?
This is a bit of an experiment, if this is more confusing than useful,
we can always revert to the strict call-order ordering.
Note that you can _usually_ get the call-order ordering by sorting by
limit, but this trick breaks if any call frames are zero sized...
This fixes an issue where mixing recursive renderers (-t/--hot or
-z/--depth) with defines (-Dfunction=lfsr_mount) would not account for
children entry widths. An unexpected side-effect of no longer filtering
the children entries.
We could continue to try to estimate the width without table rendering,
but it would basically need two full recursive pass at this point...
Instead, I've just moved the recursive stuff before table rendering,
which should remove any issues with width calculation while also
deduplicating the recursive passes.
It's invasive for a small change, but probably worthwhile long term.
The downside is this does mean our recursive scripts now build the full
table (including all recursive calls!) before they start printing. When
mixed with unbounded recursive depth (-z0 or --depth=0) this can get
quite large and cause quite a slow start.
But I guess that was the tradeoff in adopting this sort of intermediate
table rendering... At least it does make the code simpler and less bug
prone...
This makes -D/--define more useful in stack.py/perf.py/perfbd.py by no
longer hiding undfined children entries.
For example:
$ ./scripts/stack.py lfs.ci lfs_util.ci -Dfunction=lfsr_mount -t
function frame limit
lfsr_mount 96 2816
|-> lfsr_fs_gc 80 2720
|-> lfsr_mtree_gc 176 2640
|-> lfsr_mdir_commit 576 2464
... snip ...
Now shows all functions in the hot path of lfsr_mount, where before it
would only show functions in the hot path of lfsr_mount that were also
_named_ lfsr_mount.
The previous behavior was technically not wrong... but not very useful
(and confusing).
---
This was actually quite a bit annoying to get working because of the
possibility of function call cycles.
I ended up turning stack.py's result type into a fully connected graph,
which only works because Python has a cycle detector. (Actually this
script is so short-lived we probably wouldn't care if this leaked
memory.)
A nice side effect of this is now all the recursive scripts (stack.py,
perf.py, and perfbd.py) share the same internal result representation
and recursive printing logic, which is probably a good thing.
To better match -z/--depth and -t/--hot.
The fact that these short forms all don't match the first letter of the
long forms is humorous but unintentional. There's only so many letters
in the alphabet!
As a convenience, -d/--diff in our measurement scripts hides entries
that are unchanged by default.
Unfortunately this was broken during a recent refactor that ended up
filtering the line info but not the actual names.
Instead of reverting the broken part of the refactor, I've just moved the
filtering up to where we calculate the names. Hopefully this fixes the
bug while also simplifying this messy chunk of a logic a bit.
This is mainly useful for stack.py, where -t/--hot lets you quickly see
everything that contributes to the stack limit for each function.
This was (and still is) possible with -s + -z, but it was pretty
annoying to use:
- The stack trace rendered _diagonally_ as a consequence of -z, which is
probably the worst use of screen real estate.
- This trick only really worked with -s, which was the opposite order of
what you usually want on the command line: -S.
Adding a special for-purpose -t/--hot flag makes looking at the hot path
much easier, at the cost of more hacky python code (and I _mean_ hacky,
making the hot path selection useful while following exising sort rules
was annoyingly complicated).
Also added -t/--hot to perf.py and perfbd.py for consistency, though it
makes a bit less sense there.
Also also reworked related code in all three scripts: stack.py, perf.py,
perfbd.py. The logic should be a bit more equivalent, and
perf.py/perfbd.py detect cycles now.
stack.py actually already had a simple cycle detector, since we needed
one to calculate stack limits without getting stuck.
Copying this simple cycle detector into the actual table rendering code
lets us print a nice little "cycle detected" message, instead of just
vomiting to stdout forever:
$ ./scripts/stack.py lfs.ci lfs_util.ci -z -s
function frame limit
lfsr_format 320 ∞
|-> lfsr_mountinited 304 ∞
| |-> lfsr_mountmroot 80 ∞
| | |-> lfsr_mountmroot 80 ∞ (cycle detected)
| | |-> lfsr_mdir_lookup 48 576
... snip ...
The cycle detector is a bit naive, just building a new set each step,
but it gets the job done.
As for perf.py and perfbd.py, it turns out they can't actually create
cycles, so no need for a cycle detector. This is good because I didn't
really want to test these scripts again :)
code.py, specifically, was getting messed up by inconsequential GCC
objdump errors on Clang -g3 generated binaries.
Now stderr from child processes is just redirected to /dev/null when
-v/--verbose is not provided.
If we actually depended on redirecting stderr->stdout these scripts
would have been broken when -v/--verbose was provided anyways. Not
really sure what the original code was trying to do...
The original idea was to allow merging a whole bunch of different csv
results into a single lfs.csv file, but this never really happened. It's
much easier to operate on smaller context-specific csv files, where the
field prefix:
- Doesn't really add much information
- Requires more typing
- Is confusing in how it doesn't match the table field names.
We can always use summary.py -fcode_size=size to add prefixes when
necessary anyways.
We already rely on this symbol in these scripts, so might use it to
display the mathematically correct ratio for new entries.
This has the added benefit of ordering new entries vs extremely big
changes correctly:
$ ./scripts/code.py -u test.after.csv -d test.before.csv
function (1 added, 0 removed) osize nsize dsize
test_a - 49 +49 (+∞%)
test_b 19 719 +700 (+3684.2%)
test_c 91 191 +100 (+109.9%)
TOTAL 110 959 +849 (+771.8%)
This is a bit more complicated, but make testmarks really showed how
confusing this could get.
Now, instead of:
suite passed time
test_alloc 304/304 1.6 (100.0%)
test_badblocks 6880/6880 1323.3 (100.0%)
... snip ...
test_rbyd 385878/385878 592.7 (100.0%)
test_relocations 7899/7899 318.8 (100.0%)
TOTAL 548206/548206 6229.7 (100.0%)
Percents/notes are interspersed next to their relevant fields:
suite passed time
test_alloc 304/304 (100.0%) 1.6
test_badblocks 6880/6880 (100.0%) 1323.3
... snip ...
test_rbyd 385878/385878 (100.0%) 592.7
test_relocations 7899/7899 (100.0%) 318.8
TOTAL 548206/548206 (100.0%) 6229.7
Note has no effect on scripts with only a single field (code.py, etc).
But it does make multi-field diffs a bit more readable:
$ ./scripts/stack.py -u after.stack.csv -d before.stack.csv -p
function frame limit
lfsr_bd_sync 8 (+100.0%) 216 (+100.0%)
lfsr_bd_flush 40 (+25.0%) 208 (+4.0%)
... snip ...
lfsr_file_flush 32 (+0.0%) 2424 (-0.3%)
lfsr_file_flush_ 216 (-3.6%) 2392 (-0.3%)
TOTAL 9008 (+0.4%) 2600 (-0.3%)
This matches how diff percentages are rendered, and simplifies the
internal table rendering by making Frac less of a special case. It also
allows for other type notes in the future.
One concern is how all the notes are shoved to the side, which may make
it a bit harder to find related percentages. If this becomes annoying we
should probably look into interspersing all notes (including diff
percentages) between the relevant columns.
Before:
function lines branches
lfsr_rbyd_appendattr 230/231 99.6% 172/192 89.6%
lfsr_rbyd_p_recolor 33/34 97.1% 11/12 91.7%
lfs_alloc 40/42 95.2% 21/24 87.5%
lfsr_rbyd_appendcompaction 54/57 94.7% 39/42 92.9%
...
After:
function lines branches
lfsr_rbyd_appendattr 230/231 172/192 (99.6%, 89.6%)
lfsr_rbyd_p_recolor 33/34 11/12 (97.1%, 91.7%)
lfs_alloc 40/42 21/24 (95.2%, 87.5%)
lfsr_rbyd_appendcompaction 54/57 39/42 (94.7%, 92.9%)
...
Previously, with -d/--diff, we would only show non-zero percentages. But
this was ambiguous/confusing when dealing with multiple results
(stack.py, summary.py, etc).
To help with this, I've switched to showing all percentages unless all
percentages are zero (no change). This matches the -d/--diff row-hiding
logic, so by default all rows should show all percentages.
Note -p/--percent did not change, as it already showed all percentages
all of the time.
Previously, the intention of upper case -Z was the match -W/--width and
-H/--height, which are uppercase to avoid conflicts with -h/--help.
But -z/--depth isn't _really_ related to -W/-H.
This avoids a conflict with -Z/--lebesgue, but may conflict with
-z/--cat. Fortunately we don't currently have any conflicts with the
latter. Since -z/--depth and -Z/--lebesgue are both disk-layout related,
the risk of conflicts are probably much higher there.