So csv.py should now be mostly feature complete, aside from bugs.
I ended up dropping most of the bitwise operations for now. I can't
really see them being useful since csv.py and related scripts are
usually operating on purely numerical data. Worst case we can always add
them back in at some point.
I also considered dropping the logical/ternary operators, but even
though I don't see an immediate use case, the flexibility they add to a
language is too much to pass up.
Another interesting thing to note is the extension of all fold functions
to operate on exprs if more than one argument is provided:
- max(1) => 1, fold=max
- max(1, 2) => 2, fold=sum
- max(1, 2, 3) => 3, fold=sum
To be honest, this is mainly just to allow a binary max/min function
without awkward naming conflicts.
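As a rough illustration of the list above (a sketch only, not the actual csv.py code; max_expr is a made-up name, and sum as the fallback fold is taken from the fold=sum entries above):

```python
# Hypothetical sketch, not the real csv.py implementation.
def max_expr(*args):
    """Return (value, fold) for a max() call in an expr.

    With one argument, max acts as a fold function: the value passes
    through unchanged and rows are later merged with max. With more
    than one argument, max acts as an ordinary n-ary function and the
    fold falls back to sum (assumed to be the default fold).
    """
    if len(args) == 1:
        return args[0], max
    return max(args), sum

print(max_expr(1))        # value 1, folded with max
print(max_expr(1, 2, 3))  # value 3, folded with sum
```

Both builtins accept a list of values, so the returned fold can be applied directly to the values collected for a field.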
Other than those changes this was pretty simple fill-out-the-definition
work.
This was trickier than expected since Python's class scope is so
funky (I just ended up using lazy cached __get__ functions that
scan the RExpr class for tagged members), but these decorators help avoid
repeated boilerplate for common expr patterns.
We can even deduplicate binary expr parsing without sacrificing
precedence.
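To give a feel for the pattern, here is a minimal sketch of a lazy, cached descriptor that scans a class for tagged members. All names here (tagged_members, binop, Expr, Add, Mul) are hypothetical stand-ins, not the actual RExpr code:

```python
class tagged_members:
    """Descriptor that collects the owner's members carrying a given tag.

    The scan is deferred to first access because the owner class does
    not exist yet while its body is executing, and the result is cached
    per owner so the scan only happens once.
    """
    def __init__(self, tag):
        self.tag = tag
        self.cache = {}

    def __get__(self, obj, owner=None):
        if owner is None:
            owner = type(obj)
        if owner not in self.cache:
            self.cache[owner] = {
                v._name: v
                for v in vars(owner).values()
                if getattr(v, '_tag', None) == self.tag}
        return self.cache[owner]

def binop(symbol, prec):
    """Tag a nested class as a binary operator with a precedence."""
    def decorator(cls):
        cls._tag = 'binop'
        cls._name = symbol
        cls.prec = prec
        return cls
    return decorator

class Expr:
    # resolved lazily, after the class body has finished executing
    binops = tagged_members('binop')

    @binop('+', 10)
    class Add:
        @staticmethod
        def eval(a, b):
            return a + b

    @binop('*', 20)
    class Mul:
        @staticmethod
        def eval(a, b):
            return a * b

print(sorted(Expr.binops))  # the operator symbols found by the scan
```

With a table like Expr.binops, a single binary-expr parsing routine can look up each operator's precedence instead of needing one hand-written rule per operator.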
This is a work-in-progress, but the general idea is to replace the
existing rename mechanic in csv.py with a full expr parser:
$ ./scripts/csv.py input.csv -ba=x -fb=y+z
I've been putting this off for a while, as it feels like too big a jump
in complexity for what was intended to be a simple script. But
complexity is a bit funny in programming. Even if a full parser is more
difficult to implement, if it's the right grammar for the job, the
resulting script should end up both easier to understand and easier to
extend.
The original intention was that any sufficiently complicated math could
be implemented in ad-hoc Python scripts that operate directly on the CSV
files, but CSV parsing in Python is annoying enough that this never
really worked well.
But I'm probably overselling the complexity. This is classic CS stuff:
1. build a syntax tree
2. map symbols to input fields
3. typecheck, fold, eval, etc
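The steps above can be sketched in a few dozen lines. This is only a toy, assuming a four-operator grammar and float-only fields. The names (OPS, tokenize, parse, evaluate) are made up for illustration, and the typecheck and fold steps are left out:

```python
import re

# Hypothetical operator table: symbol -> (precedence, function)
OPS = {
    '+': (1, lambda a, b: a + b),
    '-': (1, lambda a, b: a - b),
    '*': (2, lambda a, b: a * b),
    '/': (2, lambda a, b: a / b),
}

def tokenize(s):
    """Split an expr into numbers, symbols, operators, and parens."""
    return re.findall(r'\d+\.?\d*|[A-Za-z_]\w*|[-+*/()]', s)

def parse_atom(tokens):
    tok = tokens.pop(0)
    if tok == '(':
        expr = parse(tokens)
        tokens.pop(0)  # consume the closing ')'
        return expr
    if tok[0].isdigit():
        return ('num', float(tok))
    return ('sym', tok)

def parse(tokens, min_prec=1):
    """1. build a syntax tree (precedence climbing, left-associative)."""
    lhs = parse_atom(tokens)
    while tokens and tokens[0] in OPS and OPS[tokens[0]][0] >= min_prec:
        op = tokens.pop(0)
        rhs = parse(tokens, OPS[op][0] + 1)
        lhs = ('op', op, lhs, rhs)
    return lhs

def evaluate(tree, fields):
    """2. map symbols to input fields, 3. eval."""
    kind = tree[0]
    if kind == 'num':
        return tree[1]
    if kind == 'sym':
        return float(fields[tree[1]])
    _, op, lhs, rhs = tree
    return OPS[op][1](evaluate(lhs, fields), evaluate(rhs, fields))

# a row as it might come out of csv.DictReader, all values are strings
row = {'y': '1', 'z': '3'}
print(evaluate(parse(tokenize('y+z*2')), row))
```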
One neat thing is that in addition to providing type and eval
information, our exprs can also provide information on how to "fold" the
field after eval. This kicks in when merging multiple rows when grouping
by -b/--by, and for finding the TOTAL results.
This can be used to merge stack results correctly with max:
$ ./scripts/csv.py stack.csv \
-fframe='sum(frame)' -flimit='max(limit)'
Or can be used to find other interesting measurements:
$ ./scripts/csv.py stack.csv \
-favg='avg(frame)' -fstddev='stddev(frame)'
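Conceptually, the merge step looks something like the sketch below. This is an illustration only: fold_rows and the "function" column are hypothetical, and the real script's data structures may differ:

```python
from collections import defaultdict

def fold_rows(rows, by, folds):
    """Group rows by the `by` field, then merge each group field-by-field
    using the fold function associated with that field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)
    return {
        key: {field: fold([r[field] for r in group])
              for field, fold in folds.items()}
        for key, group in groups.items()}

# hypothetical stack results, two rows share the same function
rows = [
    {'function': 'a', 'frame': 8,  'limit': 24},
    {'function': 'a', 'frame': 16, 'limit': 40},
    {'function': 'b', 'frame': 4,  'limit': 4},
]

# frame folds with sum, limit folds with max, as in the command above
merged = fold_rows(rows, 'function', {'frame': sum, 'limit': max})
print(merged)
```

Summing limit here would overcount, which is why the fold needs to travel with the expr rather than being fixed per script.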
These changes also make the eval order of input/output fields much
stricter, which is probably a good thing.
This should replace all of the somewhat hacky fake-expr flags in csv.py:
- --int => -fa='int(b)'
- --float => -fa='float(b)'
- --frac => -fa='frac(b)'
- --sum => -fa='sum(b)'
- --prod => -fa='prod(b)'
- --min => -fa='min(b)'
- --max => -fa='max(b)'
- --avg => -fa='avg(b)'
- --stddev => -fa='stddev(b)'
- --gmean => -fa='gmean(b)'
- --gstddev => -fa='gstddev(b)'
If you squint you might be able to see a pattern.
This seems like a more fitting name now that this script has evolved
into more of a general purpose high-level CSV tool.
Unfortunately this does conflict with the standard csv module in Python,
breaking every script that imports csv (which is most of them).
Fortunately, Python is flexible enough to let us remove the current
directory before imports with a bit of an ugly hack:
# prevent local imports
__import__('sys').path.pop(0)
These scripts are intended to be standalone anyways, so this is probably
a good pattern to adopt.