The documentation is still very much in progress, but I thought I'd share this
library/DSL that I've been working on, which is inspired by my experiences in academia
and quant finance, as well as a desire to improve on scikit-learn's "pipeline" module.
I'm hopeful that others might see the benefits of this style of writing data-modeling
pipelines, or perhaps critique it mercilessly :)<p>One point of inspiration here is my experience that many researchers and data
scientists in industry, while fastidious about the in-sample/out-of-sample distinction
for the predictive "core" of their model (i.e. the regression or the ANN or whatever),
are often less conscious of the same distinction for all of the feature-preparation
steps and prediction transformations that may precede or follow that core in their
overall pipeline. For example, before fitting your regression or whatever, you might
z-score some of the predictors, and when you later apply that now-fit regression to make
predictions on some held-out data, <i>you really ought to</i> z-score the held-out feature
values using the same means and standard deviations that were "learned" at fit-time
(they should be considered part of the "state" of your model). But this is often
inconvenient or overlooked by practitioners, who might naively z-score the held-out
batch using <i>its own</i> means and SDs, before feeding it to their trained model.<p>But if you imagine receiving the held-out data not as a batch, but as a stream of
observations, one at a time, and needing to generate a prediction for each as it arrives,
it's clear that this is not correct! (As you make each prediction, you can't know what
the held-out batch's means and SDs "will be".)<p>So with Frankenfit, I wanted to make it <i>easy</i> to write end-to-end data modeling
pipelines where this distinction between fit-time and apply-time is <i>baked in</i> all the
way through the intermediate transformations, be they z-scores, winsorizations,
imputations, or whatever. And it turns out that this can also make for very elegant
expressions of common resampling workflows like cross-validation, hyperparameter search,
sequential learning, and so on.
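<p>To make the z-scoring pitfall concrete, here's a minimal sketch in plain numpy (this is illustrative only, not Frankenfit's actual API):

```python
import numpy as np

# Simulate a training batch and a held-out batch drawn from the same
# distribution.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=100)
held_out = rng.normal(loc=5.0, scale=2.0, size=10)

# Fit-time: "learn" the mean and SD on the training data. These belong
# to the model's state.
mu, sigma = train.mean(), train.std()

# Apply-time (correct): z-score held-out values with the fit-time stats.
# This works identically whether held_out arrives as a batch or as a
# stream, one observation at a time.
z_correct = (held_out - mu) / sigma

# Apply-time (naive): z-score the held-out batch with its *own* stats.
# This is impossible to replicate in a streaming setting, because the
# batch's mean and SD aren't knowable as each observation arrives.
z_naive = (held_out - held_out.mean()) / held_out.std()

# The two transformations disagree: the naive version leaks information
# about the whole held-out batch into each transformed value.
print(np.allclose(z_correct, z_naive))  # False
```

The point of baking the fit/apply distinction into every transform is that the correct branch above is the only one the pipeline will let you express.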