Show HN: Frankenfit, a Python-based DSL for predictive data modeling pipelines

2 points by 6502nerdface over 2 years ago

1 comment

6502nerdface over 2 years ago
The documentation is still very much in progress, but I thought I'd share this library/DSL that I've been working on, which is inspired by my experiences in academia and quant finance, as well as a desire to improve on scikit-learn's "pipeline" module. I'm hopeful that others might see the benefits of this style of writing data-modeling pipelines, or perhaps critique it mercilessly :)

One point of inspiration here is my experience that many researchers and data scientists in industry, while fastidious about the in-sample/out-of-sample distinction for the predictive "core" of their model (i.e., the regression or the ANN or whatever), are often less conscious of the same distinction for all of the feature-preparation steps and prediction transformations that may precede or follow that core in their overall pipeline. For example, before fitting your regression, you might z-score some of the predictors, and when you later apply that now-fit regression to make predictions on some held-out data, *you really ought to* z-score the held-out feature values using the same means and standard deviations that were "learned" at fit time (they should be considered part of the "state" of your model). But this is often inconvenient or overlooked by practitioners, who might naively z-score the held-out batch using *its own* means and SDs before feeding it to their trained model.

But if you imagine receiving the held-out data not as a batch but as a stream of observations, one at a time, and needing to generate a prediction for each as it arrives, it's clear that this is not correct! (As you make each prediction, you can't know what the held-out batch's means and SDs "will be".)

So with Frankenfit, I wanted to make it *easy* to write end-to-end data modeling pipelines where this distinction between fit-time and apply-time is *baked in* all the way through the intermediate transformations, be they z-scores, winsorizations, imputations, or whatever. And it turns out that this can also make for very elegant expressions of common resampling workflows like cross-validation, hyperparameter search, sequential learning, and so on.
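For concreteness, here is a minimal sketch of the z-scoring pitfall described above. This is not Frankenfit's own API, just plain NumPy and scikit-learn; the pipeline and data are made up for illustration, showing the naive batch-statistics approach next to one where the fit-time statistics are kept as model state and reused at apply time:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Toy data: a training batch and a held-out batch (hypothetical).
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 3))
    y_train = rng.normal(size=200)
    X_heldout = rng.normal(loc=0.5, size=(50, 3))

    # Naive: z-score the held-out batch with its *own* means/SDs.
    # If the same rows arrived one at a time instead of as a batch,
    # these statistics wouldn't exist yet, so the predictions would differ.
    X_bad = (X_heldout - X_heldout.mean(axis=0)) / X_heldout.std(axis=0)

    # Correct: the scaler learns its means/SDs at fit time, treats them as
    # part of the fitted model's state, and reuses them at apply time.
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    preds = model.predict(X_heldout)  # scaled with train-time statistics

The point of the comment is that Frankenfit aims to make the second behavior the default for every transformation in a pipeline, not just the predictive core.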