This project reminds me a lot of Dask (https://dask.org/), a library that allows delayed calculation of complex dataframes in Python.
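For reference, the lazy style in Dask looks roughly like this (a minimal sketch):

    import dask.dataframe as dd

    # Nothing is computed here -- these calls only build a task graph.
    df = dd.read_csv("events-*.csv")
    daily_mean = df.groupby("day")["value"].mean()

    # .compute() walks the graph and materializes the actual result.
    print(daily_mean.compute())

Everything stays symbolic until you explicitly ask for a result.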
The purpose of this is discussed in their blog post, which is non-prominently linked in the README: https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/
I haven't evaluated how Hamilton is implemented specifically, but:

* It's solving a really different problem than Spark/Dask/etc. It could definitely use those tools, but it's just not the same thing.

* If you're looking at this and thinking it's useless, even if you're familiar with Pandas/dataframes, it's probably just that you haven't had to work on the types of problems this particular tool is intended to help with.
So it's like Spark for pandas? Seems like it might be better to just use Spark and, if there are features missing, build a framework on top of it to add them - that way you get a giant distributed processing engine for free. I'd be interested to know whether that was a consideration.
Could someone please help me understand what a "dataframe" is? I see this term thrown around occasionally, but failed to find a definition/explanation for someone who doesn't actually already know what it is :(
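Edit: for anyone else wondering, a dataframe is basically an in-memory two-dimensional table with labeled columns (and usually a row index), like a spreadsheet or a SQL result set. The canonical Python example is pandas:

    import pandas as pd

    # Each dict key becomes a named column; rows get an integer index.
    df = pd.DataFrame({"name": ["alice", "bob"], "score": [91, 78]})
    print(df[df["score"] > 80])  # column-wise filtering, like a WHERE clause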
Reminds me of Dagger from the recent post "An oral history of Bank Python": https://calpaterson.com/bank-python.html
This reminds me a bit of a Clojure library called Plumbing (formerly Graph): https://github.com/plumatic/plumbing. It also lets you create a DAG for structured computation; at the time, it was used for a web service.
My pynto (https://github.com/punkbrwstr/pynto) is a similar framework for creating dataframes, but using a concatenative paradigm that treats the frame as a stack of columns. Functions ("words") operate on the stack to set up the graph for each column, and execution happens afterwards in parallel. Instead of function modifiers like @does, it uses combinators to apply quoted operations to multiple columns. The postfix syntax (think PostScript or Factor) is unambiguous, if a bit old-school.
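To give a flavor of the stack-of-columns idea without the real syntax, here's a stripped-down sketch in plain Python (illustrative helpers only, not pynto's actual API):

    import pandas as pd

    def col(name):
        """A 'word' that pushes a column getter onto the stack."""
        return lambda stack: stack + [lambda df: df[name]]

    def add(stack):
        """A 'word' that pops two columns and pushes their lazy sum."""
        *rest, a, b = stack
        return rest + [lambda df: a(df) + b(df)]

    def run(words, df):
        """Compose words left to right, then evaluate each column against df."""
        stack = []
        for word in words:
            stack = word(stack)
        return pd.concat([f(df) for f in stack], axis=1)

    df = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
    print(run([col("x"), col("y"), add], df))  # postfix: "x y add"

Nothing touches the data until run(); the words only build up column expressions.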
I'm curious what the average sentiment on this is from the Stitch Fix data team. I can see the small marginal utility, but this would be a massive pain to implement. Imagine going back through thousands of lines of transforms and retrofitting the framework. Say you make a couple of small mistakes somewhere because the framework is new to you. Things seem fine at first, but weeks go by and something seems off. How do you find those mistakes?

Newly hired data scientists would have a "wtf is this thing?" response. You'd really need to "sell" people on this, and it doesn't seem worth it.
This is neat for toy problems, but I don't see it working well for "real" pipelines. The magical DAG creation is going to be super hard to wrap your head around and even worse to debug.

This reminds me of an internal Google tool for doing async programming in Java (ProducerGraph or something). The idea was that you'd just write annotated functions and the framework would handle all the async stuff. It wasted many thousands of engineering hours while giving an even worse experience than manipulating futures directly.
I think the README needs something to explain what you get when you write these functions. Apparently Hamilton then creates a DAG... OK, and what does that do for me?
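From the README, my rough understanding is: each function's name declares an output column, its parameter names declare its dependencies, and the driver wires those into a DAG and computes only the outputs you request. Paraphrasing the README's example (details may be off):

    # my_functions.py -- function names are outputs, parameter names are inputs.
    import pandas as pd

    def avg_3wk_spend(spend: pd.Series) -> pd.Series:
        """Rolling three-week average of spend."""
        return spend.rolling(3).mean()

    def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
        """Marketing efficiency: spend divided by signups."""
        return spend / signups

    # run.py -- the driver builds the DAG and executes the requested outputs.
    import pandas as pd
    from hamilton import driver
    import my_functions

    initial_data = {
        "spend": pd.Series([10, 10, 20, 40, 40, 50]),
        "signups": pd.Series([1, 10, 50, 100, 200, 400]),
    }
    dr = driver.Driver(initial_data, my_functions)
    df = dr.execute(["avg_3wk_spend", "spend_per_signup"])
    print(df)

So what the DAG apparently buys you is dependency wiring, execution ordering, and only computing the subgraph you ask for.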