
Launch HN: DAGWorks – ML platform for data science teams

182 points by krawczstef about 2 years ago
Hey HN! We're Stefan and Elijah, co-founders of DAGWorks (https://www.dagworks.io). We're on a mission to eliminate the insane inefficiency of building and maintaining ML pipelines in production.

DAGWorks is based on Hamilton, an open-source project that we created and recently forked (https://github.com/dagworks-inc/hamilton). Hamilton is a set of high-level conventions for Python functions that can be automatically converted into working ETL pipelines. To that, we're adding a closed-source offering that goes a step further, plugging these functions into a wide array of production ML stacks.

ML pipelines consist of computational steps (code + data) that produce a working statistical model that a business can use. A typical pipeline might be: (1) pull raw data (Extract), (2) transform that data into inputs for the model (Transform), (3) define a statistical model (Transform), (4) use that statistical model to predict on another data set (Transform), and (5) push that data for downstream use (Load). Instead of "pipeline" you might hear people call this "workflow", "ETL" (Extract-Transform-Load), and so on.

Maintaining these in production is insanely inefficient because you need both data scientists and software engineers to do it. Data scientists know the models and data, but most can't write the code needed to get things working in production infrastructure—for example, a lot of mid-size companies out there use Snowflake to store data, Pandas/Spark to transform it, and something like Databricks' MLflow to handle model serving. Engineers can handle the latter, but mostly aren't experts in the ML stuff. It's a classic impedance mismatch, with all the horror stories you'd expect—e.g. when data scientists make a change, engineers (or data scientists who aren't engineers) have to manually propagate the change in production. We've talked to teams who are spending as much as 50% of their time doing this. That's not just expensive, it's gruntwork—those engineers should be working on something else! Basically, maintaining ML pipelines over time sucks for most teams.

One way out is to hire people who combine both skills, i.e. data scientists who can also write production code. But these are rare and expensive, and in our experience they are usually expert at only one side of the equation and not as good at the other.

The other way is to build your own platform to automatically integrate models + data into your production stack. That way the data scientists can maintain their own work without needing to hand things off to engineers. However, most companies can't afford to make this investment, and even for the ones that can, such in-house layers tend to end up in spaghetti code and tech debt hell, because they're not the company's core product.

Elijah and I have been building data and ML tooling for the last 7 years, most recently at Stitch Fix, where we built an ML platform that served over 100 data scientists from various modeling disciplines (some of our blog posts, like [1], hit the front page of HN - thanks!). We saw first-hand the issues teams encountered with ML pipelines.

Most companies running ML in production need a ratio of 1:1 or 2:1 data scientists to engineers.
At bigger companies like Stitch Fix, the ratio is more like 10:1—way more efficient—because they can afford to build the kind of platform described above. With DAGWorks, we want to bring the power of an intuitive ML pipeline platform to all data science teams, so a 1:1 ratio is no longer required. A junior data scientist should be able to easily and safely write production code without deep knowledge of the underlying infrastructure.

We decided to build our startup around Hamilton, in large part due to the reception it got here [2] - thanks HN! We came up with Hamilton while we were at Stitch Fix (note: if you start an open-source project at an employer, we recommend forking it right away when you start a company. We only just did that and left behind ~900 stars...). We are betting on it being our abstraction layer to enable our vision of how to go about building and maintaining ML pipelines, given what we learned at Stitch Fix. We believe a solution has to have an open source component to be successful (we invite you to check out the code). Why the name DAGWorks? We named the company after Directed Acyclic Graphs because we think the DAG representation, which Hamilton also provides, is key.

A quick primer on Hamilton: with Hamilton we use a new paradigm in Python (well, not quite "new", as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of procedural code. For example, rather than writing the following pandas code:

    df['col_c'] = df['col_a'] + df['col_b']

you would write:

    def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
        """Creating column c from summing column a and column b."""
        return col_a + col_b

Then if you wanted to create a new column that used col_c, you would write:

    def col_d(col_c: pd.Series) -> pd.Series:
        # logic

These functions then define a "dataflow" or a directed acyclic graph (DAG), i.e. we can create a "graph" with nodes col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result. Since you're forced to write functions, everything becomes unit testable and documentation friendly, with the ability to display lineage. You can kind of think of Hamilton as "DBT for Python functions", if you know what DBT is. Have we piqued your interest? Want to go play with Hamilton? We created https://www.tryhamilton.dev/ leveraging Pyodide (note it can take a while to load) so you can play around with the basics without leaving your browser - it even works on mobile!

What we think is cool about Hamilton is that you don't need to specify an "explicit pipeline declaration step", because it's all encoded in the function and parameter names! Moreover, everything is encapsulated in functions. So from a framework perspective, if we wanted to (for example) log timing information, introspect inputs/outputs, or delegate a function to Dask or Ray, we can inject that at the framework level, without having to pollute user code. Additionally, we can expose "decorators" (e.g. @tag(...)) that specify extra metadata to annotate the DAG with, or for use at run time.
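To make that concrete, here is a minimal, illustrative sketch of how a module of such functions gets run with Hamilton's Driver. The module name and the inputs are made up for the example, and the exact API may have evolved, so treat this as a sketch rather than canonical usage:

    import pandas as pd
    from hamilton import driver

    import my_functions  # hypothetical module holding col_c and col_d from above

    # Build the DAG: nodes and edges are inferred from function and parameter names.
    dr = driver.Driver({}, my_functions)

    # Ask for the outputs we want; Hamilton resolves the execution order.
    df = dr.execute(
        ["col_c", "col_d"],
        inputs={"col_a": pd.Series([1, 2, 3]), "col_b": pd.Series([4, 5, 6])},
    )
    print(df)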
This is where our DAGWorks Platform fits in: it provides off-the-shelf, closed-source extras at that framework level.

Now, for those of you thinking there's a lot of competition in this space, or that what we're proposing sounds very similar to existing solutions, here are some thoughts to help distinguish Hamilton from other approaches/technologies: (1) Hamilton's core design principle is helping people write more maintainable code; at a nuts-and-bolts level, what Hamilton replaces is the procedural code one would otherwise write. (2) Hamilton runs anywhere that Python runs: a notebook, a Python script, within Airflow, within your Python web service, PySpark, etc. For example, people use Hamilton for executing code in batch tasks and in online web services. (3) Hamilton doesn't replace a macro orchestration system like Airflow, Prefect, Dagster, Metaflow, ZenML, etc.; it runs within/uses them. Hamilton helps you model not only the micro - e.g. feature engineering - but also the macro - e.g. model pipelines. That said, given how big machines are these days, model pipelines can commonly run on a single machine - Hamilton is perfect for this. (4) Hamilton doesn't replace things like Dask, Ray, or Spark -- it can run on them, or delegate to them. (5) Hamilton isn't just for building dataframes, though it's quite good for that; you can model any Python object creation with it. Hamilton is data type agnostic.

Our closed-source offering is currently in private beta, but we'd love to include you in it (see next paragraph). Hamilton is free to use (BSD-3 license) and we're investing in it heavily. We're still working through pricing options for the closed-source platform; we think we'll follow the lead of others in the space, like Weights & Biases and Hex.tech, in how they price. For those interested, here's a video walkthrough of Hamilton, which includes a teaser of what we're building on the closed-source side: https://www.loom.com/share/5d30a96b3261490d91713a18ab27d3b7.

Lastly: (1) We'd love feedback on Hamilton (https://github.com/dagworks-inc/hamilton) and on any of the above, and what we could do better. To stress the importance of your feedback: we're going all-in on Hamilton. If Hamilton fails, DAGWorks fails. Given that Hamilton is a bit of a "swiss army knife" in what you could do with it, we need help prioritizing features. E.g. we just released experimental PySpark UDF map support; is that useful? Or perhaps you have streaming feature engineering needs where we could add better support? Or you want a feature to auto-generate unit test stubs? Or maybe you are doing a lot of time-series forecasting and want more power features in Hamilton to help you manage inputs to your model? We'd love to hear from you! (2) For those interested in the closed-source DAGWorks Platform, you can sign up for early access via www.dagworks.io (leave your email, or schedule a call with me) – we apologize for not having a self-serve way to onboard just yet. (3) If there's something this post hasn't answered, do ask, and we'll try to give you an answer!
We look forward to any and all of your comments!

[1] https://news.ycombinator.com/item?id=29417998
[2] https://news.ycombinator.com/item?id=29158021

15 comments

nerdponx about 2 years ago
Data scientist here, stuck in the Dark Ages of "deploying" my models by writing bespoke Python apps that run on some kind of cloud container host like ECS. Dump the outputs to blob storage and slurp them back into the data warehouse nightly using Airflow. Lots of manual fussing around.

What the heck are all these ML and data platforms, how do they benefit me, and how do I evaluate the gazillion options that seem to be out there?

For example, I recently came across DStack (https://dstack.ai/) and have had an open browser tab sitting around waiting for me to figure out WTF it even does. DAGWorks seems like it does something similar. Is that true? Are these tools even comparable? How would I choose one or the other? Is there overlap with MLFlow?
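For concreteness, the nightly pattern described above looks roughly like the following Airflow sketch; the DAG id, task names, and the two helper functions are hypothetical placeholders, not anything from DAGWorks or dstack:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def score_and_dump_to_blob():
        """Run the model app and write its outputs to blob storage (placeholder)."""
        ...

    def load_blob_into_warehouse():
        """Copy last night's outputs from blob storage into the warehouse (placeholder)."""
        ...

    with DAG(
        dag_id="nightly_model_scores",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        dump = PythonOperator(task_id="score_and_dump", python_callable=score_and_dump_to_blob)
        load = PythonOperator(task_id="load_warehouse", python_callable=load_blob_into_warehouse)
        dump >> load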
sidlls about 2 years ago
In my experience building the pipeline and related infrastructure is not trivial, but it’s also a relatively tiny problem compared to, well, everything else. That is, acquiring and moving data around, managing the data over the lifetime of a model’s use, and serving adjacent needs (e.g. post-deployment analytics). How does DAGWorks help with all the rest of this stuff?
aldanor about 2 years ago
> With Hamilton we use a new paradigm in Python (well not quite "new" as pytest fixtures use this approach) for defining model pipelines. Users write declarative functions instead of writing procedural code. For example, rather than writing the following pandas code

> These functions then define a "dataflow" or a directed acyclic graph (DAG), i.e. we can create a "graph" with nodes: col_a, col_b, col_c, and col_d, and connect them with edges to know the order in which to call the functions to compute any result.

This 'new paradigm' already exists in Polars. Within the scope of a local machine, you can write declarative expressions which can then be used pretty much anywhere for querying instead of the usual arrays and series (arguments to filter/apply/groupby/agg/select etc.), allowing it to build an execution graph for each query, optimise it and parallelise it, and try to only run through the data once if possible, without cloning. E.g. the example above can be written simply as

    col_c = (pl.col('a') + pl.col('b')).alias('c')

It is obviously restricted to what is supported in Polars, but a surprising amount of typical data munging can be done with incredible efficiency, both CPU- and RAM-wise.
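For readers who haven't used that style, here is a small self-contained sketch with made-up data; the expression only describes the computation, and the lazy engine plans, optimises, and executes it when collect() is called:

    import polars as pl

    # An expression is just a description of a computation; nothing runs yet.
    col_c = (pl.col("a") + pl.col("b")).alias("c")

    lazy = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}).lazy()

    # The lazy engine builds a query plan from the expressions, optimises it,
    # and only materialises the result on .collect().
    result = lazy.with_columns(col_c).filter(pl.col("c") > 15).collect()
    print(result)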
ropeladder about 2 years ago
Congrats on the launch, guys! Hamilton was the first MLOps library that really seemed to fit the challenges we face, because it offered a more granular way to structure our code. Really excited to see what other tools are on the way.
data_ders about 2 years ago
Yo, congrats again on the launch! Anders from dbt Labs here with a "tough" question for you. Apologies 1) for my response being half-baked, and 2) if I haven't done my homework about Hamilton's features.

Coincidentally, my PR to the dbt viewpoint was closed by the docs team as "closed, won't do" [1].

I really like the convention of a data plane (where you describe how the data should be transformed) and a control plane (i.e. the configuration of the DAG: do this before this). In this paradigm, I believe that the control plane should be as simple as possible, and perhaps even limited in what can be done, with the goal of pushing the user to treat data transformation as paramount. Maybe this is why I fell in love with dbt in the first place: it does exactly this.

"Spicy" take: allowing users to write imperative code (e.g. using loops) that dynamically generates DAGs is never a good idea. I say this as someone who personally used to pester framework PMs for this exact feature. While things like task groups (formerly subDAGs) [2] initially appear to be the right answer, I always ended up regretting them. They're a scheduling/orchestration solution to a data transformation problem.

Can y'all speak to how Hamilton views the data and control plane, and how its design philosophy encourages users to use the right tool for the job?

P.S. Thanks for humoring my pedantry and merging this! [3]

[1]: https://github.com/dbt-labs/docs.getdbt.com/pull/2390
[2]: http://apache-airflow-docs.s3-website.eu-central-1.amazonaws.com/docs/apache-airflow/latest/core-concepts/dags.html#taskgroups-vs-subdags
[3]: https://github.com/DAGWorks-Inc/hamilton/pull/105
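For readers who haven't run into it, the loop-generated-DAG pattern being criticised looks roughly like this in an orchestrator such as Airflow (a hypothetical sketch; the table list and the callable are invented):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def load_table(table: str) -> None:
        """Placeholder for per-table transformation logic."""
        ...

    # Imperative, dynamically generated DAG: the graph's shape is decided by
    # Python code at parse time rather than declared by the transformations themselves.
    with DAG(
        dag_id="dynamic_loads",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        for table in ["users", "orders", "events"]:
            PythonOperator(
                task_id=f"load_{table}",
                python_callable=load_table,
                op_kwargs={"table": table},
            )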
eduren about 2 years ago
Hey Stefan and Elijah, I really like the approach you're taking, especially with Hamilton being the open core.

I've got recent experience with data eng / pipeline startups and am wondering if you are hiring for your first engineers at this time.
ZeroCool2u about 2 years ago
Any thoughts on how DAGWorks compares to something like Domino Data Lab [1]?

1: https://docs.dominodatalab.com/en/latest/user_guide/bc1c6d/step-0--orient-yourself-to-domino/
joshhart about 2 years ago
Congrats Stefan, from someone working at a competitor - always good to see more tools for production ML.
slotrans about 2 years ago
As a long-time fan of DAG-oriented tools, congrats on the launch. Maybe you can get added to https://github.com/pditommaso/awesome-pipeline now or in the future...

This is a problem space I've worked in and been thinking about for a very, very long time. I've extensively used Airflow (bad), DBT (good-ish), Luigi (good), drake (abandoned), tested many more, and written two of my own.

It's important to remember that DAG tools exist to solve two primary problems that arise from one underlying cause. Those problems are: 1) getting parallelism and execution ordering automatically (i.e. declaratively) based on the structure of dependencies, and 2) being able to resume a partially-failed run. The underlying cause is that data processing jobs take significant wall-clock time (minutes, hours, even days), so we want to use resources efficiently and avoid re-computing things.

Any DAG tool that doesn't solve these problems is unlikely to be useful. From your docs, I don't see anything on either of those topics, so not off to a strong start. Perhaps you have that functionality but haven't documented it yet? I can imagine the parallelism piece being there but just not stated, but the "resumption from partial failure" piece needs to be spelled out. Anyway, something to consider.

A couple more things...

It looks like you've gone the route of expressing dependencies only "locally". That is, when I define a computation, I indicate what it depends on *there*, right next to the definition. DBT and Luigi work this way also. Airflow, by contrast, defines dependencies centrally, as you add task instances to a DAG object. There is no right answer here, only tradeoffs. One thing to be aware of is that when using the "local" style, as a project grows big (*glances at 380-model DBT project...*), understanding its execution flow at a high level becomes a struggle, and is often only solvable through visualization tools. I see you have Graphviz output, which is great. I recommend investing heavily in visualization tooling (DBT's graph browser, for example).

I don't see any mention of development workflow. As a few examples: DBT has rich model selection features that let you run one model, all its ancestors, all its descendants, all models with a tag, and so on. Luigi lets you invoke any task as a terminal task, using a handy auto-generated CLI. Airflow lets you... run a single task, and that's it. This makes a BIG DIFFERENCE. Developers -- be they scientists or engineers -- will need to run arbitrary subgraphs while they fiddle with stuff, and the easier you make that, the more they will love your tool.

Another thing I notice is that it seems like your model is oriented around flowing data through the program, as arguments / return values (similar to Prefect, and of course Spark). This is fine as far as it goes, but consider that much of what we deal with in data is 1) far too big for this to work and/or 2) processed elsewhere, e.g. in a SQL query. You should think about, and document, how you handle dependencies that exist in the World State rather than in memory. This intersects with how you model and keep track of task state. Airflow keeps task state in a database. DBT keeps task state in memory. Luigi tracks task state through Targets, which typically live in the World State. Again there's no right or wrong here, only tradeoffs, but leaning on *durable* records of task state directly facilitates "resumption from partial failure", as mentioned above.

Best of luck.
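To illustrate the durable-task-state point, here is a rough Luigi-style sketch (task, parameter, and file names are invented): each task's completeness is simply "does its Target exist?", so re-running after a partial failure skips everything that already produced output.

    import datetime

    import luigi

    class Extract(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            # The Target lives in the World State (here, the local filesystem);
            # its existence is the durable record that this task completed.
            return luigi.LocalTarget(f"raw_{self.date}.csv")

        def run(self):
            with self.output().open("w") as f:
                f.write("col_a,col_b\n1,2\n")

    class Transform(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return Extract(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"features_{self.date}.csv")

        def run(self):
            # Placeholder transformation: copy the extracted file through.
            with self.input().open() as fin, self.output().open("w") as fout:
                fout.write(fin.read())

    if __name__ == "__main__":
        # Tasks whose outputs already exist are skipped on re-runs.
        luigi.build([Transform(date=datetime.date(2023, 3, 7))], local_scheduler=True)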
marsupialtail_2 about 2 years ago
would love to collaborate on an integration with pyquokka (https://github.com/marsupialtail/quokka) once I put out a stable release end of this month :-)
jdonaldson about 2 years ago
Can this be set up to yield data from individual functions instead of simply returning it?
cbb330 about 2 years ago
How can I convince someone to try this, if they are comparing this solution with dbt? Can only pick one :)
nothrowaways about 2 years ago
I honestly prefer the first approach in the example given.
ericcolson about 2 years ago
love the transparency this brings. Any 3rd party tools with plans to integrate with it? (e.g. analytics layer companies?)
sampo about 2 years ago
> Most companies running ML in production need a ratio of 1:1 or 1:2 data scientists to engineers. At bigger companies like Stitch Fix, the ratio is more like 1:10 — way more efficient

Did you write these wrong way round, maybe? Or are you saying a ratio of 1 data scientist to 10 engineers is efficient?