I recently set up some Prefect pipelines, so I can compare notes with this article. Note that while I'm not new to data engineering, I'm new to open source frameworks, though I do have some insight into Airflow (I've studied its architecture in depth and written a lot of code in it).

Prefect is generally very easy to use. Essentially, you: (a) write a Python-based flow, which defines some job to run (with subtasks), (b) turn on an orchestrator on a server somewhere, (c) turn on an agent on a server somewhere (to run the flow when instructed by the orchestrator), and (d) connect to the orchestrator, build & apply a deployment, and run it.

I find the docs a little half-baked right now. One example: cron schedules, which one would think are essential to something like Prefect, basically couldn't be set up (as of a month ago) without touching the Prefect UI. This is extremely odd.

I also found it fairly confusing which components were supposed to be checked into source control, and which weren't. I blame this on Python deployment generally being very odd and confusing, but the Prefect docs don't make it any clearer. Prefect assumes there's an S3-like storage that both the submitting computer (my laptop) and the orchestrator (the server) can access.

Overall I find it quite handy, and probably won't switch. It feels more lightweight than, say, using full Docker containers, which we probably don't need right now. The UI is nicer than Airflow's, and the orchestrator & agent are much easier on resources. It feels more reproducible. I haven't tried Prefect Cloud, and we're unlikely to (security & cost are the main reasons).
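For flavor, step (a) boils down to something like this minimal sketch (assuming Prefect 2.x; the task and flow names are illustrative, not from any real pipeline):

```python
# Minimal sketch of a Prefect flow with subtasks (assumes prefect >= 2.0).
from prefect import flow, task

@task
def extract():
    # a real task might pull rows from an API or database
    return [1, 2, 3]

@task
def load(rows):
    print(f"loaded {len(rows)} rows")

@flow
def etl():
    # tasks called inside a flow run sequentially and return results directly
    load(extract())

if __name__ == "__main__":
    etl()  # runs locally; a deployment hands this to an agent instead
```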
I hear grumbling from a friend on the data engineering side of my company. It takes an enormous amount of effort to stay on top of their data pipelines, and they still have lots of failures in cleaning, transformation, orchestration, and reporting. One product wasn't updated in 5 months!

They've tried everything: Airflow, Informatica, Alteryx, etc.
They've even built their own custom data-flow ETL in Python.

I often wonder if the real issues they face are more about expectations and standards, such as centralized logging, easy report/artifact generation, ops management, and hiring more developer-oriented data engineers.
Dagster's product is great, but comparing it to MWAA is unfair. MWAA is a poor-quality product: difficult to use, unstable, inflexible, and poorly supported. A fairer comparison would be against Astronomer, which is a MUCH better product than MWAA.
I tried (and tried, and tried, and tried) once to set up a local "test" instance of Airflow, just to try out a few different things and better understand how the whole system worked. I finally gave up after a week. I've never before come across software that I couldn't install, but Airflow ended up being too much.
The author seems to think that Dagster or Prefect will overtake Airflow; I don't think this is true. All of them being open source means that if one has a good idea or a better way of doing something, the others can quickly implement the feature and even reuse some of the same code. We saw it with Airflow implementing the TaskFlow API as a response to Dagster, and in a few weeks Airflow 2.4 is going to ship dataset scheduling.
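For reference, here is a minimal sketch of that TaskFlow style (assuming Airflow >= 2.0; the DAG and task names are illustrative):

```python
# Minimal sketch of Airflow's TaskFlow API (assumes Airflow >= 2.0).
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)
def taskflow_example():
    @task
    def extract():
        return {"rows": 3}

    @task
    def load(payload: dict):
        print(f"loaded {payload['rows']} rows")

    load(extract())  # XComs are passed implicitly, Dagster/Prefect-style

example_dag = taskflow_example()  # registers the DAG at module level
```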
So Airflow's head start is going to be extremely hard to overcome, if they remain adaptable.

Also, as others have mentioned, the comparison to MWAA is unfair. The true owner of Airflow is Astronomer, as they have over 50% of the commits to Airflow, and Astronomer is a much better product than MWAA.
Sounds like a very positive experience all around, but it doesn't go too deep into Dagster itself yet. I feel comparing MWAA to Dagster is not an even footing; I'd be interested in seeing how Astronomer has improved the Airflow experience.
An often-overlooked framework, used by NASA among others, is Kedro: https://github.com/kedro-org/kedro. Kedro is probably the simplest set of abstractions for building pipelines, but it doesn't attempt to kill Airflow. It even has an Airflow plugin that lets it be used as a DSL for building Airflow pipelines, or plug into whichever production orchestration system is needed.
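As a taste of those abstractions, a minimal sketch (assuming a kedro 0.18-style API; the function and dataset names are illustrative):

```python
# Minimal sketch of Kedro's core abstraction: plain Python functions
# wired into a pipeline of nodes (assumes kedro >= 0.18).
from kedro.pipeline import node, pipeline

def clean(raw_df):
    return raw_df.dropna()

def summarize(clean_df):
    return clean_df.describe()

# dataset names like "raw_data" are resolved via Kedro's Data Catalog
data_pipeline = pipeline([
    node(clean, inputs="raw_data", outputs="clean_data"),
    node(summarize, inputs="clean_data", outputs="summary"),
])
```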
Where is a service that doesn't focus specifically on

1.) data pipelines / data science

2.) CI/CD / build pipelines

3.) ... you name it

I mean, just a service that gives me all the groundwork to build one of the above myself. Is there something like that?
I've been using Ploomber over the last year to build ML pipelines. It works well for both dev and prod workflows; the other frameworks were too bulky for a small team with little infra support.
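For context, a rough sketch of Ploomber's Python API (the YAML-based spec is the more common entry point; the task and file names here are illustrative):

```python
# Rough sketch of Ploomber's Python API; tasks write to declared products.
import pandas as pd
from ploomber import DAG
from ploomber.products import File
from ploomber.tasks import PythonCallable

def make_data(product):
    # each task receives its product (output location) as an argument
    pd.DataFrame({"x": [1, 2, 3]}).to_csv(str(product), index=False)

dag = DAG()
PythonCallable(make_data, File("data.csv"), dag, name="make_data")
dag.build()
```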
Disclaimer: I am building a no-code, SQL-focused data pipeline platform to improve the experience around data pipelines; see my profile for more. The idea is to replace all of Fivetran, dbt, Airflow and more with a single platform that can handle everything with no code.

I like that the author's articles walk the reader through their own journey, along with the learnings and opinions after each trial period. I am curious to hear how Dagster will evolve for their usage.

One big discussion point among the data folks I have been talking to is that people who haven't done ops before underestimate the amount of operational work that goes into managing an Airflow instance. There is still quite a lot to figure out:
- how do I get my pipeline code there?
- where do I execute them?
- how do I set up my development environment?
- how do I make sure the platform is up and running?
- how do I know if a task fails?
- where do I store my logs?
- how do I scale my setup?

If you are using a managed solution like MWAA or Cloud Composer, some of these questions go away, but not all of them. As it stands today, Airflow is a powerful but hard-to-use technology; in my opinion, it is less a tool to be used directly by data engineers / analysts, and more a platform that should enable easier-to-use tools for its internal users.

In that sense, I believe Dagster is hitting the right chord: they focus on the pain points in DX for Airflow and similar solutions, they have figured out how to do development branches, they are focusing on assets rather than tasks (a rough sketch of the asset style is at the end of this comment), and they are constantly improving their product as far as I can tell from the outside. However, it is still a platform that you have to have engineers writing code for: asking a data analyst to write Python code to schedule a few SQL queries still adds a huge barrier to entry.

I am excited to see all the innovation that is happening in this space. I find the point about people getting used to the baggage around Airflow to be quite a real problem, and I am very happy to see solutions like Dagster gaining speed. All in all, it is a very large space, and there are many different problems to solve to bring the data capabilities larger organizations enjoy to smaller companies with limited budgets.
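To make the assets-vs-tasks point concrete, a minimal sketch of Dagster's asset style (assuming dagster 1.x; the asset names and logic are purely illustrative):

```python
# Minimal sketch of Dagster's asset-centric style (assumes dagster >= 1.0).
from dagster import asset

@asset
def raw_orders():
    # in a real pipeline this might query a warehouse or hit an API
    return [{"id": 1, "total": 9.99}, {"id": 2, "total": 4.50}]

@asset
def order_summary(raw_orders):
    # Dagster infers the dependency on raw_orders from the parameter name
    return {"count": len(raw_orders),
            "revenue": sum(o["total"] for o in raw_orders)}
```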