I think you're 100% right that the tasks that can be accomplished in Airflow are currently being unbundled by tools in the modern data stack, but that doesn't erase the need for tools like Airflow. Sure, you can now write less code to load your data, transform it, and send it out to other tools. But as the unbundling occurs, the end result is more fragmentation and fragility in how teams manage their data.

Data teams I talk to can't turn to any single location to see every touchpoint their data goes through. They're relying on each tool's independent scheduling system and hoping that everything runs at the right time without errors. If something breaks, bad data gets deployed and it becomes a mad scramble to verify which tool caused the error and which reports/dashboards/ML models/etc. were impacted downstream.

While these unbundled tools can get you 90% of the way to your desired end goal, you'll inevitably face a situation where your use case or SaaS tool is unsupported. In every situation like this I've ever faced, the team ultimately ends up writing and managing its own custom scripts to cover the gap. Now you have your unbundled tool plus your custom script. Why not just manage all of the tools and your scripts from a single source in the first place?

While unbundling is the reality, this new era of data technology will always still have a need for data orchestration tools that serve as a centralized view into your data workflows, whether that's Airflow or any of the new players in the space.

(Disclosure: I'm a co-founder of https://www.shipyardapp.com/, building better data orchestration for modern data teams.)
As a newcomer to the world of data, I have no strong opinions about Airflow. It replaced a bunch of disparate cron jobs, so it's definitely better than what was there before.

There are things I like and things I don't. The UI is awful -- I don't know anyone who likes it, contrary to what the article states. I do like that it's centralized and that it's all Python code.

Deploying it and fine-tuning the config for a variety of workloads can be a pain. Sometimes sensors don't work right. Tasks sometimes get evicted and killed for obscure reasons. Zombie tasks are a big enough pain that you'll see plenty of requests for help online.

That said, replacing it with a bunch of disparate tools again? Seems like a step backwards. And now, instead of a single tool, your org has to vet, secure, understand, and monitor a bunch of different tools? It's bad enough with only one...

What am I missing?

PS: Data analysis/engineering as a field seems new and immature enough that, in my humble opinion, we should be focusing on developing good practices and theory instead of deprecating existing (and pretty recent) tech at an ever-increasing pace.
This post is hard to follow, but I'll give my unsolicited opinion on Airflow:

It's too complex for a single team to run, and there are far better tools out there for scheduling. Airflow only makes sense when you need complex logic around when to run jobs, how and when to backfill, and complex dependency trees. Otherwise, you are much better off with something like AWS Step Functions.
If I put each select statement in its own Airflow task, I get the same lineage dbt gives me, except I can see it and administer it alongside all my other E- and L-type tasks.

Also, I can write my T in plain ol' SQL (granted, with some Jinja) instead of this dbt-QL that I can't copy and paste into my database console or share with a non-dbt user.

So, folks who have adopted dbt: what am I missing by being a fuddy-duddy?
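(For concreteness, here's roughly what the task-per-statement version looks like -- a minimal sketch assuming Airflow 2.x with the Postgres provider installed and a connection named "warehouse"; the table and connection names are made up:)

    # Each statement is its own task; the >> dependency is the lineage.
    # Assumes Airflow 2.x, the Postgres provider, and a "warehouse"
    # connection (illustrative names only).
    from datetime import datetime

    from airflow import DAG
    from airflow.providers.postgres.operators.postgres import PostgresOperator

    with DAG(
        dag_id="transform_orders",
        start_date=datetime(2022, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        stg_orders = PostgresOperator(
            task_id="stg_orders",
            postgres_conn_id="warehouse",
            sql="""
                DROP TABLE IF EXISTS stg_orders;
                CREATE TABLE stg_orders AS SELECT * FROM raw_orders;
            """,
        )

        fct_orders = PostgresOperator(
            task_id="fct_orders",
            postgres_conn_id="warehouse",
            sql="""
                DROP TABLE IF EXISTS fct_orders;
                CREATE TABLE fct_orders AS
                SELECT order_id, SUM(amount) AS total
                FROM stg_orders GROUP BY order_id;
            """,
        )

        stg_orders >> fct_orders

The SQL in each task is plain SQL I can paste into a console, and the stg_orders >> fct_orders edge shows up in the graph view as the lineage.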
I love Airflow. Plenty of data businesses I've built are nothing more than one DAG.

As for the article, I don't think we're yet at the point where a competing stack composed of individual specialized components does things better, since Airflow is more than the sum of its parts, imho.
Well written. I think Airflow is being enforced in organizations as the main orchestrator even though it's not always the right tool for the job. In addition, organizations have to enforce a microservices approach to get modular components, and managing those frameworks is a nightmare. We built Ploomber (https://github.com/ploomber/ploomber) specifically for this reason: modular components and easy deployments. It standardizes your pipelines and lets you deploy seamlessly on Airflow, Argo (Kubernetes), Kubeflow, and cloud providers.
I have used Airflow with two different organizations over the past couple of years. When we had a complex orchestration with critical pipelines and enough human-power to manage the system, it was great. Trying to deploy it for a small team with no critical pipelines has been overkill, and we recently migrated to Dagster, which is still in beta but accomplishes 90% of what Airflow does with a much smaller footprint.
You can "unbundle" Airflow into different components. What is it called when you take one thing and break it into many pieces? Distributed (<i>sometimes</i> decentralized) computing. What do you get when you take a single system and distribute/decentralize it? Complexity. And what's the best way to simplify complexity? Consolidate the complexity into one system.<p>The Circle of Computing Complexity.
I like this post, because in many ways it highlights how Airflow has helped shape the modern data stack.

As mentioned elsewhere in this thread, managing Airflow can quickly become complicated. Its flexibility means you can stretch Airflow in pretty interesting ways, especially when trying to pair it with container orchestrators like k8s.

To combat that complexity and reduce the operational burden of letting a data team create & deploy batch processing pipelines, we created https://github.com/orchest/orchest

We suspect that many standardized use cases (like reverse ETL) will start disappearing from custom batch pipelines. But there's a long tail of data processing tasks for which having the freedom to invoke your language of choice has significant advantages. Not to mention stimulating innovative ideas (why not use Julia for one of your processing steps?).
Tools like prefect.io, IMHO, are just this: a 'modular' Airflow where you pick and choose what you use, whether that's just the DAG with no GUI, the workflow GUI, scheduling, or runners of all types from local to k8s.
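For example, the DAG-only mode is just a Python script -- a minimal sketch against the Prefect 1.x API, with made-up task names:

    # A minimal sketch against the Prefect 1.x API. flow.run() executes
    # locally with no server, GUI, or scheduler; registering the flow
    # with a backend is what adds those pieces on top.
    from prefect import task, Flow

    @task
    def extract():
        return [1, 2, 3]

    @task
    def transform(rows):
        return [r * 10 for r in rows]

    @task
    def load(rows):
        print(f"loaded {len(rows)} rows")

    with Flow("etl") as flow:
        load(transform(extract()))

    if __name__ == "__main__":
        flow.run()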
It's OK, but it seems a bit too complex for what it does. It was pretty janky running locally (it pegged the CPU), and now that we have it in MWAA we've got several support issues open with AWS for unkillable task instances and scheduler problems.
I found Airflow to be extremely buggy when I started deploying it for use with a small/medium-sized team. Did anybody else have a similar impression, or has it gotten better with later versions?
This is fine and will allow Airflow to focus on its core functionality of being a distributed job scheduler.

FWIW, the last time I looked at Airflow I thought the schedule+task model could be made tighter, as there were numerous ways to end up in inconsistent states. For example, changing the schedule after tasks had already been run would let you rerun jobs (in the past) at dates that were never scheduled in the first place.
High-resolution version of the diagram, if anyone is interested:

https://drive.google.com/file/d/1btZ0yck9SdgsUdNom0WXgHcSQvOJdcGR/view
Funny enough, this post mirrors quite a bit of our thinking over at Dagster! https://dagster.io/blog/rebundling-the-data-platform
I had a pretty terrible experience doing devops to automate an Airflow setup in 2020. This was before 2.0; I assume a lot of the bugs and issues have since been at least partially addressed.

My main gripes:

- The out-of-the-box configuration is not something you should use in production. It's basically using Python multiprocessing (yikes) and SQLite like you would on a developer machine. Instead, you'll be using dedicated workers running on different machines with either a database or Redis in between.

- Basically the problem is that Python is single-threaded (the infamous GIL) and has synchronous I/O. That kind of sucks when you are building something that ought to be asynchronous and running on multiple threads, cores, CPUs, and machines. It's not a great language for that kind of job. Mostly, in production it acts as a facade for stuff that is much better at such things (Kubernetes, YARN, etc.).

- Most of the documentation is intended for people doing stuff on their laptops, not for people trying to actually run this in a responsible way on actual servers. In our case that meant referring to third-party git repositories with misc Terraform, AWS, etc. setup to figure out what configuration was needed to run it in a more responsible way.

- Python developers don't seem to grasp that installing a lot of Python dependencies on a production server is not a very desirable thing. Doing that sucks, to put it mildly. Virtual environments help. Either way, it complicates deployment of new DAGs to production, which severely limits what you should be packaging up as a DAG versus what you should be packaging up with e.g. Docker.

- What that really means is that you should be packaging up most of your jobs with e.g. Docker. Airflow has a Docker runner and a Kubernetes runner. I found using them a bit buggy, but we managed to patch our way around it.

- Speaking of Docker, at the time there was no well-supported Dockerized setup for Airflow. We found multiple unsupported bits of Kubernetes configuration by third parties, though. That stuff looked complicated. I quickly checked, and at least they now provide a docker-compose for a setup with PostgreSQL and Redis, so that's an improvement.

- The UI was actually worse than Jenkins', and that's a bit dated to say the least. Very web 1.0. I found myself hitting F5 a lot to make it stop lying about the state of my DAGs. At least Jenkins had auto-reload. I assume somebody might have fixed that by now, but the whole thing was pretty awful in terms of UX.

- Actual DAG programming and testing was a PITA as well. And since it is Python, you really do need to unit test DAGs before you deploy them and have them run against your production data. A small typo can really ruin your day.

We got it working in the end, but it was a lot of work. I could have gotten our jobs running with Jenkins in under a day, easily.
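On that last point, one way to catch the "small typo" class of failure before deploy is a DAG integrity test in CI -- a minimal sketch, with an arbitrary test name:

    # Minimal DAG integrity test: load every DAG file and fail the build
    # if any file has an import error or defines a DAG with no tasks.
    # Run with pytest before deploying DAGs.
    from airflow.models import DagBag

    def test_dags_import_cleanly():
        dag_bag = DagBag(include_examples=False)
        assert not dag_bag.import_errors, f"Broken DAG files: {dag_bag.import_errors}"
        for dag_id, dag in dag_bag.dags.items():
            assert dag.tasks, f"DAG {dag_id} has no tasks"

It won't catch logic errors, but it does flag broken imports before they reach the scheduler.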
Tl;dr: the second-to-last paragraph says that Airflow's unbundling is better than writing a better Airflow. The final paragraph says that dbt Cloud will become the better Airflow.