ETL Pipelines with Airflow: The Good, the Bad and the Ugly

147 points by Arimbr over 3 years ago

17 comments

0x500x79 over 3 years ago

Some people have noted that this is a very Airbyte-specific article, but I think the lessons learned are still important.

I have run Airflow as a managed service for a company that has thousands of DAGs, and one of our keys to success was splitting the compute and scheduling concepts into different components. We standardized on where our compute ran (Databricks, Spark, Lambdas, or K8s jobs) and used Airflow purely as a scheduler/orchestration tool.

Scaling your Airflow worker nodes to handle big-data-scale transformations/extractions is a pain, especially when you are trying to support customers who want to run larger and larger jobs. Splitting these concepts allowed us to prevent noisy-neighbor issues, gave Airflow as a component high reliability for all of our customers, and avoided the need for M * N operators.
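
The compute/scheduling split described in this comment can be sketched roughly as follows. This is only an illustration, not the commenter's setup: `FakeComputeBackend` and `run_remote_job` are hypothetical names, and a real deployment would call the Databricks, Spark, Lambda, or Kubernetes API instead. The point is that the orchestrator-side task only submits a job and polls its status, so no data passes through the scheduler's workers.

```python
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeComputeBackend:
    """Hypothetical stand-in for an external engine (Spark, a K8s job, a Lambda)."""

    polls_until_done: int = 2
    _polls: Dict[str, int] = field(default_factory=dict)

    def submit(self, job_name: str) -> str:
        # A real backend would start the job remotely and return its identifier.
        job_id = f"{job_name}-001"
        self._polls[job_id] = 0
        return job_id

    def status(self, job_id: str) -> str:
        # Pretend the job finishes after a fixed number of status checks.
        self._polls[job_id] += 1
        if self._polls[job_id] >= self.polls_until_done:
            return "SUCCEEDED"
        return "RUNNING"


def run_remote_job(backend, job_name, poll_seconds=0.0, max_polls=100):
    """Orchestrator-side task: submit, poll, report. The heavy work runs elsewhere."""
    job_id = backend.submit(job_name)
    for _ in range(max_polls):
        state = backend.status(job_id)
        if state != "RUNNING":
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} still running after {max_polls} polls")
```

Because the task body is this thin, worker sizing depends on how many jobs are being tracked rather than on how much data each job processes.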

yamrzou over 3 years ago

I have worked with Airflow for the past three years, but we recently adopted Dagster and I have been using it for the past 3 months. I have found it quite joyful to use and the experience has been very positive. Its main advantages compared to Airflow (IMO):

- A great UI.
- It forces you to clearly define inputs, outputs and types.
- Separation of concerns: between configuration and data, between processing and IO, and between code and deployment options.
- It allows you to define flexible DAGs which you can configure at runtime, which makes it easy to run locally or in k8s, or to switch the storage backend depending on the environment.

This blog post by the founder outlines the differences between the two in much more detail: https://dagster.io/blog/dagster-airflow

Arimbr over 3 years ago

[author of the article] My main concern about using Airflow for the EL parts is that sources and destinations are highly coupled with Airflow transfer operators (e.g. PostgresToBigQueryOperator). The community needs to provide M * N operators to cover all possible transfers. Other open-source projects like Airbyte decouple sources from destinations, so the community only needs to contribute 2 * (M + N) connectors.

Another concern about using Airflow for the T part is that you need to code the dependencies between models both in your SQL files and in your Airflow DAG. Other open-source projects like dbt create a DAG from the model dependencies in the SQL files.

So I advocate for integrating the Airflow scheduler with Airbyte and dbt.

Curious to know how others use Airflow for ETL/ELT pipelines?
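
To make the "DAG from the model dependencies in the SQL files" idea concrete, here is a minimal sketch of how such dependencies could be inferred. It is not dbt's implementation: the model names and SQL are made up, the regex only handles the simple `{{ ref('name') }}` form, and `build_dependencies` and `run_order` are hypothetical helpers. It needs Python 3.9+ for `graphlib`.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models: name -> SQL text. The {{ ref('...') }} calls follow
# dbt's convention for pointing at another model.
MODELS = {
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
    "customer_orders": """
        select c.id, count(o.id) as n_orders
        from {{ ref('stg_customers') }} c
        left join {{ ref('stg_orders') }} o on o.customer_id = c.id
        group by c.id
    """,
    "top_customers": "select * from {{ ref('customer_orders') }} where n_orders > 10",
}

# Matches {{ ref('model') }} or {{ ref("model") }} with optional whitespace.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def build_dependencies(models):
    """Map each model to the set of models it references in its SQL."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def run_order(models):
    """Return one valid execution order, with dependencies before dependents."""
    return list(TopologicalSorter(build_dependencies(models)).static_order())


if __name__ == "__main__":
    print(run_order(MODELS))
```

With this approach the SQL files are the single place where dependencies are declared, so the orchestrator does not need a second, hand-maintained copy of the graph.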

willvarfar over 3 years ago

Very solid article. Even with where it’s published, it’s jolly sensible.

I would like to jump in and say use Beam instead of DBT, but tbh that’s bad advice. What the world needs is something open source with the incremental model of Beam, a fast incremental backend (thinking HTAP storage that mixes columns and rows automagically) and the ease and maintainability of DBT. There is just this massive hole. If some combination of tools could fill it, that would be the new LAMP stack for data.

ltbarcly3 over 3 years ago

This is just advertising copy. It isn't giving unbiased advice; there is an obvious conflict of interest here.

I was hoping to learn some things to help me avoid common Airflow problems, but it only talks about database sync jobs for the most part and jumps into pitching Airbyte over and over, the last 2/3 of the article being a sales pitch, right out of a marketing class.

Hi to the author and other Airbyte employees pushing this to the frontpage! I hope you didn't have to get up too early to coordinate your voting. Make sure to give this comment a downvote so we know you are out there!

swyx over 3 years ago

> The main issue with Airflow transfer operators is that if you want to support transfers from M sources to N destinations, the community would need to code N x M Airflow operators.

I'm biased but this is a nonissue with workflow-as-code solutions like temporal.io (which Airbyte uses). N activities pulling data from sources, M activities sending data to destinations, write whatever translation layers you want in your workflows.

Links to examples: https://temporal.io/usecases#Pipelines and our community meetup where Airbyte spoke about their needs: https://www.youtube.com/watch?v=K25Bt5asd8I
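
The decoupling argument in the quote and reply above can be sketched in plain Python. This is an illustration under stated assumptions, not Temporal's or Airbyte's API: the `Record` dict format, the three toy sources, `ListDestination`, and `transfer` are all hypothetical. Each source only knows how to emit records and each destination only knows how to write them, so one generic function covers every pairing.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Shared intermediate format. Real tools define a richer protocol; a plain dict
# is an assumption made to keep this sketch small.
Record = Dict[str, object]


def postgres_source() -> Iterator[Record]:
    """Hypothetical source standing in for a database reader."""
    yield {"id": 1, "name": "ada"}
    yield {"id": 2, "name": "grace"}


def rest_api_source() -> Iterator[Record]:
    """Hypothetical source standing in for an HTTP API reader."""
    yield {"id": 3, "name": "linus"}


def csv_source() -> Iterator[Record]:
    """Hypothetical source standing in for a file reader."""
    yield {"id": 4, "name": "barbara"}


class ListDestination:
    """Toy destination that appends records to an in-memory list."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows: List[Record] = []

    def write(self, records: Iterable[Record]) -> int:
        written = 0
        for record in records:
            self.rows.append(record)
            written += 1
        return written


def transfer(source: Callable[[], Iterator[Record]], destination: ListDestination) -> int:
    """Move every record from any source to any destination; returns the row count."""
    return destination.write(source())


SOURCES = {"postgres": postgres_source, "rest_api": rest_api_source, "csv": csv_source}
DESTINATIONS = {name: ListDestination(name) for name in ("bigquery", "snowflake", "s3")}

if __name__ == "__main__":
    for src_name, source in SOURCES.items():
        for dest_name, destination in DESTINATIONS.items():
            print(src_name, "->", dest_name, transfer(source, destination), "rows")
    print("components written:", len(SOURCES) + len(DESTINATIONS))
    print("pairs covered:", len(SOURCES) * len(DESTINATIONS))
```

With three sources and three destinations, six components cover nine pairings; the gap widens as either list grows.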

mkw5053 over 3 years ago

Having always been on AWS and using Glue Spark Jobs for my jobs, I've never felt any benefit of using Airflow for orchestration over Glue Workflows. I can understand some people not wanting to deal with vendor lock-in. I'm curious what others' opinions are.

iroddis over 3 years ago

I recently started working on my own DAG execution framework, after failing to get some patches into Airflow to make the scheduling easier to reason about.

My typical use case was orchestrating DAGs with thousands of vertices, and Airflow would silently wedge itself and fail to report errors.

Daggy is just starting out, but I’m hoping it’ll become more robust and scalable as time goes on.

https://gitlab.com/iroddis/daggy

fredliu over 3 years ago

I've never used Airflow, but I've used Step Functions in AWS to achieve pretty much the same things this article describes. I wonder if anybody has used both, and what the pros and cons are between them, besides the obvious ones: Step Functions lives in AWS, so it works better within the AWS ecosystem, while Airflow is open source and service/provider agnostic.

TrealTwan over 3 years ago

The article says that "SQL is taking over Python to transform and analyze data in the modern data stack". Are other people starting to notice this as ELT becomes more popular than traditional ETL?

I haven't used Airflow before, but we use Azure Data Factory in my org to load the raw data into the data warehouse and then transform it into data models using SQL.

raj_singh2021 over 3 years ago

Airflow is a really nice solution for simple workflows and tasks. But when we use complex DAGs with too many subDAGs, the real pain starts. The real issues with Airflow are:

1. Scalability.
2. Scheduler delays.
3. Security (open-source Airflow):
   A) Airflow uses a single super role that has access to resources for all its orchestration jobs, which is a potential compliance risk.
   B) Lack of granular roles and security groups, which means relying on trust that no Airflow user mistakenly makes changes through the UI.

I feel there is some undocumented dependency between the scheduler, Celery and the web server which always causes performance issues for ETL jobs.

Also, we see more reliability issues on the platform as more workloads are added.

SPascareli13 over 3 years ago

Something I've been thinking about: like the article says, SQL is a very good way to transform data, and it recommends dbt for it, but how do you test this transformation?

I know dbt has tests, but on a superficial look they seem to be pretty trivial stuff like "check if this field is null". What about tests in which I set up scenarios and check that the end result of my transformation is what I expect? Are there any good tools for this?
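
One way to approach the scenario-style testing asked about here, sketched with nothing but Python's standard library: load a small fixture into an in-memory SQLite database, run the transformation, and compare the result with the expected rows. The `orders` table, the query, and the function names are all hypothetical, and SQLite's dialect differs from most warehouses, so this only suits transformations written in portable SQL.

```python
import sqlite3

# Hypothetical transformation under test: revenue per customer, paid orders only.
TRANSFORM_SQL = """
    select customer_id, sum(amount) as revenue
    from orders
    where status = 'paid'
    group by customer_id
    order by customer_id
"""


def run_transform(rows):
    """Load fixture rows into an in-memory database and return the transformed output."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table orders (customer_id integer, amount integer, status text)"
        )
        conn.executemany("insert into orders values (?, ?, ?)", rows)
        return conn.execute(TRANSFORM_SQL).fetchall()
    finally:
        conn.close()


def test_refunded_orders_are_excluded():
    # Scenario: customer 1 has one paid and one refunded order.
    scenario = [(1, 100, "paid"), (1, 50, "refunded"), (2, 30, "paid")]
    assert run_transform(scenario) == [(1, 100), (2, 30)]


def test_customer_with_no_paid_orders_is_absent():
    # Scenario: the only order was refunded, so no row should come out.
    scenario = [(3, 20, "refunded")]
    assert run_transform(scenario) == []
```

The functions follow the `test_` naming convention, so a runner such as pytest would pick them up; each scenario stays small enough to read alongside the expected output.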

zepmck over 3 years ago

If you have GPUs, NVTabular outperforms most of the frameworks out there: https://github.com/NVIDIA/NVTabular

41b696ef1113 over 3 years ago

Has anyone kicked the tires on Airflow, Prefect, and Dagster and care to give their thoughts? My initial foray into Airflow 1 met a lot of complexity that both Prefect and Dagster claim to minimize.

agustif over 3 years ago

I once made a simple ETL concept library for a coding challenge. I never actually published it, and it's not performant or anything, but maybe I should open source it?

slotrans over 3 years ago

Blows my mind that they recommend keeping Airflow just for the scheduler, which is basically the WORST part of an overall bad tool.

Use Jenkins. You'll be happy you did.

psychometry over 3 years ago

If your ETL code is written in Python, just use Prefect and thank me later: https://docs.prefect.io/
