I really like this!

I bootstrapped the ETL and data pipeline infrastructure at my last company with a combination of Bash, Python, and Node scripts duct-taped together. Super fragile, but effective [3]. It wasn't until about three years in (and 5x the initial revenue and volume) that it started having growing pains. Every time I evaluated solutions like Airflow [1] or Luigi [2], there was so much involved in getting them running reliably and migrating everything over that it just wasn't worth the effort [4].

This seems like a refreshingly opinionated solution that would have fit my use case perfectly.

[1] https://airflow.apache.org/

[2] https://github.com/spotify/luigi

[3] The operational complexity of real-time, distributed architectures is non-trivial. You'd be amazed how far some basic bash scripts running on cron jobs will take you.

[4] I was a one-man data management/analytics/BI team for the first two years, not a dedicated ETL resource with time to spend weeks getting a proof of concept based on Airflow or Luigi running. When I finally got our engineering team to spend some time making the data pipelines less fragile, instead of adopting one of these open-source solutions they took it as an opportunity to build a fancy scalable, distributed, asynchronous data pipeline system on ECS, AWS Lambda, DynamoDB, and NodeJS. That system never made it to production; my fragile duct-taped solution turned out to be more robust.
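To make footnote [3] concrete, here's a minimal sketch of that style of pipeline: one boring script, one crontab line. Everything in it (paths, table names, the schedule) is hypothetical, not anything from my actual setup:

    #!/usr/bin/env python3
    # nightly_etl.py -- a deliberately boring extract-transform-load script.
    # Scheduled via cron, e.g.:  0 3 * * * /usr/bin/python3 /opt/etl/nightly_etl.py
    import csv
    import sqlite3
    import sys
    from datetime import date

    SOURCE_CSV = "/data/exports/orders.csv"     # hypothetical dump from the source system
    TARGET_DB = "/data/warehouse/warehouse.db"  # hypothetical warehouse

    def main():
        rows = []
        with open(SOURCE_CSV, newline="") as f:
            for rec in csv.DictReader(f):
                # Transform: skip obviously bad records, coerce types.
                if not rec.get("order_id"):
                    continue
                rows.append((rec["order_id"], rec["customer_id"],
                             float(rec["amount"]), str(date.today())))

        con = sqlite3.connect(TARGET_DB)
        with con:  # commits on success, rolls back on error
            con.execute("""CREATE TABLE IF NOT EXISTS orders
                           (order_id TEXT PRIMARY KEY, customer_id TEXT,
                            amount REAL, loaded_on TEXT)""")
            con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
        print(f"loaded {len(rows)} rows", file=sys.stderr)  # cron emails any output

    if __name__ == "__main__":
        main()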
I'm sure this is right for someone (everyone has different requirements), but I don't really want a lighter-weight Airflow. I want an Airflow that runs and scales in the cloud, has extensive observability (monitoring, tracing), has a full API, and offers some clear way to test workflows.

I was looking into how Google's Cloud Composer, a managed Airflow service, is run. They use gcsfuse to mount a directory for logs, because Airflow insists on writing logs to local disk with no cleanup mechanism, even if you configure logs to be shipped to S3/GCS. To health-check the scheduler, they query Stackdriver Logging to see whether it has logged *anything* in the last five minutes, because the scheduler has no /healthz endpoint or other way to check its health. There is no built-in way to monitor workflows, so you can't easily do something like graph failures by workflow; email on failure is about all you get. A GUI-first app that requires local storage is not what I expect these days.
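For the curious, a rough sketch of that kind of liveness check using the google-cloud-logging client. The filter string is my assumption (I don't know Composer's actual internals), so adjust resource.type and logName to whatever your environment emits:

    # "Has the scheduler logged anything in the last five minutes?"
    from datetime import datetime, timedelta, timezone
    from google.cloud import logging

    def scheduler_alive(project: str, minutes: int = 5) -> bool:
        client = logging.Client(project=project)
        cutoff = (datetime.now(timezone.utc) - timedelta(minutes=minutes)).isoformat()
        log_filter = (
            'resource.type="cloud_composer_environment" '  # assumed resource type
            'AND logName:"airflow-scheduler" '              # assumed log name
            f'AND timestamp>="{cutoff}"'
        )
        # Any single matching entry counts as a heartbeat.
        return next(iter(client.list_entries(filter_=log_filter)), None) is not None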
What is there in the ETL space with bi-directional sync?

I don't usually run into problems where "transfer data from X to Y" is the whole job. Usually it's "there's data in CRM X and data in event system Y; merge the two, keeping X as the master source."

There's Mulesoft et al., but they seem like overkill for small deployments, as well as being stupidly expensive [1].

1. I'm sure they're good value if you're an enterprise company. But if you don't need the UI builder and only have 2-3 sources to keep in sync, they are expensive. And from my reading of the documentation, conflict resolution isn't great either.
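To make the merge half concrete, here's a minimal sketch of "X wins on conflict", with hypothetical record shapes. (True bi-directional sync also needs write-back and change tracking, which this skips.)

    # Merge CRM X and event system Y, keeping X as the master source.
    # All record shapes and field names here are hypothetical.
    def merge_records(crm_by_email: dict, events_by_email: dict) -> dict:
        merged = {}
        for email in crm_by_email.keys() | events_by_email.keys():
            crm = crm_by_email.get(email, {})
            events = events_by_email.get(email, {})
            # Y's fields first, then X's on top: on conflict, the master (X) wins.
            merged[email] = {**events, **crm}
        return merged

    crm = {"a@x.com": {"name": "Ada", "plan": "pro"}}
    events = {"a@x.com": {"plan": "free", "last_seen": "2018-04-01"},
              "b@x.com": {"last_seen": "2018-03-30"}}
    print(merge_records(crm, events))
    # {'a@x.com': {'plan': 'pro', 'last_seen': '2018-04-01', 'name': 'Ada'},
    #  'b@x.com': {'last_seen': '2018-03-30'}}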
I'm interested in hearing thoughts from people who've used digdag (https://www.digdag.io/) or pachyderm (http://www.pachyderm.io/). Pachyderm is the most interesting to me; it seems to focus on the data as well as the data processing.
Here is a conference talk presenting the framework and the ideas behind it: https://youtu.be/GdtFuOah-5c
Send a pull request to add it to https://github.com/pditommaso/awesome-pipeline/blob/master/README.md
In my experience, one of the places where every framework breaks down is when you must combine/reconcile multiple rows from multiple data sources to produce one row in a fact table.

Does there exist a "framework" that lets me do this simply?
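For reference, the kind of reconciliation I mean, as a hypothetical pandas sketch: two sources with many rows per key, reduced to one fact row per order. The schema and aggregation rules are made up for illustration:

    import pandas as pd

    payments = pd.DataFrame({"order_id": [1, 1, 2],
                             "amount":   [50.0, 25.0, 40.0]})
    shipments = pd.DataFrame({"order_id": [1, 2, 2],
                              "weight_kg": [1.2, 0.5, 0.7]})

    # One row per order_id, reconciled from both sources.
    fact_orders = (
        payments.groupby("order_id", as_index=False)
                .agg(total_paid=("amount", "sum"))
        .merge(shipments.groupby("order_id", as_index=False)
                        .agg(total_weight=("weight_kg", "sum")),
               on="order_id", how="outer")
    )
    print(fact_orders)
    #    order_id  total_paid  total_weight
    # 0         1        75.0           1.2
    # 1         2        40.0           1.2

The hard part in practice isn't the groupby, it's encoding this as a first-class step in the framework instead of as an opaque script, which is exactly where most tools give up.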
As the one who implemented Airflow at my company, I understand how overwhelming it can be, with the DAGs, Operators, Hooks, and other terminology.

This looks like a good enough mid-term alternative. However, I have a few questions (which I couldn't easily find answers to on the homepage, sorry if I skipped something):

- Do you have a way of persisting connection information? I saw an example of how to create a connection, but it isn't clear whether that piece of code has to be loaded every time you execute the ETL.

- How easy is it to implement new computation engines?

- Are there plans for a command-line interface to make it easier to execute operations?
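On the first question, my reading of the mara-db README is that connections are persisted in code: your project overrides a config function once at startup. Something like the following, though this is untested and the names are just my guess from the docs:

    # Assumed from the mara-db README: connections live in a config function
    # that your project patches once at startup. The module is imported on
    # every run, but the connection is declared once, not per pipeline.
    import mara_db.config
    import mara_db.dbs

    mara_db.config.databases = lambda: {
        'dwh': mara_db.dbs.PostgreSQLDB(host='localhost', user='etl',
                                        database='example_etl'),
    }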
...but why?

Just use Airflow.

Things I want in an ETL:

[x] works at scale

[x] simple to use

[x] not written in Python (e.g. in Go or Rust)

[x] easy to scale (e.g. in Docker)

[ ] this
Reflow [1] is also well-suited for ETL workloads. It takes a different tack: it presents a DSL with data-flow semantics and first-class integration with Docker. The result is that you don't write graphs; instead, you just write programs that, due to their semantics, can be automatically parallelized and distributed widely. All intermediate evaluations are memoized, and programs are evaluated in a fully incremental fashion.

[1] https://github.com/grailbio/reflow
That looks very interesting indeed, and I'd love to see more development in this space.

Would anyone care to explain how it differs from Airflow? I had dismissed Airflow some months ago for my use case (Windows support and the large number of dependencies were an issue at the time), but would still like to eventually migrate my ETL scripts to a solid framework.
Just thought I would add: if you want something more serious that supports larger scale and realtime/streaming and is written in a statically typed language, check out https://kylo.io/. It's built on top of Apache NiFi (which was developed by our friends at the NSA).
Why PostgreSQL only? The mara-db dependency [1] claims to support more.

[1] https://github.com/mara/mara-db
This is slightly weird... but I named my dog Mara: https://flic.kr/p/FtdRNX
What's the least verbose/boilerplate-heavy tool in your experience?

We couldn't make it leaner than this (works well in production at scale):
https://github.com/wunderlist/night-shift
If we could get rid of the Ruby in there (super useful for scripting) and fly with Python only, I'd be the happiest person on Earth.

Also, we started to go cloud-agnostic; it handles both AWS and Azure. Does anyone know of something that does AWS, Azure, _and_ Google Cloud?
The inline animations in the docs add to the casual-browsing fun.
https://github.com/mara/mara-example-project
Composable is amazing at ETL: https://composableanalytics.com

It blows things like Alteryx, NiFi, and Airflow out of the water.