I really like this!

I bootstrapped the ETL and data pipeline infrastructure at my last company with a combination of Bash, Python, and Node scripts duct-taped together. Super fragile, but effective [3]. It wasn't until about three years in (and 5x the initial revenue and volume) that it started having growing pains. Every time I evaluated solutions like Airflow [1] or Luigi [2], there was so much involved in getting them running reliably and migrating everything over that it just wasn't worth the effort [4].

This seems like a refreshingly opinionated solution that would have fit my use case perfectly.

[1] https://airflow.apache.org/

[2] https://github.com/spotify/luigi

[3] The operational complexity of real-time, distributed architectures is non-trivial. You'd be amazed how far some basic bash scripts running on cron jobs will take you.

[4] I was a one-man data management/analytics/BI team for the first two years, not a dedicated ETL resource with time to spend weeks getting a proof of concept based on Airflow or Luigi running. When I finally got our engineering team to spend some time making the data pipelines less fragile, instead of adopting one of these open-source solutions they took it as an opportunity to build a fancy scalable, distributed, asynchronous data pipeline system on ECS, AWS Lambda, DynamoDB, and NodeJS. That system never made it to production; my fragile duct-taped solution turned out to be more robust.
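To make footnote [3] concrete, here's a minimal sketch of that style of pipeline: one boring script, one crontab line. Everything in it (paths, table names, the schedule) is hypothetical, not anything from my actual setup:

    #!/usr/bin/env python3
    # nightly_etl.py -- a deliberately boring extract-transform-load script.
    # Scheduled via cron, e.g.:  0 3 * * * /usr/bin/python3 /opt/etl/nightly_etl.py
    import csv
    import sqlite3
    import sys
    from datetime import date

    SOURCE_CSV = "/data/exports/orders.csv"     # hypothetical dump from the source system
    TARGET_DB = "/data/warehouse/warehouse.db"  # hypothetical warehouse

    def main():
        rows = []
        with open(SOURCE_CSV, newline="") as f:
            for rec in csv.DictReader(f):
                # Transform: skip obviously bad records, coerce types.
                if not rec.get("order_id"):
                    continue
                rows.append((rec["order_id"], rec["customer_id"],
                             float(rec["amount"]), str(date.today())))

        con = sqlite3.connect(TARGET_DB)
        with con:  # commits on success, rolls back on error
            con.execute("""CREATE TABLE IF NOT EXISTS orders
                           (order_id TEXT PRIMARY KEY, customer_id TEXT,
                            amount REAL, loaded_on TEXT)""")
            con.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?, ?)", rows)
        print(f"loaded {len(rows)} rows", file=sys.stderr)  # cron emails any output

    if __name__ == "__main__":
        main()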
I'm sure this is right for someone (everyone has different requirements), but I don't really want a lighter-weight Airflow. I want an Airflow that runs and scales in the cloud, has extensive observability (monitoring, tracing), has a full API, and offers some clear way to test workflows.

I was looking into how Google's Cloud Composer, a managed Airflow service, is run. They use gcsfuse to mount a directory for logs, because Airflow insists on writing logs to local disk with no cleanup mechanism, even if you configure logs to be shipped to S3/GCS. To health-check the scheduler, they query Stackdriver Logging to see whether it has logged *anything* in the last five minutes, because the scheduler has no /healthz endpoint or other way to check its health. There is no built-in way to monitor workflows, so you can't easily do something like graph failures by workflow; email on failure is about all you get. A GUI-first app that requires local storage is not what I expect these days.
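For the curious, a rough sketch of that kind of liveness check using the google-cloud-logging client. The filter string is my assumption (I don't know Composer's actual internals), so adjust resource.type and logName to whatever your environment emits:

    # "Has the scheduler logged anything in the last five minutes?"
    from datetime import datetime, timedelta, timezone
    from google.cloud import logging

    def scheduler_alive(project: str, minutes: int = 5) -> bool:
        client = logging.Client(project=project)
        cutoff = (datetime.now(timezone.utc) - timedelta(minutes=minutes)).isoformat()
        log_filter = (
            'resource.type="cloud_composer_environment" '  # assumed resource type
            'AND logName:"airflow-scheduler" '              # assumed log name
            f'AND timestamp>="{cutoff}"'
        )
        # Any single matching entry counts as a heartbeat.
        return next(iter(client.list_entries(filter_=log_filter)), None) is not None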
What is there in the ETL space with bi-directional sync?

I don't usually run into problems where "transfer data from X to Y" is the whole job. Usually it's "there's data in CRM X and data in event system Y; merge the two, keeping X as the master source."

There's Mulesoft et al., but they seem like overkill for small deployments, as well as being stupidly expensive [1].

1. I'm sure they're good value if you're an enterprise company. But if you don't need the UI builder and only have 2-3 sources to keep in sync, they are expensive. And from my reading of the documentation, conflict resolution isn't great either.
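To make the merge half concrete, here's a minimal sketch of "X wins on conflict", with hypothetical record shapes. (True bi-directional sync also needs write-back and change tracking, which this skips.)

    # Merge CRM X and event system Y, keeping X as the master source.
    # All record shapes and field names here are hypothetical.
    def merge_records(crm_by_email: dict, events_by_email: dict) -> dict:
        merged = {}
        for email in crm_by_email.keys() | events_by_email.keys():
            crm = crm_by_email.get(email, {})
            events = events_by_email.get(email, {})
            # Y's fields first, then X's on top: on conflict, the master (X) wins.
            merged[email] = {**events, **crm}
        return merged

    crm = {"a@x.com": {"name": "Ada", "plan": "pro"}}
    events = {"a@x.com": {"plan": "free", "last_seen": "2018-04-01"},
              "b@x.com": {"last_seen": "2018-03-30"}}
    print(merge_records(crm, events))
    # {'a@x.com': {'plan': 'pro', 'last_seen': '2018-04-01', 'name': 'Ada'},
    #  'b@x.com': {'last_seen': '2018-03-30'}}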
I'm interested in hearing thoughts from people who've used digdag (https://www.digdag.io/) or pachyderm (http://www.pachyderm.io/). Pachyderm is the most interesting to me; it seems to focus on the data as well as the data processing.
Here is a conference talk presenting the framework and the ideas behind it: https://youtu.be/GdtFuOah-5c
Send a pull request to add it to https://github.com/pditommaso/awesome-pipeline/blob/master/README.md
In my experience, one of the places where every framework breaks down is when you must combine/reconcile multiple rows from multiple data sources to produce one row in a fact table.

Does there exist a "framework" that lets me do this simply?
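For reference, the kind of reconciliation I mean, as a hypothetical pandas sketch: two sources with many rows per key, reduced to one fact row per order. The schema and aggregation rules are made up for illustration:

    import pandas as pd

    payments = pd.DataFrame({"order_id": [1, 1, 2],
                             "amount":   [50.0, 25.0, 40.0]})
    shipments = pd.DataFrame({"order_id": [1, 2, 2],
                              "weight_kg": [1.2, 0.5, 0.7]})

    # One row per order_id, reconciled from both sources.
    fact_orders = (
        payments.groupby("order_id", as_index=False)
                .agg(total_paid=("amount", "sum"))
        .merge(shipments.groupby("order_id", as_index=False)
                        .agg(total_weight=("weight_kg", "sum")),
               on="order_id", how="outer")
    )
    print(fact_orders)
    #    order_id  total_paid  total_weight
    # 0         1        75.0           1.2
    # 1         2        40.0           1.2

The hard part in practice isn't the groupby, it's encoding this as a first-class step in the framework instead of as an opaque script, which is exactly where most tools give up.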
As the one who implemented Airflow at my company, I understand how overwhelming it can be, with the DAGs, Operators, Hooks, and other terminology.

This looks like a good enough mid-term alternative. However, I have a few questions (which I couldn't easily find answers to on the homepage, sorry if I skipped something):

- Do you have a way of persisting connection information? I saw an example of how to create a connection, but it isn't clear whether that piece of code has to be loaded every time you execute the ETL.

- How easy is it to implement new computation engines?

- Are there plans for a command-line interface to make it easier to execute operations?
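On the first question, my reading of the mara-db README is that connections are persisted in code: your project overrides a config function once at startup. Something like the following, though this is untested and the names are just my guess from the docs:

    # Assumed from the mara-db README: connections live in a config function
    # that your project patches once at startup. The module is imported on
    # every run, but the connection is declared once, not per pipeline.
    import mara_db.config
    import mara_db.dbs

    mara_db.config.databases = lambda: {
        'dwh': mara_db.dbs.PostgreSQLDB(host='localhost', user='etl',
                                        database='example_etl'),
    }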
...but why?

Just use Airflow.

Things I want in an ETL:

[x] works at scale

[x] simple to use

[x] not written in Python (e.g. in Go or Rust)

[x] easy to scale (e.g. in Docker)

[ ] this
Reflow [1] is also well-suited for ETL workloads. It takes a different tack: it presents a DSL with data-flow semantics and first-class integration with Docker. The result is that you don't write graphs; instead, you just write programs that, due to their semantics, can be automatically parallelized and distributed widely. All intermediate evaluations are memoized, and programs are evaluated in a fully incremental fashion.

[1] https://github.com/grailbio/reflow
That looks very interesting indeed, and I'd love to see more development in this space.

Would anyone care to explain how it differs from Airflow? I had dismissed Airflow some months ago for my use case (Windows support and the large number of dependencies were an issue at the time), but would still like to eventually migrate my ETL scripts to a solid framework.
Just thought I would add: if you want something more serious that supports larger scale and realtime/streaming and is written in a statically typed language, check out https://kylo.io/. It's built on top of Apache NiFi (which was developed by our friends at the NSA).
Why PostgreSQL only? The mara-db dependency [1] claims to support more.

[1] https://github.com/mara/mara-db
This is slightly weird... but I named my dog Mara: https://flic.kr/p/FtdRNX
What's the least verbose/boilerplate-heavy tool in your experience?

We couldn't make it leaner than this (works well in production at scale):
https://github.com/wunderlist/night-shift
If we could get rid of the Ruby in there (super useful for scripting) and fly with Python only, I'd be the happiest person on Earth.

Also, we started to go cloud-agnostic; it handles both AWS and Azure. Does anyone know of something that does AWS, Azure, _and_ Google Cloud?
The inline animations in the docs add to the casual-browsing fun.
https://github.com/mara/mara-example-project
Composable is amazing at ETL: https://composableanalytics.com

It blows things like Alteryx, NiFi, and Airflow out of the water.